Multi-modal Information Extraction from Unstructured Data on HPC Systems
The UIMA-HPC project has been initiated to enable data mining applications to make efficient use of high-performance computing resources. In the first phase, the focus will be on the bio-pharmacological area, where the PubMed database alone holds more than 20 million entries. Researchers in this field need to find answers to questions such as the following: For a given base structure, are there any structure variants already mentioned in the literature, and if so, are there any indications of their effects? Are structure variants protected by third-party rights or are they freely available? These questions cannot be answered by keyword searches alone. The information has to be made available to researchers in a compact, structured, and timely manner.
The project will develop fast and efficient procedures to extract knowledge from unstructured data from all kinds of sources, such as texts, graphics, tables, diagrams, captions, and blogs. The targeted system will embed UIMA (Unstructured Information Management Architecture), the de facto standard for information extraction, into an HPC framework based on UNICORE, thus enabling a new class of applications.
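The general idea of running a chain of extraction steps over many documents in parallel can be illustrated with a minimal Python sketch. It does not use the UIMA or UNICORE APIs; the annotator functions, the regular expressions, and the document format are hypothetical, and a local thread pool merely stands in for distributing work across HPC resources.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical annotators: each takes plain text and returns a list of
# (annotation_type, matched_text) tuples. The patterns are toy examples.
def annotate_compounds(text):
    return [("compound", m.group()) for m in re.finditer(r"\b[A-Z][a-z]+ine\b", text)]

def annotate_patents(text):
    return [("patent", m.group()) for m in re.finditer(r"\bUS\d{7}\b", text)]

# The pipeline is an ordered list of annotators applied to every document.
PIPELINE = [annotate_compounds, annotate_patents]

def analyze(document):
    """Run all annotators on one (doc_id, text) pair."""
    doc_id, text = document
    annotations = []
    for annotator in PIPELINE:
        annotations.extend(annotator(text))
    return doc_id, annotations

def analyze_collection(documents, workers=4):
    """Analyze documents concurrently; returns {doc_id: annotations}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyze, documents))

if __name__ == "__main__":
    docs = [
        ("d1", "Caffeine analogues are covered by US1234567."),
        ("d2", "No hits here."),
    ]
    for doc_id, found in analyze_collection(docs).items():
        print(doc_id, found)
```

Because each document is processed independently, the same structure would in principle allow documents to be spread across many compute nodes rather than threads.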
The project is funded in part by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung, BMBF) under contract 01IH11012A-D.
The grant period runs from April 2011 to March 2014.
More detailed information about the project is available at the project's homepage.