
A Structure for Unstructured Data Search


Abstract—The Unstructured Information Management Architecture is a software development framework developed at IBM to help realize the value of unstructured data search. IBM made UIMA open source in mid-2005 to encourage development of domain analytics. In early December 2006, IBM joined some of the technology industry's other leading governmental and academic sector players to form the UIMA Technical Committee at the Organization for the Advancement of Structured Information Standards. The UIMA framework was transferred to the Apache Foundation's incubator concurrent with the formation of the OASIS committee.

An often-repeated technology industry truism holds that 80 percent of enterprise data resides in unstructured formats such as text files, email, video documents, and audio samples. It's rare, however, to find a concrete example of the need to access that data. A sterling case showed up on 1 December 2006, when changes in the US Federal Rules of Civil Procedure went into effect. These arcane laws are usually an intellectual parsing ground for attorneys, but this particular set of amendments addressed the rules of legal discovery for electronic data. Discovery rules cover information-sharing protocols between court adversaries, and they suggest how lucrative the market for advanced technologies to search unstructured data is becoming.

"Expectations are growing," says David Ferrucci, senior manager of the semantics analysis and integration group at IBM's Thomas J. Watson Research Center, "and the real needs for exploiting this information in and out of the enterprise are clearly growing." Nevertheless, he says, a challenge of advancing the technology is that "its value isn't always obvious."

A development framework

Ferrucci is the lead architect of the Unstructured Information Management Architecture, a software development framework developed at IBM to help realize this value. IBM made UIMA open source in mid-2005, posting it to SourceForge. Then, in early December 2006, IBM joined with some of the technology industry's other leading governmental and academic sector players to form the UIMA Technical Committee at the Organization for the Advancement of Structured Information Standards. The UIMA framework was transferred to the Apache Foundation's incubator concurrent with the formation of the OASIS committee.

IBM does have some competition within the realm of unstructured data search, but Ferrucci says the idea behind open sourcing UIMA was to offer the architecture to a wide variety of vertical industries, hoping developers with domain expertise will be able to write the appropriate content analytic components.

"There is no way one company is going to create all these analytics," Ferrucci says. "The best way to get the business going is by accelerating that process with a piece of middleware and to get people talking about the technology and thinking in the same terms so we can start to make a business out of it."

How it works

Essentially, the UIMA framework eliminates the need to manually create metadata for unstructured data types. Developers have often cited the vagaries behind creating effective metadata as a big stumbling block to increasing multimedia search technology's efficiency.

According to a paper co-authored by Ferrucci and his colleague Adam Lally, lead software engineer for UIMA at IBM, we can think of any application capable of performing unstructured information management as comprising two phases:

  • an analysis phase that might include tokenization and semantic class detection, and
  • a delivery phase that employs a semantic search engine to let users search for analyzed documents that might contain some Boolean combination of tokens, entities, and relationships.

Search results are usually delivered as structured data that the application has derived from the unstructured input.
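
As a rough illustration of the two phases, the following Python sketch tokenizes documents and tags semantic classes, then answers a Boolean query over the results. It does not use the UIMA API; the names `GAZETTEER`, `AnalyzedDocument`, `analyze`, and `search` are hypothetical, and the dictionary lookup stands in for real semantic class detection.

```python
import re
from dataclasses import dataclass, field

# Hypothetical gazetteer: maps a lowercase token to a semantic class.
GAZETTEER = {"ibm": "Organization", "atlanta": "Location"}


@dataclass
class AnalyzedDocument:
    doc_id: str
    tokens: set = field(default_factory=set)
    entities: set = field(default_factory=set)  # semantic classes found


def analyze(doc_id, text):
    """Analysis phase: tokenize, then detect semantic classes by lookup."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    entities = {GAZETTEER[t] for t in tokens if t in GAZETTEER}
    return AnalyzedDocument(doc_id, tokens, entities)


def search(docs, all_tokens=(), any_entities=()):
    """Delivery phase: require every token (AND) and, if given, any entity class (OR)."""
    hits = []
    for d in docs:
        if not set(all_tokens) <= d.tokens:
            continue
        if any_entities and not (set(any_entities) & d.entities):
            continue
        hits.append(d.doc_id)
    return hits


docs = [
    analyze("d1", "IBM released the framework as open source."),
    analyze("d2", "A discovery firm in Atlanta evaluated open source tools."),
]
print(search(docs, all_tokens=["open"], any_entities=["Location"]))
```

The query here returns document identifiers, which is the structured result the application derives from the unstructured input.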

An application that manages unstructured information can analyze both single documents and collections. In document-level analysis, the fundamental processing component is the analysis engine, which houses the core analysis algorithms, or annotators. The engine's input data is the document being analyzed, and the analysis product is the metadata describing relevant portions of the document. Analysis engines can contain one or more annotators, and engines that perform similar tasks (text analysis, for example) use a common interface. So, applications can reuse or build further on them, thereby reducing or eliminating redundant development effort.
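
The relationship between an analysis engine and its annotators can be sketched in a few lines of Python. This is an illustration of the idea only, not the UIMA interfaces: `Annotation`, `Annotator`, `RegexAnnotator`, and `AnalysisEngine` are hypothetical names, and regular expressions stand in for real analysis algorithms.

```python
import re
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Annotation:
    type: str   # label for the span, e.g. "Token" or "Year"
    begin: int  # character offset where the span starts
    end: int    # character offset just past the span


class Annotator(Protocol):
    """Common interface: every annotator maps text to annotations."""

    def annotate(self, text: str) -> List[Annotation]: ...


class RegexAnnotator:
    """Hypothetical annotator that labels every match of a pattern."""

    def __init__(self, type_name, pattern):
        self.type_name = type_name
        self.pattern = re.compile(pattern)

    def annotate(self, text):
        return [Annotation(self.type_name, m.start(), m.end())
                for m in self.pattern.finditer(text)]


class AnalysisEngine:
    """Runs one or more annotators over a document and collects the metadata."""

    def __init__(self, annotators):
        self.annotators = list(annotators)

    def process(self, text):
        result = []
        for annotator in self.annotators:
            result.extend(annotator.annotate(text))
        return sorted(result, key=lambda a: (a.begin, a.end, a.type))


text = "UIMA opened in 2005"
engine = AnalysisEngine([
    RegexAnnotator("Token", r"\w+"),
    RegexAnnotator("Year", r"\b(?:19|20)\d{2}\b"),
])
annotations = engine.process(text)
```

Because both annotators satisfy the same interface, the engine can hold any mix of them, which is the property that makes reuse across applications possible.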

Collection-level analysis can include document-level analysis, but the results represent inferences made over an entire collection. Examples include glossaries, dictionaries, databases, search indexes, and ontologies.
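
A search index is perhaps the simplest of these collection-level results to picture. The sketch below builds an inverted index in plain Python; the function names and the sample collection are made up for illustration, and the tokenizer is a deliberately naive stand-in for document-level analysis.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Document-level step: split a document into lowercase tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(collection):
    """Collection-level step: map each token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)


# Hypothetical three-document collection.
collection = {
    "d1": "Annotators produce metadata.",
    "d2": "Search indexes store metadata for retrieval.",
    "d3": "Ontologies describe concepts.",
}
index = build_index(collection)
print(sorted(index["metadata"]))
```

No single document determines what the index says about "metadata"; the entry exists only as an inference over the whole collection.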

Ferrucci says UIMA-enabled applications will be able to address a wide range of precision requirements in their results. "There are a lot of really good capabilities out there," he says, referring to technologies for generating metadata automatically, but building UIMA-compliant applications entails layers of specialized development expertise to support search precision. "What is the tolerance for the precision and recall of those analytics in your application? If you take a common Web search application, the tolerance for bad precision and low recall is very high. On the other end of the spectrum, if you take a transactional database system that holds your Social Security number and how much you get in your paycheck every week, the tolerance for mistakes there is very, very low.

"So, when you look at this automatic metadata generation technology, in some cases it's extraordinarily good, in some cases comparable to human-generated metadata. But it's not perfect. It depends on the domain, how sophisticated a particular analytic is—it depends on a lot of dimensions. The bottom line is, if you have an application that can tolerate wherever you are in that multidimensional space, you can build very effective applications based on automated technology that you couldn't dream of doing if you have to do it manually."

Building the community

One of the UIMA community's growing assets is a component repository at Carnegie Mellon University. The repository consists of 32 analysis engines and two Common Analysis System consumers; the CAS consumers are the ultimate stage of a collection-processing engine.

Eric Nyberg, an associate professor of computer science at CMU's Language Technologies Institute, and secretary of the OASIS UIMA committee, says he would like to see the UIMA repository grow to include resources such as UIMA development best practices and links to development tools such as UIMA-capable Eclipse plug-ins.

"To give UIMA credit, this is an example of an architecture that is actually gaining traction," Nyberg says. "It's dramatically more successful than past architectures addressing the same concept."

Nyberg says developers might turn to the CMU site to discuss aspects of UIMA development such as the best way to map I/O requirements from one UIMA-compliant system to another when certain parameters in each are slightly different. For example, perhaps one system refers to descriptions of geographic data as "location," and another system uses the term "geoloc." By using the forum to work these disparities out, Nyberg says developers could combine and reuse UIMA components even if they'd been designed for completely different systems.
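
The kind of disparity Nyberg describes could, in the simplest case, be handled with an alias table applied before components are combined. The Python sketch below assumes annotations are plain dictionaries with a "type" field; `TYPE_ALIASES` and `normalize_types` are hypothetical names, not part of UIMA.

```python
# Hypothetical alias table: each system-specific type name maps to one shared name.
TYPE_ALIASES = {"geoloc": "location"}


def normalize_types(annotations, aliases=TYPE_ALIASES):
    """Return copies of the annotations with type names mapped to shared names."""
    return [{**a, "type": aliases.get(a["type"], a["type"])} for a in annotations]


# Output from two systems that name the same concept differently.
system_a = [{"type": "location", "text": "Atlanta"}]
system_b = [{"type": "geoloc", "text": "Sheffield"}]

merged = normalize_types(system_a + system_b)
print(merged)
```

Real type systems differ in more than names, so a table like this is only a starting point for the discussion the forum is meant to host.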

Although UIMA is platform independent, the UIMA developers' guidelines include setup via the open source Eclipse platform. IBM's Lally, who is also one of the Apache incubator's "committers" (developers with write access to the code repository), says UIMA is designed to complement Eclipse.

"The Apache implementation doesn't require anyone to use Eclipse, but we've built a lot of our development tooling around it," he says. "We provide plug-ins to Eclipse that are specific to UIMA and think it's a good platform for integrating different types of GUI components, as well as plug-ins that make a single coherent UIMA development environment that can integrate different tooling."

Ferrucci says the UIMA team is also working with the developers of open source applications meant to analyze unstructured data. One example of this type of application is the General Architecture for Text Engineering, developed by researchers at the University of Sheffield in the UK.

"The focus for GATE has been on higher level strong NLP tooling for creating annotators and doing experiments and testing," Ferrucci says. "UIMA focuses more on well-crafted, scalable, multiplatform support, on the nuts and bolts of the plumbing.

"There's already an interoperability layer, which allows you to take a GATE component and run it within a UIMA application, but both runtimes are required for it to work. We may look at some of the GATE tooling producing UIMA-compliant components so there's a better integration of the interfaces and both runtimes would not be required. The Sheffield team are in fact members of the UIMA technical committee at OASIS, so that might be an area of discussion."

Real-world deployment?

Some IBM products already deploy UIMA capabilities, and a few small companies have emerged with UIMA-compliant analysis applications. However, the UIMA effort is still new enough that much of the analysis and recovery sector is just discovering its principles.

"If they could productize it a little bit more I would love to get it," says Chuck Bokath, vice president of software development at eMag Solutions, an Atlanta-based company specializing in data discovery and restoration. "The tool we use right now can tear metadata apart from 400 document types. A lot of open source tools can do Word, Excel, and PDF, but a lot of other things have to be broken into a structured format before you can analyze them, and open source tooling isn't always up to the task."

Bokath says he doesn't use GATE, for example, because it isn't efficient enough.


Bokath says another potential obstacle for UIMA adoption might be its use of C++ and Java; Bokath's team prefers to use C for its performance premium, though he says he is encouraged by the effort the UIMA team is putting into improving and publicizing it.

"When I first looked at it I said, 'That's cool, that's kind of what we're doing.' It's a great idea and I like their start on it."

Related Links

  • DS Online's Web Systems Community
  • "Mining Text with Pimiento," IEEE Internet Computing
  • "Boosting the Feature Space: Text Classification for Unstructured Data on the Web," Proc. 6th Int'l Conf. Data Mining

Cite this article:

Greg Goth, "A Structure for Unstructured Data Search", IEEE Distributed Systems Online, vol. 8, no. 1, 2007, art. no. 0701-o1003.
