Internet Search Takes a Semantic Turn
Search has become one of the Internet's most important technologies, as evidenced by the rise of Google as one of the world's most important technology companies.
Google built its success on its PageRank algorithm, which combines keyword search with a technique for judging the relevance of candidate Web pages based on the number of links that point to them.
Despite this success, keyword search has many limits because it cannot process the meaning of queries and Web pages. Because words can be ambiguous, traditional searches generally return large numbers of pages, many of them irrelevant to the query. Furthermore, keyword-based approaches let search-optimization techniques artificially push hackers' or other irrelevant pages to the top of search results.
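To make the link-counting idea concrete, here is a minimal sketch of a textbook-style PageRank iteration. It is an illustration only, not Google's production ranking: the four-page link graph, the damping factor of 0.85, and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. links maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b, and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(LINKS)
top_page = max(scores, key=scores.get)
```

In this toy graph, page `c` ends up ranked highest because three pages point to it, and `a` outranks `b` and `d` because the highly ranked `c` links to it.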
Semantic search would solve many of these problems, said Kathleen Dahlgren, chief technology officer of Cognition Technologies, a vendor of semantic-based text-processing technology. Semantic-search tools use document tags and topic-based indexes of material to create a model that represents what various pieces of content mean. This lets a search engine more precisely respond to a query by disambiguating the multiple meanings of words in a document and determining how they relate to one another within a sentence. Semantic search could be the Semantic Web's killer app, said Peter Mika, a researcher and data architect at Yahoo! Research in Barcelona.
There are now several types of semantic search approaches. However, despite the technology's promise, it must clear numerous hurdles before it can be widely adopted.
Pushing Semantic Search
Interest in semantic search is growing because it promises to make finding relevant information online easier, quicker, and more effective.
Traditional keyword search — in which applications look for instances of query keywords in online documents — rose to prominence because it is efficient and good at simple searches.
In the early days of search, in the mid-1990s, Yahoo! became popular with a directory in which human experts assigned websites to topic-based categories. When there were far fewer websites than there are now, this approach was accurate and efficient. However, it scaled poorly as the number of websites grew.
Google's approach, on the other hand, could work well with a far larger number of documents. But a simple Google search can still yield thousands or millions of results, including many irrelevant to the original query. This occurs because many words have multiple meanings. For example, a search for "tank" could return web pages on water containers or military vehicles.
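The "tank" problem can be reproduced with a few lines of code. The following is a minimal sketch of plain keyword matching over an inverted index; the three pages and their text are invented for illustration, and real engines add ranking, stemming, and much more.

```python
from collections import defaultdict

# Hypothetical mini-corpus; page names and text are invented for illustration.
PAGES = {
    "water-storage": "a steel tank stores rain water for the garden",
    "armored-vehicles": "the army tank crossed the desert",
    "fish-care": "clean the fish tank every week",
}


def build_keyword_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def keyword_search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_keyword_index(PAGES)
all_tank_pages = keyword_search(index, "tank")    # every sense of "tank"
narrowed = keyword_search(index, "tank water")    # a more precise query
```

The one-word query matches all three pages regardless of which kind of tank the user meant. Adding a second keyword narrows the results, but only because the text happens to contain that exact word.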
More precise queries could help address this problem, but the lack of semantics can still make it difficult for even Google's search engine to return accurate responses with few irrelevant results, said Cognition's Dahlgren.
Under the Hood
Semantic search is most effective for complex queries, such as those involved in medical or scientific research, and legal discovery. The concepts behind semantic search were formulated years ago.
In the 1980s, Xerox began experimenting with various automated natural-language-processing (NLP) technologies that could parse sentences and create a semantic representation of them.
In 1998, Tim Berners-Lee laid out a plan to create a Semantic Web by storing information about the meaning of documents alongside online content.
Semantic search operates on the meaning of words rather than just how they read as text. Basically, either a human or a computer creates a semantic model of a document based on the Resource Description Framework (RDF) and the Web Ontology Language (OWL), or on proprietary formats.
RDF is a World Wide Web Consortium (W3C) standard, providing an XML-based framework for metadata description and interchange. OWL, a markup language for publishing and sharing data online using ontologies, represents the meanings of terms and the relationships between them within sentences in a way that software can process.
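RDF's underlying data model expresses each statement as a subject-predicate-object triple. The sketch below imitates that model with plain Python tuples rather than a real RDF library; the `example.org` URIs, the `type` and `label` predicates, and the helper functions are all made up for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# All URIs under example.org are hypothetical.
EX = "http://example.org/"

triples = {
    (EX + "M1_Abrams", EX + "type", EX + "MilitaryVehicle"),
    (EX + "M1_Abrams", EX + "label", "tank"),
    (EX + "RainBarrel", EX + "type", EX + "WaterContainer"),
    (EX + "RainBarrel", EX + "label", "tank"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def things_labeled(store, label, of_type):
    """Find subjects that carry a given label and belong to a given type."""
    labeled = {s for s, _, _ in match(store, predicate=EX + "label", obj=label)}
    typed = {s for s, _, _ in match(store, predicate=EX + "type", obj=of_type)}
    return sorted(labeled & typed)


military_tanks = things_labeled(triples, "tank", EX + "MilitaryVehicle")
```

Both resources are labeled "tank", but the type statement lets software tell them apart, which is the kind of machine-processable meaning the standards aim to provide.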
Semantic tags create machine-readable code about Web page elements. For example, microformats — a Web-based approach to semantic markup — use HTML tags to label metadata about items on a Web page, such as whether a snippet of text refers to a displayed product's price, size, or color.
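As a rough illustration of how software might read such labels, the sketch below pulls labeled fields out of an HTML snippet using Python's standard `html.parser` module. The snippet and its class names (`name`, `price`, `size`, `color`) are assumptions made for the example and do not follow any particular published microformat vocabulary.

```python
from html.parser import HTMLParser

# Hypothetical product markup; the class names are illustrative only.
SNIPPET = """
<div class="product">
  <span class="name">Rain barrel</span>
  <span class="price">$89.00</span>
  <span class="size">200 L</span>
  <span class="color">green</span>
</div>
"""


class ClassLabelExtractor(HTMLParser):
    """Collect the text of elements whose class is one of the wanted labels."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # label of the element currently being read
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self.current = name
                self.fields[name] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field,
        # so nested markup inside a labeled element is not supported.
        self.current = None


extractor = ClassLabelExtractor({"name", "price", "size", "color"})
extractor.feed(SNIPPET)
product = {label: text.strip() for label, text in extractor.fields.items()}
```

With the labels in place, a program can tell that "$89.00" is a price and "green" is a color without having to guess from the surrounding text.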
A document's semantic model is stored as a semantic index. This specially crafted index of all documents that a search engine has processed includes the context and meaning of words in the documents.
Interfaces. Search engines use an interface that lets users query the index to find either a relevant document or a relevant part of a document.
The simplest interfaces let users narrow their searches semantically by checking boxes that identify categories of concepts that must be included or excluded in the search. For example, a medical-condition search for doctors would let them specify the symptoms a patient has and exclude those the patient doesn't have.
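A checkbox interface of this kind boils down to set filtering over concept tags. Here is a minimal sketch; the conditions and symptom tags are invented purely to show the mechanics and are not medical guidance.

```python
# Hypothetical concept tags per record; illustrative only.
CONDITIONS = {
    "influenza": {"fever", "cough", "muscle aches"},
    "common cold": {"cough", "sneezing"},
    "strep throat": {"fever", "sore throat"},
}


def filter_by_concepts(records, must_include, must_exclude):
    """Keep records that have every included concept and no excluded one."""
    include, exclude = set(must_include), set(must_exclude)
    return sorted(
        name for name, concepts in records.items()
        if include <= concepts and not (exclude & concepts)
    )


# Checked boxes: "fever" present, "cough" absent.
matches = filter_by_concepts(
    CONDITIONS, must_include={"fever"}, must_exclude={"cough"}
)
```

Here only "strep throat" survives, because "influenza" is ruled out by the excluded symptom.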
Other engines let users enter search terms that the system parses via NLP techniques to determine their meaning. The system then uses this information to search an index generated by parsing a collection of documents to find those relevant to the query.
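The sketch below shows the general shape of this process: both documents and queries are mapped to word senses, and the index is keyed by sense rather than by raw keyword. The cue-word lookup is a deliberately crude stand-in for real NLP parsing, and the sense inventory and documents are invented for illustration.

```python
from collections import defaultdict

# Invented sense inventory: each sense of an ambiguous word lists cue words.
SENSES = {
    "tank": {
        "container": {"water", "fuel", "fish", "storage", "rain"},
        "vehicle": {"army", "armor", "military", "battle", "gun"},
    }
}


def disambiguate(word, context_words):
    """Pick the sense whose cue words overlap most with the context.
    Returns None for unknown words or when no cue word matches."""
    best_sense, best_overlap = None, 0
    for sense, cues in SENSES.get(word, {}).items():
        overlap = len(cues & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


def build_semantic_index(docs):
    """Index documents under (word, sense) pairs instead of bare words."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        context = set(words)
        for word in words:
            sense = disambiguate(word, context)
            if sense is not None:
                index[(word, sense)].add(doc_id)
    return index


def semantic_search(index, query):
    """Disambiguate the query the same way, then look up matching senses."""
    words = query.lower().split()
    context = set(words)
    hits = set()
    for word in words:
        sense = disambiguate(word, context)
        if sense is not None:
            hits |= index.get((word, sense), set())
    return hits


DOCS = {
    "doc1": "the army tank fired its gun",
    "doc2": "fill the water tank before summer",
    "doc3": "a fish tank needs a filter",
}
sem_index = build_semantic_index(DOCS)
vehicle_hits = semantic_search(sem_index, "military tank")
container_hits = semantic_search(sem_index, "rain water tank")
```

Note that the query "military tank" finds the first document even though the word "military" never appears in it, because the match happens at the level of meaning rather than text.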
Indexing. Some tools, such as Yahoo!'s Search Monkey and the University of Maryland, Baltimore County's Swoogle, require users to manually enter tags that describe a document so that it can be indexed. Newer tools use NLP techniques to parse documents and automatically convert text into a complex semantic network representing the relationships among concepts in the material.
According to Cognition's Dahlgren, semantic search engines generate indexes in four primary ways.
Manual tag-based systems such as Search Monkey and Swoogle generally use RDF- and OWL-generated information about documents and text within documents to create indexes. Document creators manually add tags to their Web pages or to data within documents, and RDF-compatible semantic-search engines such as Search Monkey can interpret those tags.
Statistical systems such as Autonomy use Bayesian or Latent Semantic Indexing to guess at the meanings of words in a query or document. These techniques analyze documents and statistically identify relationships between words and sets of words in documents, thereby improving semantic accuracy and indexing.
Ontology-based systems like those from Cataphora, Hakia, and Stratify organize language into an ontology. They use the ontology to automatically classify text in documents, e-mails, instant messages, and other sources into semantic categories, used to generate the index and help with subsequent searches.
Linguistically based systems made by companies such as Cognition Technologies and Expert System use linguistic rules and mathematical associations of words in a document to automatically parse the meaning of text in a document, Web page, or other source into an ontology or semantic network.
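Of the four approaches above, the statistical one is the easiest to sketch in a few lines. The example below uses a small naive Bayes classifier with add-one smoothing to guess which sense of "tank" a phrase refers to. It is one simple instance of a Bayesian technique, not a description of any vendor's product, and the four training sentences are invented.

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set: (sense label, example sentence).
TRAINING = [
    ("vehicle", "the army tank fired at the bunker"),
    ("vehicle", "a tank led the military convoy"),
    ("container", "the water tank leaked overnight"),
    ("container", "fill the fuel tank with water"),
]


def train(examples):
    """Count how often each word appears with each sense."""
    word_counts = defaultdict(Counter)  # sense -> word frequencies
    sense_counts = Counter()            # sense -> number of examples
    vocab = set()
    for sense, text in examples:
        words = text.lower().split()
        word_counts[sense].update(words)
        sense_counts[sense] += 1
        vocab.update(words)
    return word_counts, sense_counts, vocab


def classify(text, model):
    """Return the sense with the highest naive Bayes log score."""
    word_counts, sense_counts, vocab = model
    total_examples = sum(sense_counts.values())
    scores = {}
    for sense in sense_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(sense_counts[sense] / total_examples)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[sense][word] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
guess_military = classify("military tank", model)
guess_water = classify("tank of water", model)
```

The classifier never sees a definition of either sense. It infers the likely meaning purely from which words tend to occur together, which is why the source describes such systems as guessing at meaning.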
Despite its promise, semantic search faces numerous obstacles, such as the increased cost of the software and the increased processing and storage overhead.
Creating detailed semantic indexes entails up to 100 times the computational overhead of building traditional search indexes, and storing them uses up 10 times as much hard drive space, said Yaniv Golan, chief technology officer of semantic-search provider Yedda. This computational overhead slows performance.
Adding semantic tags to documents makes more work for website authors. Organizations have been trying without much success to get authors to tag documents for decades, Dahlgren said. Wider adoption probably won't happen until there are systems that tag documents on the fly, she added.
Semantic search must cope with the changing semantic landscape, marked by new words, the changing meanings of words, and changing associations among words. Semantic search systems will have to be adaptable enough to understand these changes quickly, said Golan.
Other challenges include the lack of perceived need for the technology by some potential users, user unfamiliarity with the approach, concern about it being new and untested, and lack of a proven business model.
Semantic-technology use is growing quickly, according to Lehigh University associate professor Jeff Heflin. About 4 billion pieces of data have been tagged via RDF and OWL, he added. This enables users to conduct semantic search on more documents.
In his research, Heflin, who directs Lehigh's Semantic Web and Agent Technologies Lab, is studying ways to make ontologies interoperate so that machine-based agents could generate answers to queries by synthesizing information from multiple sources. Ontologies frequently don't interoperate because authors use different categories and systems for describing their elements.
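The interoperability problem can be illustrated with a toy example in which two sources describe similar things under different category names. The sketch below bridges them with a hand-written alignment table. The sources, category names, and alignment are all invented, and this is not a description of Heflin's method.

```python
# Invented data: two sources use different category names for similar things.
SOURCE_A = [("M1 Abrams", "ArmoredVehicle"), ("Rain barrel", "Cistern")]
SOURCE_B = [("Leopard 2", "Tank_Military"), ("Fuel drum", "StorageTank")]

# Hypothetical hand-written alignment from source categories to shared concepts.
ALIGNMENT = {
    "ArmoredVehicle": "MilitaryVehicle",
    "Tank_Military": "MilitaryVehicle",
    "Cistern": "Container",
    "StorageTank": "Container",
}


def items_of_concept(concept, *sources):
    """Gather items from all sources whose category maps to the concept."""
    found = []
    for source in sources:
        for name, category in source:
            if ALIGNMENT.get(category) == concept:
                found.append(name)
    return sorted(found)


vehicles = items_of_concept("MilitaryVehicle", SOURCE_A, SOURCE_B)
```

Without the alignment table, a query for military vehicles would find nothing in either source, since neither uses that category name.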
Now, said Yahoo!'s Mika, "Fully integrated semantic search engines such as Sindice and SWSE (Semantic Web search engine) are implementing the entire process from crawling to indexing, ranking, and visualization." Developers are experimenting with interfaces that allow a deeper understanding of content by creating maps, timelines, charts, and tables, he added.
The W3C has proposed RDFa (RDF in attributes), which would add semantics to the Web via extensions to XHTML (extensible HTML) that embed rich metadata about words, their meanings, and their context within documents.
Google has implemented a rich snippets feature for its search engine that uses RDFa tags created by website authors to better index documents. The engine will use RDFa to extract the meaning of words in a document to provide more detail in its query responses.
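RDFa carries its metadata in element attributes such as `typeof`, `property`, and `content`. The following sketch reads those attributes from a fragment using Python's standard `html.parser` module. It is far from a conforming RDFa processor, and the `ex:` vocabulary and review data are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical XHTML fragment; the "ex:" vocabulary is invented.
FRAGMENT = """
<div xmlns:ex="http://example.org/vocab#" typeof="ex:Review">
  <span property="ex:itemreviewed">Rain barrel</span>
  <span property="ex:rating">4.5</span>
  <meta property="ex:reviewdate" content="2009-08-01" />
</div>
"""


class PropertyExtractor(HTMLParser):
    """Collect property values from RDFa-style attributes."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._open_property = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs and self.item_type is None:
            self.item_type = attrs["typeof"]
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # An explicit content attribute takes the place of element text.
            self.properties[prop] = attrs["content"]
        else:
            self._open_property = prop
            self.properties[prop] = ""

    def handle_data(self, data):
        if self._open_property is not None:
            self.properties[self._open_property] += data

    def handle_endtag(self, tag):
        self._open_property = None


parser = PropertyExtractor()
parser.feed(FRAGMENT)
review = {name: value.strip() for name, value in parser.properties.items()}
```

A search engine that extracts fields like these could show the rating directly in its results instead of an arbitrary text excerpt.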
Companies such as Autonomy, Cognition Technologies, and Expert System are selling semantic tools to help companies and large organizations better index their data, which could encourage wider semantic-search use.
Organizations are using semantic search to find information to help with activities such as knowledge management, intelligence about competitors, scientific and medical R&D, and self-service customer support, said Expert System CEO Brook Aker.
Semantic search has a long way to go to reach its potential, Heflin said. For example, said Mika, significant adoption of Semantic Web standards began only during the last two years.
Semantic search's adoption might be somewhat limited if the technology depends on manually, rather than automatically, generated tags, according to Aker. User expectations and the technology itself might be limited until sophisticated language-processing techniques become more widely used, he added.
One of the most significant questions for the field is how the technology will compete with or complement search engines from the major providers such as Google. Dahlgren said the major players have built an infrastructure that would be difficult to change and thus have a vested interest in maintaining traditional search approaches.
Semantic-search technology that automatically parses text will be popular for complex, specialized corporate, enterprise, or scientific uses but entails too much computational and storage overhead for general Web searches. Approaches that involve manual tagging create less overhead and thus will work better with Web searches.
In the long run, Heflin predicted, semantic search will never replace traditional search because there will always be content that is difficult to represent semantically. Thus, he said, semantic search will become a complementary technology to traditional search approaches.
Nonetheless, Mika said, "I'm hopeful that we have managed to kick-start a positive cycle."
George Lawton is a freelance technology writer based in Monte Rio, California. Contact him at firstname.lastname@example.org.