Issue No.06 - Nov.-Dec. (2011 vol.15)
Published by the IEEE Computer Society
Amit Sheth, Kno.e.sis, Wright State University
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIC.2011.157
Concern for scalability — both in computational terms and in terms of the human effort needed to develop semantic models and background knowledge — has hampered the adoption of semantic techniques and the Semantic Web. This concern is misplaced given the extensive progress the past decade has seen on standards, methods, and technologies for developing semantic models or ontologies, semantic annotations, and techniques for semantic integration, analysis, and reasoning. Such progress is complemented by myriad recent success stories that use semantics in broad-based applications such as Web search, as well as in a growing number of vertical domains. As the future of computing expands beyond cyberspace to cyber-physical-social computing, with extensive growth in social and sensor data, semantics will play an even larger and more pervasive role in exploiting larger amounts of increasingly heterogeneous and multimodal data.
Semantics can enhance a broad variety of information processing — search, integration, analysis, pattern extraction and mining, discovery, situational awareness, and question-answering. Consider search: a search system that could distinguish between "Merry Christmas" as a greeting and one of the 60 or so songs named "Merry Christmas" as cataloged in MusicBrainz (a community-created music encyclopedia; http://musicbrainz.org) would have a powerful semantic search capability. Practical solutions utilizing semantics involve using a conceptual or domain model to organize information (MusicBrainz in our example), creating metadata (or annotations) with respect to the model (indicating as an attribute/label/facet of "Merry Christmas" whether it's a greeting or a song, and which one of the 60-plus songs), and then utilizing the metadata and model for enhanced computation.
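The model-plus-metadata approach can be sketched in a few lines. The catalog entries, facet names, and `semantic_search` function below are invented for illustration; they aren't MusicBrainz's actual schema or API.

```python
# A minimal sketch of model-backed (faceted) search: each catalog entry is
# annotated against a tiny conceptual model (a type, plus type-specific
# facets), and queries can then filter on those facets.
CATALOG = [
    {"label": "Merry Christmas", "type": "greeting"},
    {"label": "Merry Christmas", "type": "song", "artist": "Mariah Carey"},
    {"label": "Merry Christmas", "type": "song", "artist": "Bing Crosby"},
    {"label": "Merry Christmas Everyone", "type": "song", "artist": "Shakin' Stevens"},
]

def semantic_search(query, **facets):
    """Match the query against labels, then narrow by facet (e.g. type='song')."""
    hits = [e for e in CATALOG if query.lower() in e["label"].lower()]
    for key, value in facets.items():
        hits = [e for e in hits if e.get(key) == value]
    return hits

# Distinguish the greeting from the songs by faceting on type.
songs = semantic_search("merry christmas", type="song")
```

The point is not the trivial filtering code but the annotation: once each item carries metadata grounded in a shared model, "which Merry Christmas?" becomes an answerable question.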
Even in the Web's early days, simple schemas and metadata were used for faceted or attribute-based search of Web-based documents and data. An example is the InfoHarness system at Bellcore, commercialized in 1995 ( http://bit.ly/InfoHarness). It supported extracting metadata from heterogeneous data and provided Mozilla-browser-based faceted search. Tom Gruber introduced the concept of ontologies in an information systems context in the early 1990s ( http://bit.ly/gruber-ont). This term has since become increasingly used for conceptual or domain models that also capture shared vocabulary and agreement, often with relevant factual knowledge. Several efforts in the mid-to-late 1990s, such as SIMS, Observer, InfoSleuth, and InfoQuilt, demonstrated ontology-based Web data integration and querying.
In the late 1990s, I realized that it was possible to design conceptual models or ontologies — not too different from schema.org descriptions today — for many domains of practical interest for Web search (politics, business, finance, sports, entertainment, and so on). At Taalee, which I founded in 1999, and through its follow-on mergers, we developed products and services (Voquette's SCORE and Semagix's Freedom) that extracted, integrated, and repurposed high-quality datasets to populate these ontologies with factual information and background knowledge. For example, these systems extracted data from MLB.com (the official site for Major League Baseball) to populate part of a baseball ontology, and used sources similar to MusicBrainz to populate the music component of an entertainment ontology.
These ontologies, represented in a Resource Description Framework (RDF)-like language, supported a semantic and faceted Web search engine, MediaAnywhere (see http://slidesha.re/sw-ib, http://bit.ly/sw-p, and http://bit.ly/sw-ic). Although the system scaled to a few hundred websites and enterprise semantic applications, it was ahead of its time; the concept had to wait for technology and market acceptability to catch up. Would this approach have scaled to the Web? I believe yes, but I couldn't have convincingly argued or demonstrated this — until now, when Yahoo, Bing, and Google are all exploiting minimal ontologies, metadata provided by content developers, and large subject (domain)-specific object and knowledge bases. Before discussing this, however, let's first review why some subscribed to the perception during the past 10 years that semantic solutions can't scale.
Ontologies and Web Search
The Web has continued to see explosive growth in the number of documents and amount of data accessible through it. We can view earlier directory-based approaches by Yahoo and DMOZ as one type of semantic approach, given that they used human-created taxonomies to manually catalog information; these approaches soon failed to scale. Search became the primary way for people to find the information they needed. Its success led many (beginning with Googlers like Larry Page and Peter Norvig) to believe that, given enough data, you can adequately, or even exclusively, extract semantics in a bottom-up manner — or that, given the Web's broad coverage, ontologies, models, and background knowledge simply aren't relevant, nor would they scale.
Some in the semantics camp, including myself, felt that only a limited form of implicit semantics is embedded in most data on the Web, and that semantics can and will scale. I've argued that we can readily apply background knowledge in developing a semantic solution (as we did with MediaAnywhere). You can find links to Norvig's post and my views on this matter at http://bit.ly/s-search.
An early, albeit unnecessary, emphasis on AI in defining the Semantic Web approach also hurt, given the lack of success in scaling AI solutions during the 1980s and '90s. If you interpret ontologies narrowly as models crafted in formal logic languages, or look only at those ontologies exquisitely crafted with care by experts, the idea that they don't scale might be valid. Fortunately, we can develop ontologies and associated background knowledge in a variety of ways, especially for Web search and browsing applications or information processing tasks that don't require complete consistency in the knowledge base. Examples include using ad hoc specifications such as schema.org and grounding concepts in the linked open data cloud; domain-specific, community-maintained resources (such as MusicBrainz for music); and domain models generated dynamically by tools such as Doozer++ ( http://bit.ly/k-doozer).
Semantics at Web scale is gaining acceptance, and our ability to support it is increasing. In the case of Web search, several significant factors are helping semantics improve it: each of the major Web search engines is creating domain-specific structured knowledge, and search engines can, and do, exploit semantics via at least three methods.
The first is creating a concept base or object base of facts (entities and relationships) one domain at a time. For example, MusicBrainz, which captures comprehensive knowledge about more than 550,000 artists and their creative works, can serve as background knowledge for the music domain. Bing's support for specific domains such as entertainment, sports, or travel is in part powered by domain-specific models and relevant background knowledge, in a way reminiscent of MediaAnywhere. This use of domain-specific knowledge will continue to expand now that Google has acquired Freebase ( www.freebase.com) and as the adoption of linked open data (a method of publishing structured data so that it can be interlinked) grows from its current tens of billions of triples (facts expressed as subject-predicate-object in RDF) and tens of domains by another order of magnitude or two.
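Such fact bases are built from triples. The following toy sketch shows how subject-predicate-object facts can be stored and pattern-matched; the `ex:` URIs are made up for illustration, not real linked-data identifiers.

```python
# RDF-style facts as (subject, predicate, object) triples, with a simple
# pattern query in which None acts as a wildcard.
triples = {
    ("ex:MariahCarey", "rdf:type", "ex:Artist"),
    ("ex:MariahCarey", "ex:recorded", "ex:MerryChristmas_song"),
    ("ex:MerryChristmas_song", "rdf:type", "ex:Song"),
    ("ex:MerryChristmas_song", "ex:title", "Merry Christmas"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the (s, p, o) pattern."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Who recorded something? Take the subject of every ex:recorded triple.
recorders = [s for (s, _, _) in match(p="ex:recorded")]
```

Real triple stores index all three positions and speak SPARQL, but the data model is exactly this.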
The second is the recent collaboration among the three major search engine players in defining schema.org, which provides schemas or conceptual models for several common domains. The third is content developers' increasing use of microdata (a simple way to embed semantic markup into HTML documents) and RDFa (for embedding rich metadata within Web documents using RDF), which search engines reward with improved search results; this in turn entices content developers to provide more metadata or annotations and to use relevant models to add semantics. Quick on search engines' heels, social media (such as Facebook's Open Graph protocol), e-commerce (BestBuy's use of the GoodRelations ontology), and a wide variety of Web businesses and services are building a growing and synergistic Semantic Web ecosystem.
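To make the microdata idea concrete, here's a hypothetical schema.org-style page fragment and a bare-bones extraction of its itemprop values using only Python's standard library; production systems would use a proper microdata or RDFa parser rather than this sketch.

```python
# A page author annotates content with itemscope/itemtype/itemprop attributes;
# a consumer can then lift typed facts out of otherwise plain HTML.
from html.parser import HTMLParser

PAGE = """
<div itemscope itemtype="http://schema.org/MusicRecording">
  <span itemprop="name">Merry Christmas</span>
  by <span itemprop="byArtist">Bing Crosby</span>
</div>
"""

class MicrodataGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.props = {}       # itemprop name -> text content
        self._current = None  # itemprop we're currently inside, if any

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self._current:
            self.props[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None

parser = MicrodataGrabber()
parser.feed(PAGE)
```

After parsing, `parser.props` maps each itemprop to its text, giving a search engine typed attributes ("this page describes a MusicRecording named Merry Christmas by Bing Crosby") rather than bare keywords.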
Search No Longer the King
In the era when semantics supposedly didn't scale, search was the king of all Web applications. But search's importance is now overrated; its best days are behind it. We're in an era of significant growth in heterogeneity (social data, mobile-device-generated data, data from sensors inside, on, and around humans, and so on) and quantity (the rate of data creation has already surpassed our ability to store it). Simply accessing data (which a search engine can index and return as a document) no longer serves our needs. We need knowledge and insights for decision making, as well as answers to our questions. Semantics plays a pivotal role in helping us build solutions to meet these requirements. Relationships are at the heart of semantics and the Semantic Web; consequently, we can transition from focusing on keywords and objects, as we did with search, to focusing on relationships and richer abstractions, including events and experiences.
The Role of Semantics in Computing's Future
Semantics is being adopted on a wide scale in various scientific and some business domains that use W3C-defined Semantic Web languages and standards. For example, in the life sciences domain, we can find nearly 300 ontologies at the BioPortal ( http://bioportal.bioontology.org). Even more impressive is the growth of structured data on the Web, called the Web of data and best showcased by the linked open data initiative ( http://linkeddata.org), which surpassed 25 billion triples last year and is tripling year over year. BioRDF, a collection of facts and knowledge in the life sciences domain from multiple sources, exceeded 5 billion triples last year. Note that the Web of data isn't simply data; it's structured, reusable information that we can use to consistently annotate or tag data, enabling better data analysis than would be feasible via bottom-up processing of unstructured data on the Web.
Semantics plays a central role in Web 3.0 and beyond, and is becoming the driving force behind the future of computing for several reasons.
Semantics for Integration
Semantics, in the sense of achieving shared understanding and meaning, comes from agreement. Consequently, it has long had a role in integrating data that is heterogeneous in syntax and structure. Increasingly, it also plays a role in integrating information about the same concept or object across different modalities and media — for example, relating a person's images to his or her descriptive information, or correlating information about an event on social media with corresponding sensor observations. In coming years, semantics will be crucial to integrating objects that straddle the cyber–physical–social or physical–virtual divide.
Semantics for Intelligent Processing and Reasoning
Much attention in the past has focused on data and information search and browsing, in which processing complexity is reduced because of significant human involvement in interpreting the results. As we move up the information-processing value chain from search and browsing to integration, analysis, situational awareness, and question-answering, information processing's complexity increases significantly. Looked at from another perspective, information processing is moving from keyword-based to object-based processing and on to relationship- and event-centric processing. As mentioned, relationships are at the heart of semantics, and, fundamentally, computations will need to focus on modeling, processing, and exploiting them ( http://bit.ly/rel-at-heart). In the case of formal languages, this will involve richer forms of integrated reasoning, incorporating inductive, deductive, abductive, and fuzzy reasoning. In addition, future advanced information processing won't be limited to silicon-based processing; rather, it will increasingly involve collaboration between humans and machines, with semantics-aware sensors as intermediaries.
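As a small illustration of the deductive piece of such relationship-centric reasoning, the sketch below forward-chains a transitivity rule over a fact set; the `locatedIn` relation and its facts are invented for illustration.

```python
# Deductive (forward-chaining) reasoning over relationship-centric facts:
# repeatedly apply (a locatedIn b) and (b locatedIn c) => (a locatedIn c)
# until no new facts emerge (the transitive closure).
facts = {
    ("Dayton", "locatedIn", "Ohio"),
    ("Ohio", "locatedIn", "USA"),
}

def forward_chain(facts):
    """Close the fact set under the transitivity rule for locatedIn."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, _, b) in list(facts):
            for (b2, _, c) in list(facts):
                if b == b2 and (a, "locatedIn", c) not in facts:
                    facts.add((a, "locatedIn", c))
                    changed = True
    return facts

closed = forward_chain(facts)
# The fact that Dayton is located in the USA is now entailed,
# even though it was never stated explicitly.
```

Deductive closure like this is what lets a query for "places in the USA" find Dayton; richer reasoners add many more rule forms, but the pattern of deriving implicit relationships from explicit ones is the same.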
Semantics for Knowledge-Enabled Computing
The power of human reasoning comes not only from the sophisticated computing abilities our brains support but also from background knowledge and past experiences. Similarly, the application of background knowledge to improve information processing is rapidly growing — from the improvement of information extraction, natural language processing, and machine learning to better understanding and processing of social and sensor data. We can now apply domain-independent models (related to time, space, and geographic concepts, for example) as well as domain-specific models of varying complexity and comprehensiveness, such as nomenclatures, taxonomies, and ontologies, to improve information processing. The ability to utilize user- and community-created dictionaries (such as urbandictionary.com) and knowledge repositories (MusicBrainz, for example) to derive structured information from unstructured data (as DBpedia does from Wikipedia) — and to reuse such knowledge to improve computation — has added significant strength to semantic processing.
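A toy sketch of this knowledge-enabled style of processing: an ambiguous mention is resolved by scoring its context against a mini knowledge base. The knowledge entries below are hand-made for illustration; real systems draw them from repositories like DBpedia.

```python
# Entity disambiguation via background knowledge: pick the candidate whose
# known associated terms best overlap the mention's surrounding words.
KNOWLEDGE = {
    "George Washington": {"president", "general", "revolution"},
    "Washington, D.C.":  {"capital", "city", "congress"},
}

def disambiguate(mention_context):
    """Return the candidate with the largest context-word overlap."""
    context = set(mention_context.lower().split())
    return max(KNOWLEDGE, key=lambda cand: len(KNOWLEDGE[cand] & context))

entity = disambiguate("Washington was a general in the revolution")
```

Without the background knowledge, nothing in the string itself says whether "Washington" is a person or a place; with it, a two-line overlap score suffices.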
Semantics for Abstractions and Human Experience
The increasing amount of data generated by 5 billion mobile phone users (the mobile phone being arguably the most important tool in human history, with many now having data connections), millions of social media users, and more than 40 billion mobile sensors is finding its way to the Web. A single four-hour flight might generate 240 terabytes of data. How much of it is useful for a given human need? However good our ability to search this much data, search simply isn't scalable in terms of the results humans can review and absorb. What we want is a few nuggets of information or insight that we can act on. We care about broader, aggregate understanding of events, improved decision making, and getting answers to our questions. And we care about enhancing the human experience.
Semantics is a core component of developing abstraction mechanisms so that we can use computing to support perception and cognition. Semantic approaches support abstractions that convert low-level data and observations into the high-level symbolic representations that constitute human perception and cognition. Semantics-empowered solutions can now analyze constantly streaming sensor or social data to surface abstractions and events of human interest (such as icy roads, blizzard conditions, the need for intervention to save crops, the chances that a movie will succeed, or the progress of a mass protest). My earlier article, "Computing for Human Experience" ( IEEE Internet Computing, January/February 2010), explores this topic a bit further. You can find examples of such approaches at http://knoesis.org/showcase.
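As an illustrative sketch (not a description of any deployed system), the function below lifts raw sensor readings to a symbolic event label using simple background rules; the thresholds and field names are invented.

```python
# Semantic abstraction over sensor observations: map low-level numeric
# readings to the high-level, actionable labels a person actually wants.
def abstract_road_condition(obs):
    """Return a human-level condition label for one set of readings."""
    if obs["surface_temp_c"] <= 0 and obs["precipitation"]:
        return "icy roads"
    if obs["wind_kmh"] >= 56 and obs["visibility_km"] < 0.4:
        return "blizzard conditions"
    return "normal"

readings = {"surface_temp_c": -2.0, "precipitation": True,
            "wind_kmh": 20, "visibility_km": 5.0}
condition = abstract_road_condition(readings)
```

In practice such rules come from domain ontologies and are applied continuously over streams, but the essential move is the same: many low-level observations in, one symbol a human can act on out.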
We've entered an exciting time for semantic computing. Semantics is changing contemporary Web applications, such as search, and will play a pivotal role in future computing that will span cyber–physical–social systems.
Amit Sheth is the director of the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) at Wright State University, a fellow of IEEE, and a LexisNexis Ohio Eminent Scholar. Contact him at firstname.lastname@example.org; http://knoesis.org/amit.