New Project Could Promote Semantic Web
by George Lawton
The Semantic Web, long touted as the Web's next generation, has never really taken off.
According to Yahoo researcher and data architect Peter Mika, an important reason that website developers haven't flocked to the approach is that they must contend with multiple incompatible specifications to do so.
Now, though, Schema.org — a new project organized by major search-engine providers Google, Microsoft, and Yahoo — promises to improve Semantic Web adoption by building support for a single set of structured data specifications.
These specifications would let website publishers use markup tags to describe the words, numbers, and other material on their pages more accurately, enabling search engines and other applications to determine their meaning in context more clearly.
The specifications promise to enable applications to work with more Web-based data, noted Manu Sporny, chair of the World Wide Web Consortium's (W3C's) RDF Web Applications Working Group.
"[This would] make the Web more transparent, efficient, and open," said Sporny, who is also president of Digital Bazaar, which develops technology for buying and selling digital content online.
In the short run, agreement among the big search-engine providers could raise the visibility of semantic technologies and encourage website operators to incorporate semantic data.
However, Schema.org's exclusive use of microdata, just one of several structured-data approaches, also raises concerns that the specification's limitations could create scalability and data-management problems.
The Semantic Web's Promise
The term Semantic Web was coined in 2001 by W3C director Tim Berners-Lee — the approach's most vocal advocate and the World Wide Web's inventor.
Since then, though, adoption of Semantic Web principles has been somewhat limited.
Nonetheless, said Mika, the approach has been successful so far in two major ways.
Linked data
Sophisticated Web users have built a network of linked data, a system of linked datasets describing the same people, places, and things.
This makes searching for various content sources that address the same concepts much easier, said Mika.
"These are powerful technologies that let sophisticated users find, share, and process data, as well as find related datasets," Mika explained.
Embedding data in webpages
Structured data specifications have emerged for embedding data directly in HTML pages.
The most common specifications are microdata (www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#microdata), microformats (http://microformats.org), and RDFa (Resource Description Framework in attributes, www.w3.org/TR/xhtml-rdfa-primer).
Using these specifications, embedding data in HTML can be as simple as adding a few new attribute values to the HTML tags.
However, they take different approaches to using tags to embed information about words, numbers, and other elements in webpages. They also represent different trade-offs in terms of expressiveness and markup complexity.
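To make the idea of "adding a few new attribute values" concrete, here is a minimal sketch of what microdata markup can look like and how an application might read it. The page fragment, the person's name, and the ItempropExtractor class are hypothetical and written only for illustration. The itemscope, itemtype, and itemprop attributes come from the microdata specification, and the Person type is assumed to be drawn from the Schema.org vocabulary. The parser is deliberately simplified: it handles one flat item and ignores nested items, which a real consumer would need to support.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: ordinary HTML with three microdata
# attributes (itemscope, itemtype, itemprop) added to existing tags.
PAGE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span> works as a
  <span itemprop="jobTitle">data architect</span>.
</div>
"""


class ItempropExtractor(HTMLParser):
    """Illustrative reader for a single, non-nested microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None   # value of itemtype on the itemscope element
        self.properties = {}    # itemprop name -> collected text
        self._current = None    # itemprop currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]
            self.properties[self._current] = ""

    def handle_data(self, data):
        # Only text inside an element carrying itemprop is recorded.
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


def extract(html):
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    cleaned = {name: text.strip() for name, text in parser.properties.items()}
    return parser.item_type, cleaned


item_type, props = extract(PAGE)
print(item_type, props)
```

The visible text of the page is unchanged by the markup; only the attributes tell a search engine or other application that "Jane Doe" is a person's name rather than an arbitrary string.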
Trouble in paradise
The use of structured-data technologies, which would help advance Semantic Web implementation, has not been widespread. In fact, Mika said, only about 5 percent of all websites incorporate structured data, due in part to the existence of multiple, incompatible specifications.
"Unfortunately, [in the past], the [major] search engines settled on different syntaxes and different sets of vocabularies, often for the same type of information," he explained.
Thus, in the worst case, website developers who want their content to work with each of the engines have to mark up their pages in multiple ways.
This has made the process complex and error-prone, resulting in low adoption, Mika said.
Schema.org
As part of Schema.org, Google, Microsoft, and Yahoo have agreed to support a single set of structured data specifications and vocabularies based on the microdata approach.
This removes the largest obstacle to Semantic Web adoption, which is markup fragmentation, said Mika.
Website owners and search-engine-optimization specialists who use Schema.org markup will know that the three major search engines will understand their pages. And as Web-development tools increase support for microdata, the Schema.org specifications should become easier to implement.
However, some researchers are concerned about Schema.org's focus on microdata.
The Web is so diverse that creating a single vocabulary to address all markup needs has not proven practical, Sporny explained.
For example, he said, RDFa is better than microdata at data typing and managing large datasets.
In the long run, Sporny said, Schema.org's decision to focus exclusively on microdata might create scaling and management problems because microdata's computational and storage requirements grow faster with increased data volumes than those of the other structured-data specifications.
He noted that Google, Microsoft, and Yahoo agreed on their Schema.org approach without consulting the W3C.
The Schema.org website states, "We will … be monitoring the web for RDFa and microformats adoption, and if they pick up, we will look into supporting these syntaxes." But there are no plans to do so yet.
In the short run, the search engines will be the ones to drive any such moves, said Mika.
He added, "We expect that Schema.org will have an impact in the next few months. Currently, only 5 percent of webpages employ some form of markup, so there is room to grow."
George Lawton is a freelance journalist based in Guerneville, CA. You can contact him at glawton@glawton.com.