21st International Conference on Data Engineering (ICDE'05)
Bootstrapping Semantic Annotation for Content-Rich HTML Documents
Tokyo, Japan
April 05-April 08
ISBN: 0-7695-2285-8
Saikat Mukherjee, State University of New York at Stony Brook
I. V. Ramakrishnan, State University of New York at Stony Brook
Amarjeet Singh, State University of New York at Stony Brook
An enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
Citation:
Saikat Mukherjee, I. V. Ramakrishnan, Amarjeet Singh, "Bootstrapping Semantic Annotation for Content-Rich HTML Documents," icde, pp.583-593, 21st International Conference on Data Engineering (ICDE'05), 2005