Fourth International Conference on Computer and Information Technology (CIT'04)
Extraction and Integration Information in HTML Tables
Wuhan, China
September 14-September 16
ISBN: 0-7695-2216-5
Shijun Li, Wuhan University
Zhiyong Peng, Wuhan University
Mengchi Liu, Carleton University
A large amount of the information available on the Web is formatted in HTML tables, which are mainly presentation-oriented and are not well suited to database applications. Capturing the information in HTML tables semantically and integrating the relevant information is therefore a challenge. In this paper, we present a new approach that automatically captures the semantic hierarchies of HTML tables and semi-automatically integrates HTML tables. It first captures the attribute-value pairs in HTML tables automatically by normalization, and introduces the notion of an eigen-value of the formatting information to recognize the headings of HTML tables. After the global concepts and the global schema have been generated manually by defining which data are to be integrated, it learns the lexical semantic set and the contexts for each global concept by labelling the attributes of example HTML tables with their corresponding global concepts. Finally, it integrates the data of each source HTML table, using the lexical semantic sets and the contexts to eliminate conflicts and to resolve nondeterminism in mapping each source schema to the global schema.
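To make the normalization step concrete, here is a minimal sketch of turning an HTML table into attribute-value pairs. It is not the authors' algorithm: it only expands `rowspan` and `colspan` into a rectangular grid and then assumes that the first row holds the headings, whereas the paper recognizes headings from formatting information. The names `TableGridParser`, `normalize`, and `attribute_value_pairs` are hypothetical and chosen for illustration. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect the cells of every <tr> as (text, rowspan, colspan) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell tuples per <tr>
        self._cell = None   # text fragments of the open cell, or None
        self._span = (1, 1)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._span = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append((text, *self._span))
            self._cell = None


def normalize(rows):
    """Expand rowspan/colspan so that every grid position holds one value."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # position already filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def attribute_value_pairs(html):
    """Return one {heading: value} dict per body row of the table."""
    parser = TableGridParser()
    parser.feed(html)
    grid = normalize(parser.rows)
    if not grid:
        return []
    headings, body = grid[0], grid[1:]  # assumption: the first row is the heading
    return [dict(zip(headings, row)) for row in body]


example_html = """
<table>
  <tr><th>Model</th><th>Color</th><th>Price</th></tr>
  <tr><td rowspan="2">X1</td><td>red</td><td>10</td></tr>
  <tr><td>blue</td><td>12</td></tr>
</table>
"""
for pairs in attribute_value_pairs(example_html):
    print(pairs)
# {'Model': 'X1', 'Color': 'red', 'Price': '10'}
# {'Model': 'X1', 'Color': 'blue', 'Price': '12'}
```

The spanning cell `X1` is repeated in both rows of the normalized grid, so each row becomes a complete set of attribute-value pairs that could then be mapped to a global schema.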
Citation:
Shijun Li, Zhiyong Peng, Mengchi Liu, "Extraction and Integration Information in HTML Tables," Fourth International Conference on Computer and Information Technology (CIT'04), pp. 315-320, 2004