2011 International Conference on Document Analysis and Recognition
Data Extraction from Web Tables: The Devil is in the Details
Beijing, China
September 18-September 21
ISBN: 978-0-7695-4520-2
We present a method based on header paths for efficient and complete extraction of labeled data from tables meant for humans. Although many table configurations yield to the proposed syntactic analysis, some require access to semantic knowledge. Clicking on one or two critical cells per table, through a simple interface, is sufficient to resolve most of these problem tables. Header paths, a purely syntactic representation of visual tables, can be transformed ("factored") into existing representations of structured data such as category trees, relational tables, and RDF triples. From a random sample of 200 web tables from ten large statistical web sites, we generated 376 relational tables and 34,110 subject-predicate-object RDF triples.
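As a rough illustration of the idea of factoring header paths into RDF-style triples, the sketch below indexes each data cell by a row header path and a column header path and emits one (subject, predicate, object) tuple per cell. Everything in it is an assumption made for illustration: the function name `cell_triples`, the choice of the row path as subject and the column path as predicate, the `/` separator, and the toy table with invented values. It is not the authors' implementation.

```python
def cell_triples(row_paths, col_paths, data):
    """Emit one (subject, predicate, object) tuple per data cell.

    Hypothetical mapping: the row header path becomes the subject and the
    column header path becomes the predicate. The paper may factor header
    paths differently.
    """
    triples = []
    for r, row_path in enumerate(row_paths):
        for c, col_path in enumerate(col_paths):
            subject = "/".join(row_path)
            predicate = "/".join(col_path)
            triples.append((subject, predicate, data[r][c]))
    return triples


# Toy table with invented labels and values: two row headers and a
# two-level column header ("Population" spanning "2000" and "2010").
row_paths = [("Canada",), ("Mexico",)]
col_paths = [("Population", "2000"), ("Population", "2010")]
data = [[10, 20],
        [30, 40]]

triples = cell_triples(row_paths, col_paths, data)
for triple in triples:
    print(triple)
```

Each header path is kept as a tuple of labels from the outermost header to the innermost, so a multi-level header such as "Population" over "2000" stays distinguishable from any other column labeled "2000".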
Index Terms:
visual table, relational table, RDF, header-paths
Citation:
George Nagy, Sharad Seth, Dongpu Jin, David W. Embley, Spencer Machado, Mukkai Krishnamoorthy, "Data Extraction from Web Tables: The Devil is in the Details," 2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 242-246, 2011