2012 IEEE 12th International Conference on Data Mining Workshops
Endless and Scalable Knowledge Table Extraction from Semi-structured Websites
Brussels, Belgium
December 10, 2012
ISBN: 978-1-4673-5164-5
The problem of scalable knowledge extraction from the Web has attracted much attention in the past decade. However, how to extract structured knowledge from semi-structured Websites in a fully automatic and scalable way remains underexplored. In this work, we define table-formatted structured data with a clear schema as Knowledge Tables and propose a scalable learning system, named Kable, that extracts knowledge from semi-structured Websites automatically in a never-ending and scalable way. Kable consists of two major components: auto wrapper induction and schema matching. Compared with state-of-the-art auto wrappers for semi-structured Websites, our adopted approach runs around 1,000 times faster, which makes Web-scale knowledge extraction possible. In addition, we propose a novel schema matching solution that works effectively on the auto-extracted structured data. With three months of continuous running on ten Web servers, we successfully extracted 427,105,009 knowledge facts. Manual labeling of a sample of the extracted knowledge shows up to 87% precision for supporting various Web applications.
Index Terms:
Data mining, Knowledge engineering, Motion pictures, Clustering algorithms, Knowledge based systems, Algorithm design and analysis, Manganese, schema matching, information extraction system, knowledge table
Citation:
Yingqin Gu, Lei Ji, Ziheng Jiang, Jun He, "Endless and Scalable Knowledge Table Extraction from Semi-structured Websites," 2012 IEEE 12th International Conference on Data Mining Workshops (ICDMW), pp. 835-842, 2012