2012 IEEE 12th International Conference on Data Mining Workshops (2012)
Brussels, Belgium Belgium
Dec. 10, 2012 to Dec. 10, 2012
The problem of scalable knowledge extraction from the Web has attracted much attention in the past decade. However, it is under explored how to extract the structured knowledge from semi-structured Websites in a fully automatic and scalable way. In this work, we define the table-formatted structured data with clear schema as Knowledge Tables and propose a scalable learning system, which is named as Kable to extract knowledge from semi-structured Websites automatically in a never ending and scalable way. Kable consists of two major components, which are auto wrapper induction and schema matching respectively. In contrast to the state of the art auto wrappers for semi-structured Web sites, our adopted approach can run around 1'000 times faster, which makes the Web scale knowledge extraction possible. On the other hand, we propose a novel schema matching solution which can work effectively on the auto-extracted structured data. With 3 months' continuous run using ten Web servers, we successfully extracted 427,105,009 knowledge facts. The manual labeling over sampled knowledge extracted show the up to 87% precision for supporting various Web applications.
Data mining, Knowledge engineering, Motion pictures, Clustering algorithms, Knowledge based systems, Algorithm design and analysis, Manganese, schema matching, information extraction system, knowledge table
Y. Gu, L. Ji, Z. Jiang and J. He, "Endless and Scalable Knowledge Table Extraction from Semi-structured Websites," 2012 IEEE 12th International Conference on Data Mining Workshops(ICDMW), Brussels, Belgium Belgium, 2012, pp. 835-842.