Learning to Adapt Web Information Extraction Knowledge and Discovering New Attributes via a Bayesian Approach
Issue No. 04 - April (2010 vol. 22)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.111
Tak-Lam Wong , The Chinese University of Hong Kong, Hong Kong
Wai Lam , The Chinese University of Hong Kong, Hong Kong
This paper presents a Bayesian learning framework for adapting information extraction wrappers with new attribute discovery, reducing human effort in extracting precise information from unseen Web sites. Our approach aims at automatically adapting the information extraction knowledge previously learned from a source Web site to a new unseen site, at the same time, discovering previously unseen attributes. Two kinds of text-related clues from the source Web site are considered. The first kind of clue is obtained from the extraction pattern contained in the previously learned wrapper. The second kind of clue is derived from the previously extracted or collected items. A generative model for the generation of the site-independent content information and the site-dependent layout format of the text fragments related to attribute values contained in a Web page is designed to harness the uncertainty involved. Bayesian learning and expectation-maximization (EM) techniques are developed under the proposed generative model for identifying new training data for learning the new wrapper for new unseen sites. Previously unseen attributes together with their semantic labels can also be discovered via another EM-based Bayesian learning based on the generative model. We have conducted extensive experiments from more than 30 real-world Web sites in three different domains to demonstrate the effectiveness of our framework.
Wrapper adaptation, Web mining, text mining, machine learning.
Tak-Lam Wong, Wai Lam, "Learning to Adapt Web Information Extraction Knowledge and Discovering New Attributes via a Bayesian Approach", IEEE Transactions on Knowledge & Data Engineering, vol. 22, no. , pp. 523-536, April 2010, doi:10.1109/TKDE.2009.111