Issue No.04 - April (2010 vol.22)
pp: 523-536
Tak-Lam Wong, The Chinese University of Hong Kong, Hong Kong
Wai Lam, The Chinese University of Hong Kong, Hong Kong
This paper presents a Bayesian learning framework for adapting information extraction wrappers with new attribute discovery, reducing the human effort needed to extract precise information from unseen Web sites. Our approach aims to automatically adapt the information extraction knowledge previously learned from a source Web site to a new unseen site while, at the same time, discovering previously unseen attributes. Two kinds of text-related clues from the source Web site are considered. The first kind of clue is obtained from the extraction pattern contained in the previously learned wrapper. The second kind of clue is derived from the previously extracted or collected items. To harness the uncertainty involved, we design a generative model for the site-independent content information and the site-dependent layout format of the text fragments related to attribute values contained in a Web page. Bayesian learning and expectation-maximization (EM) techniques are developed under the proposed generative model to identify new training data for learning the new wrapper for new unseen sites. Previously unseen attributes, together with their semantic labels, can also be discovered via another EM-based Bayesian learning method built on the same generative model. We have conducted extensive experiments on more than 30 real-world Web sites in three different domains to demonstrate the effectiveness of our framework.
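The abstract does not spell out the generative model, so the following is only a generic illustration of the EM idea it mentions, not the authors' method. It is a minimal sketch of semi-supervised naive Bayes trained with EM: a few labeled text fragments (standing in for items previously extracted from a source site) guide the soft labeling of unlabeled fragments (standing in for text from a new site). The function names `train_em` and `posterior`, the two-class setup, and the toy tokens are all assumptions made for this example.

```python
import math
from collections import Counter


def posterior(tokens, log_prior, log_word):
    """Return [P(class 0 | tokens), P(class 1 | tokens)] under naive Bayes."""
    scores = [
        log_prior[c] + sum(log_word[c][t] for t in tokens if t in log_word[c])
        for c in (0, 1)
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_em(labeled, unlabeled, n_iter=10, alpha=1.0):
    """Semi-supervised naive Bayes via EM (illustrative, hypothetical setup).

    labeled:   list of (tokens, label) pairs with label in {0, 1}
    unlabeled: list of token lists
    Returns (log_prior, log_word, soft), where soft[i] is the posterior
    class distribution of the i-th unlabeled fragment.
    """
    vocab = sorted(
        {t for toks, _ in labeled for t in toks}
        | {t for toks in unlabeled for t in toks}
    )
    # Labeled fragments keep fixed responsibilities; unlabeled start uniform.
    fixed = [(toks, [1.0 - y, float(y)]) for toks, y in labeled]
    soft = [[0.5, 0.5] for _ in unlabeled]

    for _ in range(n_iter):
        # M-step: re-estimate smoothed parameters from soft counts.
        class_mass = [alpha, alpha]
        counts = [Counter(), Counter()]
        for toks, resp in fixed + list(zip(unlabeled, soft)):
            for c in (0, 1):
                class_mass[c] += resp[c]
                for t in toks:
                    counts[c][t] += resp[c]
        total = sum(class_mass)
        log_prior = [math.log(m / total) for m in class_mass]
        log_word = []
        for c in (0, 1):
            denom = sum(counts[c].values()) + alpha * len(vocab)
            log_word.append(
                {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
            )
        # E-step: recompute responsibilities for the unlabeled fragments.
        soft = [posterior(toks, log_prior, log_word) for toks in unlabeled]

    return log_prior, log_word, soft


# Toy data: class 1 = attribute-like fragment, class 0 = other page text.
labeled = [(["digital", "camera", "zoom"], 1), (["contact", "us", "shipping"], 0)]
unlabeled = [["camera", "zoom", "lens"], ["shipping", "policy", "contact"]]
log_prior, log_word, soft = train_em(labeled, unlabeled)
```

In this toy setting, unlabeled fragments that share tokens with a labeled class drift toward that class over the iterations; fragments with a confident posterior could then serve as new training data, in the spirit of the adaptation step the abstract describes.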
Wrapper adaptation, Web mining, text mining, machine learning.
Tak-Lam Wong, Wai Lam, "Learning to Adapt Web Information Extraction Knowledge and Discovering New Attributes via a Bayesian Approach", IEEE Transactions on Knowledge & Data Engineering, vol. 22, no. 4, pp. 523-536, April 2010, doi:10.1109/TKDE.2009.111