Issue No.05 - May (2010 vol.22)
pp: 639-650
Yong Cao, Microsoft Research Asia, Beijing
Zaiqing Nie, Microsoft Research Asia, Beijing
Chunyu Yang, Tsinghua University, Beijing
Ji-Rong Wen, Microsoft Research Asia, Beijing
The two most important tasks in information extraction from the Web are webpage structure understanding and natural language sentence processing. However, little work has been done toward an integrated statistical model for understanding webpage structures and processing the natural language sentences within HTML elements. Our recent work on webpage understanding introduced a joint model of Hierarchical Conditional Random Fields (HCRFs) and extended Semi-Markov Conditional Random Fields (Semi-CRFs) that leverages page structure understanding results in free-text segmentation and labeling. In this top-down integration model, the decisions of the HCRF model guide the decisions of the Semi-CRF model. The drawback of the top-down strategy is equally apparent: the decisions of the Semi-CRF model cannot be used by the HCRF model to guide its own decisions. This paper proposes a novel framework called WebNLP, which enables bidirectional integration of page structure understanding and text understanding in an iterative manner. We have applied the proposed framework to local business entity extraction and to Chinese person and organization name extraction. Experiments show that the WebNLP framework achieves significantly better performance than existing methods.
Natural language processing, webpage understanding, conditional random fields.
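The iterative, bidirectional idea described in the abstract can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the WebNLP algorithm: the names `closed_loop`, `toy_structure`, and `toy_text` are hypothetical, and the two rule-based labelers merely stand in for the HCRF and Semi-CRF models, whose features and inference are not given here. The point it shows is that each labeler reads the other's latest output, and the loop stops once neither set of labels changes.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in type: a labeler maps (blocks, other model's labels) to labels.
Labeler = Callable[[List[str], List[str]], List[str]]


def closed_loop(
    blocks: List[str],
    label_structure: Labeler,
    label_text: Labeler,
    max_iters: int = 10,
) -> Tuple[List[str], List[str], int]:
    """Alternate the two labelers until both outputs stop changing."""
    structure_labels = ["UNKNOWN"] * len(blocks)
    text_labels = ["UNKNOWN"] * len(blocks)
    for iteration in range(1, max_iters + 1):
        # Bottom-up direction: structure labeling sees the current text labels.
        new_structure = label_structure(blocks, text_labels)
        # Top-down direction: text labeling sees the refreshed structure labels.
        new_text = label_text(blocks, new_structure)
        if new_structure == structure_labels and new_text == text_labels:
            return structure_labels, text_labels, iteration
        structure_labels, text_labels = new_structure, new_text
    return structure_labels, text_labels, max_iters


def toy_structure(blocks: List[str], text_labels: List[str]) -> List[str]:
    # Toy rule (an assumption): a block is an address block if its text says so.
    return ["ADDRESS_BLOCK" if t == "HAS_ADDRESS" else "OTHER" for t in text_labels]


def toy_text(blocks: List[str], structure_labels: List[str]) -> List[str]:
    # Toy rule (an assumption): digits in the text or an address block imply an address.
    labels = []
    for block, structure in zip(blocks, structure_labels):
        has_digit = any(ch.isdigit() for ch in block)
        is_address = has_digit or structure == "ADDRESS_BLOCK"
        labels.append("HAS_ADDRESS" if is_address else "NO_ADDRESS")
    return labels


structure, text, iterations = closed_loop(
    ["12 Main Street", "About us"], toy_structure, toy_text
)
```

In a purely top-down pipeline the structure labeler would run once and never see `text_labels`; here the first pass labels every block `OTHER`, and only the feedback from the text labeler lets the second pass correct the first block.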
Yong Cao, Zaiqing Nie, Chunyu Yang, Ji-Rong Wen, "Closing the Loop in Webpage Understanding", IEEE Transactions on Knowledge & Data Engineering, vol.22, no. 5, pp. 639-650, May 2010, doi:10.1109/TKDE.2009.155