A Survey of Web Information Extraction Systems
October 2006 (vol. 18 no. 10)
pp. 1411-1428
Chia-Hui Chang, IEEE Computer Society
Moheb Ramzy Girgis, IEEE Computer Society
The Internet presents a huge amount of useful information that is usually formatted for its human users, which makes it difficult to extract relevant data from various sources. Therefore, the availability of robust, flexible Information Extraction (IE) systems that transform Web pages into program-friendly structures, such as a relational database, will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, the results generated by distinct tools can be directly compared in only a few cases, since the addressed extraction tasks are different. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the techniques used, and the automation degree. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation for IE systems. We believe these criteria provide qualitative measures to evaluate various IE approaches.

Index Terms:
Information extraction, Web mining, wrapper, wrapper induction.
Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled F. Shaalan, "A Survey of Web Information Extraction Systems," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1411-1428, Oct. 2006, doi:10.1109/TKDE.2006.152