Structured Data Extraction from the Web Based on Partial Tree Alignment
December 2006 (vol. 18 no. 12)
pp. 1614-1628
This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide languages to facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (data extraction programs) from human-labeled examples. Manual labeling is time-consuming and hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, they usually need multiple pages that conform to a common schema as input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
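
To make the "tree matching" used for record segmentation more concrete, the following is a minimal sketch of a top-down ordered tree matching computed by dynamic programming. The abstract does not say which matching algorithm or similarity measure is used, so the choice of algorithm, the `Node` class, the function names, and the normalization in `similarity` are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a tag node in a page's tag tree."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(tree_size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum top-down ordered matching.

    Roots must carry the same tag; children are matched in order,
    like a longest-common-subsequence over the child subtrees.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i children of a and first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs over the average tree size."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two hypothetical record subtrees: the second lacks the trailing <span> cell.
rec1 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")]),
                   Node("td", [Node("span")])])
rec2 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")])])

print(simple_tree_matching(rec1, rec2), round(similarity(rec1, rec2), 3))
```

Under these assumptions, adjacent subtrees whose similarity exceeds a chosen threshold would be treated as candidate data records of the same kind; the threshold itself is not given in the abstract.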

Index Terms:
Web data extraction, wrapper generation, partial tree alignment, Web mining.
Yanhong Zhai, Bing Liu, "Structured Data Extraction from the Web Based on Partial Tree Alignment," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 12, pp. 1614-1628, Dec. 2006, doi:10.1109/TKDE.2006.197