Issue No.03 - March (2010 vol.22)
pp: 447-460
Wei Liu, Renmin University of China, Beijing
Xiaofeng Meng, Renmin University of China, Beijing
Weiyi Meng, Binghamton University, Binghamton
ABSTRACT
Deep Web contents are accessed by queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem due to the intricate underlying structures of such pages. Many techniques have been proposed to address this problem, but all of them have inherent limitations because they are Web-page-programming-language-dependent. Because Web pages are a popular two-dimensional medium, their contents are always displayed in a regular layout for users to browse. This motivates us to seek a different way to extract deep Web data, one that overcomes the limitations of previous work by exploiting common visual features of deep Web pages. In this paper, we propose a novel vision-based approach that is Web-page-programming-language-independent. The approach primarily uses the visual features of deep Web pages to perform deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.
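The abstract names the revision measure but does not define it. As a minimal sketch, assume revision is the fraction of Web databases whose extraction is not perfect, that is, where precision or recall falls below 100 percent, so that some manual correction would be needed. This definition, the `ExtractionResult` type, and all field names are illustrative assumptions, not taken from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractionResult:
    """Hypothetical per-database counts; names are illustrative assumptions."""
    name: str
    correct: int    # records extracted correctly
    extracted: int  # records the extractor returned
    actual: int     # records actually present on the result pages


def precision(r: ExtractionResult) -> float:
    # Share of returned records that are correct.
    return r.correct / r.extracted if r.extracted else 0.0


def recall(r: ExtractionResult) -> float:
    # Share of the records present that were extracted correctly.
    return r.correct / r.actual if r.actual else 0.0


def revision(results: List[ExtractionResult]) -> float:
    """Assumed definition: fraction of databases with imperfect extraction."""
    if not results:
        return 0.0
    # A database is "perfect" only when every returned record is correct
    # and no record is missed (precision and recall both 100 percent).
    imperfect = sum(
        1 for r in results
        if r.correct != r.extracted or r.correct != r.actual
    )
    return imperfect / len(results)


# Made-up sample counts, for illustration only.
sample = [
    ExtractionResult("db_a", correct=10, extracted=10, actual=10),
    ExtractionResult("db_b", correct=9, extracted=10, actual=10),
    ExtractionResult("db_c", correct=8, extracted=8, actual=10),
    ExtractionResult("db_d", correct=5, extracted=5, actual=5),
]
sample_revision = revision(sample)
```

Under this reading, a database with a single wrong or missing record counts fully toward revision, which is what separates the measure from averaged precision and recall: it tracks how many sources a person would have to revisit, not how many records were wrong.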
INDEX TERMS
Web mining, Web data extraction, visual features of deep Web pages, wrapper generation.
CITATION
Wei Liu, Xiaofeng Meng, Weiyi Meng, "ViDE: A Vision-Based Approach for Deep Web Data Extraction", IEEE Transactions on Knowledge & Data Engineering, vol.22, no. 3, pp. 447-460, March 2010, doi:10.1109/TKDE.2009.109