QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web
September 2005 (vol. 17 no. 9)
pp. 1247-1262
Ling Liu, IEEE Computer Society
This paper presents the QA-Pagelet as a fundamental data preparation technique for large-scale data analysis of the Deep Web. To support QA-Pagelet extraction, we present the Thor framework for sampling, locating, and partitioning the QA-Pagelets from the Deep Web. Two unique features of the Thor framework are 1) the novel page clustering for grouping pages from a Deep Web source into distinct clusters of control-flow dependent pages and 2) the novel subtree filtering algorithm that exploits the structural and content similarity at the subtree level to identify the QA-Pagelets within highly ranked page clusters. We evaluate the effectiveness of the Thor framework through experiments using both simulation and real data sets. We show that Thor performs well over millions of Deep Web pages and over a wide range of sources, including e-Commerce sites, general and specialized search engines, corporate Web sites, medical and legal resources, and several others. Our experiments also show that the proposed page clustering algorithm achieves low-entropy clusters, and the subtree filtering algorithm identifies QA-Pagelets with excellent precision and recall.
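
The abstract reports cluster entropy, precision, and recall without defining them. As a rough illustration only, the sketch below uses the standard textbook definitions of these measures; the paper's exact formulations may differ, and the function names and toy labels here are hypothetical, not taken from the Thor framework.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (base 2) of the class labels within one cluster."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def clustering_entropy(clusters):
    """Size-weighted average entropy over all clusters (assumed definition)."""
    n = sum(len(c) for c in clusters)
    if n == 0:
        return 0.0
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


def precision_recall(identified, relevant):
    """Precision and recall of an identified set against a ground-truth set."""
    identified, relevant = set(identified), set(relevant)
    hits = len(identified & relevant)
    precision = hits / len(identified) if identified else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: each inner list holds the true page class of every page
# that a clustering placed in the same cluster (labels are made up).
clusters = [["results", "results"], ["results", "error"]]
entropy = clustering_entropy(clusters)
precision, recall = precision_recall(["x", "y"], ["y", "z", "w"])
```

A pure cluster contributes zero entropy, so a lower weighted average means the clusters line up more closely with the true page classes.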

Index Terms:
Deep Web, data preparation, data extraction, pagelets, clustering.
Citation:
James Caverlee, Ling Liu, "QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1247-1262, Sept. 2005, doi:10.1109/TKDE.2005.151