Effectively Mining and Using Coverage and Overlap Statistics for Data Integration
May 2005 (vol. 17 no. 5)
pp. 638-651
Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping their number low enough that storage and learning costs remain manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the number of needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics, both over controlled data sources and in the context of BibFinder with autonomous online sources.
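
To make the two statistics concrete, the sketch below computes them for a single query from the answer sets returned by each source. It is only an illustration and not the paper's learning method: it assumes that coverage is the fraction of a query's known answers that a source returns and that overlap is the fraction returned by both sources of a pair. The function name `coverage_and_overlap` and the sample sources are hypothetical.

```python
from itertools import combinations


def coverage_and_overlap(answers_by_source):
    """Compute per-source coverage and pairwise overlap for one query.

    answers_by_source maps a source name to the set of answer tuples
    that source returned. The definitions used here are assumptions
    made for illustration, not taken from the paper.
    """
    # Treat the union of all returned answers as the full answer set.
    all_answers = set().union(*answers_by_source.values())
    total = len(all_answers)
    if total == 0:
        return {}, {}

    # Coverage: share of the known answers that each source provides.
    coverage = {
        source: len(answers) / total
        for source, answers in answers_by_source.items()
    }

    # Overlap: share of the known answers that both sources provide.
    overlap = {
        (s1, s2): len(answers_by_source[s1] & answers_by_source[s2]) / total
        for s1, s2 in combinations(sorted(answers_by_source), 2)
    }
    return coverage, overlap


if __name__ == "__main__":
    # Hypothetical answer sets for one bibliography query.
    sample = {
        "SourceA": {"p1", "p2", "p3"},
        "SourceB": {"p3", "p4"},
        "SourceC": {"p5"},
    }
    coverage, overlap = coverage_and_overlap(sample)
    print(coverage)
    print(overlap)
```

Storing such numbers for every possible query and every subset of sources would grow very quickly, which is the cost the abstract says the hierarchical query classes and thresholds are meant to control.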

Index Terms:
Query optimization for data integration, coverage and overlap statistics, association rule mining.
Citation:
Zaiqing Nie, Subbarao Kambhampati, Ullas Nambiar, "Effectively Mining and Using Coverage and Overlap Statistics for Data Integration," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 5, pp. 638-651, May 2005, doi:10.1109/TKDE.2005.76