Record Matching over Query Results from Multiple Web Databases
April 2010 (vol. 22 no. 4)
pp. 578-589
Weifeng Su, BNU-HKBU United International College and PKU-HKUST Shenzhen Hong Kong Institution, China
Jiying Wang, City University of Hong Kong, Hong Kong
Frederick H. Lochovsky, The Hong Kong University of Science and Technology, Hong Kong
Record matching, which identifies the records that represent the same real-world entity, is an important step in data integration. Most state-of-the-art record matching methods are supervised and therefore require the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results generated on the fly. Such records are query-dependent, so a method learned in advance from the results of previous queries may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates among the query result records of multiple Web databases. After the same-source duplicates are removed, the “presumed” nonduplicate records from the same source can be used as training examples, which relieves users of the burden of labeling training examples manually. Starting from this nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well in the Web database scenario, where existing supervised methods do not apply.
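
The abstract names a weighted component similarity summing classifier and a set of presumed nonduplicates drawn from a single source. The following is only a minimal sketch of those two ideas, not the UDD method itself: the field similarity measure (`difflib.SequenceMatcher`), the fixed weights, the threshold, and the toy records are all assumptions made for illustration, and the SVM classifier and the iterative cooperation between the two classifiers are left out.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_vector(r1, r2):
    """One similarity score per field, compared position by position."""
    return [field_similarity(x, y) for x, y in zip(r1, r2)]


def weighted_similarity(r1, r2, weights):
    """Weighted sum of the per-field similarities."""
    return sum(w * s for w, s in zip(weights, similarity_vector(r1, r2)))


def presumed_nonduplicates(source):
    """Pairs of records from one source, treated as nonduplicates
    once same-source duplicates have been removed."""
    return list(combinations(source, 2))


def find_duplicates(source_a, source_b, weights, threshold):
    """Cross-source pairs whose weighted similarity reaches the threshold."""
    return [
        (r1, r2)
        for r1, r2 in product(source_a, source_b)
        if weighted_similarity(r1, r2, weights) >= threshold
    ]


# Toy records (title, author); all values are made up for illustration.
source_a = [("Intro to Databases", "A. Lee"), ("Graph Algorithms", "B. Kim")]
source_b = [("Intro to Databases", "A. Lee"), ("Compiler Design", "C. Diaz")]

weights = [0.6, 0.4]   # hypothetical field weights, not learned here
threshold = 0.85       # hypothetical decision threshold

duplicates = find_duplicates(source_a, source_b, weights, threshold)
negatives = presumed_nonduplicates(source_a)
```

In this sketch the weights are fixed by hand. In the method the abstract describes, the nonduplicate set is the starting point for the classifiers, so a fuller version would presumably derive the weights from `negatives` rather than hard-code them.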

Index Terms:
Record matching, duplicate detection, record linkage, data deduplication, data integration, Web database, query result record, SVM.
Weifeng Su, Jiying Wang, Frederick H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, pp. 578-589, April 2010, doi:10.1109/TKDE.2009.90