Duplicate Record Detection: A Survey
January 2007 (vol. 19 no. 1)
pp. 1-16
Ahmed K. Elmagarmid
Panagiotis G. Ipeirotis, IEEE Computer Society
Vassilios S. Verykios, IEEE Computer Society
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
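
As a rough illustration of what a field-level similarity metric looks like, the sketch below scores two field entries with a normalized edit (Levenshtein) distance and flags pairs above a threshold. It is only an assumption-laden example, not the method of the paper: the function names, the lowercase/strip normalization, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization.
    The normalization (strip + lowercase) is an illustrative assumption."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def likely_duplicates(values, threshold=0.8):
    """Return index pairs whose similarity meets the (hypothetical) threshold.
    Compares every pair, so the cost grows quadratically with the input."""
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if field_similarity(values[i], values[j]) >= threshold:
                pairs.append((i, j))
    return pairs


names = ["John Smith", "Jon Smith", "Mary Jones"]
print(likely_duplicates(names))
```

The all-pairs loop is deliberately naive. Its quadratic cost is the kind of overhead that the efficiency and scalability techniques mentioned in the abstract are meant to reduce.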

Index Terms:
Duplicate detection, data cleaning, data integration, record linkage, data deduplication, instance identification, database hardening, name matching, identity uncertainty, entity resolution, fuzzy duplicate detection, entity matching.
Citation:
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios, "Duplicate Record Detection: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1-16, Jan. 2007, doi:10.1109/TKDE.2007.9