Issue No.09 - Sept. (2012 vol.24)
pp: 1537-1555
Peter Christen , The Australian National University, Canberra
ABSTRACT
Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
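To illustrate how indexing reduces the number of record pairs to be compared, the following is a minimal sketch of traditional blocking, which appears among the index terms for this paper. It is not taken from the paper itself: the toy records, the `blocking_key` function (first letter of surname plus postcode), and the helper names are all assumptions made for illustration. Only records that share a blocking key value are paired; the reduction ratio, a commonly used measure, reports the fraction of all possible pairs that were never generated.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records (id, surname, postcode); not from the paper's data sets.
RECORDS = [
    (1, "smith", "2600"),
    (2, "smyth", "2600"),
    (3, "miller", "2601"),
    (4, "millar", "2601"),
    (5, "jones", "2602"),
    (6, "smith", "2600"),
]


def blocking_key(record):
    """Assumed blocking key: first letter of the surname plus the postcode."""
    _, surname, postcode = record
    return surname[0] + postcode


def all_pairs(records):
    """Every unordered pair of record ids (the naive deduplication workload)."""
    ids = sorted(r[0] for r in records)
    return set(combinations(ids, 2))


def blocked_pairs(records, key_func=blocking_key):
    """Candidate pairs: only records sharing a blocking key value are paired."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record[0])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def reduction_ratio(records, candidate_pairs):
    """Fraction of all possible pairs that the index avoided generating."""
    return 1.0 - len(candidate_pairs) / len(all_pairs(records))


if __name__ == "__main__":
    candidates = blocked_pairs(RECORDS)
    print(len(all_pairs(RECORDS)), "pairs without indexing")
    print(len(candidates), "candidate pairs with blocking:", sorted(candidates))
    print("reduction ratio:", round(reduction_ratio(RECORDS, candidates), 3))
```

The choice of blocking key drives the trade-off described in the abstract: a very specific key yields few candidate pairs but may separate true matches (for example, a typo in the first letter of a surname), whereas a coarse key keeps more true matches at the cost of more comparisons.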
INDEX TERMS
Couplings, Indexing, Cleaning, Encoding, Complexity theory, scalability, Data linkage, data matching, entity resolution, index techniques, blocking, experimental evaluation
CITATION
Peter Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 9, pp. 1537-1555, Sept. 2012, doi:10.1109/TKDE.2011.127