Issue No.03 - March (2012 vol.24)
pp: 399-412
Moisés G. de Carvalho , Nokia INdT, Brazil
Alberto H.F. Laender , Federal University of Minas Gerais, Belo Horizonte
Marcos André Gonçalves , Federal University of Minas Gerais, Belo Horizonte
Altigran S. da Silva , Federal University of Amazonas, Manaus
ABSTRACT
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of that, private and government organizations have invested significantly in developing methods for removing replicas from their data repositories. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data. In this paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica identification boundary, freeing the user from the burden of having to choose and tune this parameter.
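To make the idea of a deduplication function concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each piece of evidence is a similarity score computed on one attribute of a record pair, that a candidate function is a small expression tree over those scores, and that a pair is flagged as a replica when the tree's output exceeds a fixed boundary. The similarity measure (difflib), the tree encoding, the operator set, and the F1-based fitness are all illustrative assumptions; the abstract does not specify them, and the evolutionary search itself is omitted.

```python
from difflib import SequenceMatcher

# Assumption: a generic string similarity in [0, 1]; the abstract does not
# say which similarity measures the authors use.
def string_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def extract_evidence(rec_a, rec_b, attributes):
    """One piece of evidence per attribute: the similarity of its two values."""
    return {attr: string_similarity(rec_a[attr], rec_b[attr]) for attr in attributes}

# Hypothetical tree encoding for a candidate function:
#   str   -> name of a piece of evidence (leaf)
#   float -> numeric constant (leaf)
#   tuple -> (operator, left subtree, right subtree)
OPS = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}

def evaluate(tree, evidence):
    """Recursively evaluate a candidate function on one record pair's evidence."""
    if isinstance(tree, str):
        return evidence[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return OPS[op](evaluate(left, evidence), evaluate(right, evidence))

def is_replica(tree, rec_a, rec_b, attributes, boundary):
    """Flag the pair as a replica when the function value exceeds the fixed boundary."""
    return evaluate(tree, extract_evidence(rec_a, rec_b, attributes)) > boundary

def f1_fitness(tree, labeled_pairs, attributes, boundary):
    """Assumed fitness: F1 of the candidate on labeled (rec_a, rec_b, is_dup) pairs."""
    tp = fp = fn = 0
    for rec_a, rec_b, label in labeled_pairs:
        predicted = is_replica(tree, rec_a, rec_b, attributes, boundary)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a genetic programming setting, a population of such trees would be varied by crossover and mutation and ranked by a fitness like the one above. Because the boundary stays fixed while the trees change, the search adapts the function to the boundary, so the user does not have to tune the boundary itself.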
INDEX TERMS
Database administration, evolutionary computing and genetic algorithms, database integration.
CITATION
Moisés G. de Carvalho, Alberto H.F. Laender, Marcos André Gonçalves, Altigran S. da Silva, "A Genetic Programming Approach to Record Deduplication", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 3, pp. 399-412, March 2012, doi:10.1109/TKDE.2010.234