Issue No.05 - May (2013 vol.25)
pp: 1111-1124
Steven Euijong Whang, Google, Inc., Mountain View
David Marmaros, Google, Inc., Mountain View
Hector Garcia-Molina, Stanford University, Stanford
ABSTRACT
Entity resolution (ER) is the problem of identifying which records in a database refer to the same entity. In practice, many applications need to resolve large data sets efficiently, but do not require the ER result to be exact. For example, people data from the web may simply be too large to completely resolve with a reasonable amount of work. As another example, real-time applications may not be able to tolerate any ER processing that takes longer than a certain amount of time. This paper investigates how we can maximize the progress of ER with a limited amount of work using “hints,” which give information on records that are likely to refer to the same real-world entity. A hint can be represented in various formats (e.g., a grouping of records based on their likelihood of matching), and ER can use this information as a guideline for which records to compare first. We introduce a family of techniques for constructing hints efficiently and techniques for using the hints to maximize the number of matching records identified using a limited amount of work. Using real data sets, we illustrate the potential gains of our pay-as-you-go approach compared to running ER without using hints.
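The idea of letting a hint decide which records to compare first can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes the hint is a list of record pairs sorted by a cheap similarity estimate, that the match relation is transitive (so pairs already placed in the same cluster can be skipped), and that "work" is counted as calls to an expensive match function. All names (`jaccard`, `build_hint`, `resolve_with_budget`, `is_match`, `budget`) are hypothetical, and the hint is built by enumerating every pair only for brevity; this does not reflect the efficient construction techniques the paper refers to.

```python
from itertools import combinations


def jaccard(a, b):
    """Cheap similarity estimate: Jaccard overlap of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_hint(records, estimate):
    """Return all index pairs, ordered from most to least likely match.

    Assumption: enumerating all pairs is affordable here; a real hint
    would be built far more cheaply than this.
    """
    pairs = combinations(range(len(records)), 2)
    return sorted(
        pairs,
        key=lambda p: estimate(records[p[0]], records[p[1]]),
        reverse=True,
    )


def resolve_with_budget(records, hint, is_match, budget):
    """Call is_match on at most `budget` pairs, following the hint order."""
    parent = list(range(len(records)))  # union-find forest over record indexes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    used = 0
    for i, j in hint:
        if used >= budget:
            break
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already clustered together (assumes transitivity)
        used += 1
        if is_match(records[i], records[j]):
            parent[rj] = ri

    clusters = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    people = ["john smith", "john smith jr", "mary jones",
              "jon smith", "mary jones phd"]
    hint = build_hint(people, jaccard)
    # Stand-in for an expensive comparison; the 0.5 threshold is arbitrary.
    same = lambda a, b: jaccard(a, b) >= 0.5
    print(resolve_with_budget(people, hint, same, budget=2))
```

With a budget of two comparisons, the sketch spends both on the two highest-ranked pairs rather than on pairs picked in arbitrary order, which is the pay-as-you-go behavior the abstract describes.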
INDEX TERMS
Erbium, Approximation algorithms, Partitioning algorithms, Clustering algorithms, Tin, Data structures, Companies, data cleaning, Entity resolution, pay-as-you-go
CITATION
Steven Euijong Whang, David Marmaros, Hector Garcia-Molina, "Pay-As-You-Go Entity Resolution", IEEE Transactions on Knowledge & Data Engineering, vol.25, no. 5, pp. 1111-1124, May 2013, doi:10.1109/TKDE.2012.43