Issue No.11 - Nov. (2012 vol.24)
pp: 2094-2108
Melanie Herschel , University of Tübingen, Tübingen
Felix Naumann , University of Potsdam, Potsdam
Sascha Szott , Konrad-Zuse-Zentrum für Informationstechnik, Berlin
Maik Taubert , Biotronik SE & Co. KG, Berlin
ABSTRACT
Duplicate detection determines different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (ddg) to large amounts of data and pairwise comparisons, using the support of a relational database management system. To this end, we first present a framework that generalizes the ddg process. We then present algorithms to scale ddg in space (amount of data processed with bounded main memory) and in time. Finally, we extend our framework to allow batched and parallel ddg, thus further improving efficiency. Experiments on data up to two orders of magnitude larger than data considered so far in ddg show that our methods achieve the goal of scaling ddg to large volumes of data.
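The idea that relationships among object representations can improve duplicate detection, and that this makes the process iterative, can be illustrated with a small in-memory sketch. It is not the framework or the RDBMS-backed algorithms of the paper: the function names (`find`, `similarity`, `detect_duplicates`), the equal weighting of attribute and relationship similarity, the 0.7 threshold, and the sample data are all assumptions chosen for illustration.

```python
from collections import deque
from difflib import SequenceMatcher
from itertools import combinations


def find(parent, x):
    """Return the cluster representative of x (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def similarity(a, b, text, related, parent, weight=0.5):
    """Combine attribute similarity with the overlap of related objects.

    Related objects are compared through their cluster representatives, so
    duplicates found earlier raise the similarity of pairs that refer to them.
    The 0.5 weight is an arbitrary assumption.
    """
    attr = SequenceMatcher(None, text[a], text[b]).ratio()
    na = {find(parent, n) for n in related[a]}
    nb = {find(parent, n) for n in related[b]}
    union = na | nb
    if not union:
        # Neither object has relationships: fall back to attributes alone.
        return attr
    rel = len(na & nb) / len(union)
    return weight * attr + (1 - weight) * rel


def detect_duplicates(text, related, threshold=0.7):
    """Iteratively classify candidate pairs; return sorted duplicate clusters."""
    parent = {n: n for n in text}
    # dependents[m] holds the objects whose similarity is influenced by m.
    dependents = {n: set() for n in text}
    for n, neighbors in related.items():
        for m in neighbors:
            dependents[m].add(n)

    queue = deque(combinations(sorted(text), 2))
    queued = set(queue)
    while queue:
        pair = queue.popleft()
        queued.discard(pair)
        a, b = pair
        if find(parent, a) == find(parent, b):
            continue
        if similarity(a, b, text, related, parent) >= threshold:
            parent[find(parent, a)] = find(parent, b)
            # A new duplicate may change the outcome for dependent pairs,
            # so compare those pairs again.
            for x in dependents[a]:
                for y in dependents[b]:
                    if x != y:
                        p = (x, y) if x < y else (y, x)
                        if p not in queued:
                            queue.append(p)
                            queued.add(p)

    clusters = {}
    for n in text:
        clusters.setdefault(find(parent, n), []).append(n)
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


# Hypothetical sample data: two movie records, each related to one person record.
text = {
    "movie1": "The Matrix",
    "movie2": "Matrix",
    "person1": "Keanu Reeves",
    "person2": "Keanu Reevs",
}
related = {
    "movie1": {"person1"},
    "movie2": {"person2"},
    "person1": set(),
    "person2": set(),
}
duplicates = detect_duplicates(text, related)
```

In this sample, the two movie records are compared first and rejected, because their related persons are not yet known to be the same. Once the two person records are classified as duplicates, the movie pair is queued again and then accepted. This re-comparison is what makes the number of pairwise comparisons grow, which is the cost the paper sets out to scale.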
INDEX TERMS
Motion pictures, Sorting, Databases, Scalability, Classification algorithms, Image edge detection, Runtime, parallelization, Duplicate detection, data cleaning, data integration, record linkage, entity resolution, scalability
CITATION
Melanie Herschel, Felix Naumann, Sascha Szott, Maik Taubert, "Scalable Iterative Graph Duplicate Detection", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 11, pp. 2094-2108, Nov. 2012, doi:10.1109/TKDE.2011.99