Issue No.10 - Oct. (2013 vol.25)
pp: 2217-2230
Chuitian Rong, Renmin University of China, Beijing
Wei Lu, National University of Singapore, Singapore
Xiaoli Wang, National University of Singapore, Singapore
Xiaoyong Du, Renmin University of China, Beijing
Yueguo Chen, Renmin University of China, Beijing
Anthony K.H. Tung, National University of Singapore, Singapore
ABSTRACT
The string similarity join is a basic operation in many applications that need to find all similar string pairs in a collection, given a similarity function and a user-specified threshold. Recently, there has been considerable interest in designing new algorithms that use an inverted index to support efficient string similarity joins. These algorithms typically adopt a two-step filter-and-refine approach to identify similar string pairs: 1) generating candidate pairs by traversing the inverted index; and 2) verifying the candidate pairs by computing their similarity. However, these algorithms either suffer from poor filtering power (which results in high verification cost) or incur too much computational cost to guarantee the filtering power. In this paper, we propose a multiple prefix filtering method based on different global orderings such that the number of candidate pairs can be reduced significantly. We also propose a parallel extension of the algorithm that is efficient and scalable in a MapReduce framework. We conduct extensive experiments on both centralized and Hadoop systems using both real and synthetic data sets, and the results show that our proposed approach outperforms existing approaches in both efficiency and scalability.
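The two-step filter-and-refine approach described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses a single global ordering (rarest token first) rather than the multiple orderings the paper proposes, and it assumes Jaccard similarity over whitespace-separated tokens with a threshold in (0, 1]. The function names `jaccard` and `prefix_filter_join` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(strings, threshold):
    """Return sorted index pairs (i, j), i < j, whose Jaccard similarity
    is at least `threshold` (assumed to lie in (0, 1])."""
    token_sets = [set(s.split()) for s in strings]

    # One global ordering (an assumption of this sketch): rarest tokens
    # first, ties broken alphabetically so the order is deterministic.
    freq = Counter(t for ts in token_sets for t in ts)
    ordered = [sorted(ts, key=lambda t: (freq[t], t)) for ts in token_sets]

    index = defaultdict(list)  # token -> ids of records with it in their prefix
    results = []
    for i, tokens in enumerate(ordered):
        n = len(tokens)
        # Jaccard prefix length: n - ceil(threshold * n) + 1.
        # The small epsilon guards against float round-up (e.g. 0.7 * 10).
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        prefix = tokens[:prefix_len]

        # Step 1 (filter): candidates share at least one prefix token.
        candidates = {j for t in prefix for j in index[t]}

        # Step 2 (refine): verify each candidate with the exact similarity.
        for j in sorted(candidates):
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                results.append((j, i))

        # Index this record's prefix tokens for later records to probe.
        for t in prefix:
            index[t].append(i)
    return sorted(results)
```

A pair is verified only if the two records share a token within their prefixes, so the cost of step 2 depends on how many candidates the filter lets through. That candidate count is the quantity the paper aims to reduce by combining prefixes from several global orderings.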
INDEX TERMS
Indexes, Pipeline processing, Algorithm design and analysis, Filtering, Educational institutions, XML, Transforms, Multiple filtering, MapReduce, Similarity join
CITATION
Chuitian Rong, Wei Lu, Xiaoli Wang, Xiaoyong Du, Yueguo Chen, Anthony K.H. Tung, "Efficient and Scalable Processing of String Similarity Join," IEEE Transactions on Knowledge & Data Engineering, vol. 25, no. 10, pp. 2217-2230, Oct. 2013, doi:10.1109/TKDE.2012.195