2014 IEEE 30th International Conference on Data Engineering (ICDE) (2014)
Chicago, IL, USA
March 31, 2014 to April 4, 2014
ISBN: 978-1-4799-2555-1
pp: 340-351
Dong Deng , Department of Computer Science, Tsinghua University, Beijing, China
Guoliang Li , Department of Computer Science, Tsinghua University, Beijing, China
Shuang Hao , Department of Computer Science, Tsinghua University, Beijing, China
Jiannan Wang , Department of Computer Science, Tsinghua University, Beijing, China
Jianhua Feng , Department of Computer Science, Tsinghua University, Beijing, China
ABSTRACT
String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions, and we use the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs, reducing their number from cubic to linear complexity without sacrificing pruning power. To improve performance, we incorporate “light-weight” filter units into the key-value pairs; these prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
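The general idea of a signature-based similarity join expressed as map, shuffle, and reduce steps can be sketched in a few lines of Python. This is only an illustrative, single-process sketch and not the MASSJOIN algorithm: it assumes each whitespace-separated token serves as a signature (a naive stand-in for the partition-based signature scheme named in the abstract), it omits the key-value merging and the light-weight filter units, and all function names are hypothetical. It also shows one example of each family of similarity function mentioned above: Jaccard (set-based) and edit distance (character-based).

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Set-based similarity: |a & b| / |a | b| (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def edit_distance(s, t):
    """Character-based measure: Levenshtein distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            # deletion, insertion, substitution (cost 0 if characters match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def map_phase(records):
    """Emit one (signature, record_id) pair per distinct token.

    Assumption: a token is used directly as a signature; the paper's
    partition-based signatures are not reproduced here.
    """
    for rid, text in records.items():
        for token in set(text.split()):
            yield token, rid


def shuffle(pairs):
    """Group record ids by signature, as the MapReduce shuffle would."""
    groups = defaultdict(set)
    for key, rid in pairs:
        groups[key].add(rid)
    return groups


def reduce_phase(groups, records, threshold):
    """Build candidate pairs per signature, then verify each with Jaccard."""
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))
    results = []
    for a, b in sorted(candidates):
        sim = jaccard(set(records[a].split()), set(records[b].split()))
        if sim >= threshold:
            results.append((a, b, sim))
    return results


def similarity_join(records, threshold):
    """Self-join: return (id_a, id_b, similarity) for pairs meeting the threshold."""
    return reduce_phase(shuffle(map_phase(records)), records, threshold)
```

For example, joining the records `{1: "scalable string similarity join", 2: "string similarity join", 3: "data integration"}` with a threshold of 0.7 yields the single pair (1, 2) with Jaccard similarity 0.75. Because two records with nonzero Jaccard similarity share at least one token, the token signature misses no result for any positive threshold, but it generates far more key-value pairs than a tuned signature scheme would.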
INDEX TERMS
Open systems, Filtering, Erbium
CITATION

D. Deng, G. Li, S. Hao, J. Wang and J. Feng, "MassJoin: A mapreduce-based method for scalable string similarity joins," 2014 IEEE 30th International Conference on Data Engineering (ICDE), Chicago, IL, USA, 2014, pp. 340-351.
doi:10.1109/ICDE.2014.6816663