Issue No. 06 - June (2016 vol. 27)
ISSN: 1045-9219
pp: 1660-1673
Jin Huang, Department of Information and Computing Systems, University of Melbourne, Melbourne, VIC, Australia
Rui Zhang, Department of Information and Computing Systems, University of Melbourne, Melbourne, VIC, Australia
Rajkumar Buyya, Department of Information and Computing Systems, University of Melbourne, Melbourne, VIC, Australia
Jian Chen, School of Software Engineering, South China University of Technology, Guangzhou, China
Yongwei Wu, Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNLIST), Tsinghua University, Beijing, China
ABSTRACT
The Earth Mover's Distance (EMD) similarity join has a number of important applications, such as near-duplicate image retrieval and distribution-based pattern analysis. However, the computational cost of EMD is super-cubic, and consequently the EMD similarity join operation is prohibitive for datasets of even medium size. We propose to employ the Hadoop platform to speed up the operation. Simply porting the state-of-the-art metric distance similarity join algorithms to Hadoop is inefficient because they involve excessive distance computations and are vulnerable to skewed data distributions. We propose a novel framework, named Heads-Join, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at a low cost, because computing these EMD lower bounds has constant or linear complexity. We investigate both range and top-k joins, and design efficient algorithms on three popular Hadoop computation paradigms, i.e., MapReduce, Bulk Synchronous Parallel, and Spark. We conduct extensive experiments on both real and synthetic datasets. The results show that Heads-Join outperforms the state-of-the-art metric similarity join technique, i.e., Quickjoin, by up to an order of magnitude and scales out well.
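The idea of pruning with a cheap EMD lower bound can be illustrated with a minimal sketch. This is not the Heads-Join algorithm and does not use Hadoop. It assumes 1-D histograms of equal total mass with ground distance |i - j| between bins, for which the exact EMD reduces to a sum of absolute CDF differences, and it uses the centroid bound as one well-known example of a linear-time lower bound. All function names are hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Exact EMD for two 1-D histograms of equal total mass.

    Assumption: ground distance between bins i and j is |i - j|, so the
    EMD equals the sum of absolute differences of the two CDFs.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def centroid_lower_bound(p, q):
    """Linear-time lower bound on emd_1d: distance between the centroids."""
    mean_p = sum(i * w for i, w in enumerate(p))
    mean_q = sum(i * w for i, w in enumerate(q))
    return abs(mean_p - mean_q)


def range_join(left, right, eps):
    """Return index pairs (i, j) with EMD <= eps, plus the number of pairs pruned.

    A pair whose lower bound already exceeds eps cannot qualify, so the
    exact distance is computed only for the pairs that survive the filter.
    """
    results, pruned = [], 0
    for i, p in enumerate(left):
        for j, q in enumerate(right):
            if centroid_lower_bound(p, q) > eps:
                pruned += 1
                continue
            if emd_1d(p, q) <= eps:
                results.append((i, j))
    return results, pruned
```

In this toy setting the exact distance is itself cheap, so the filter saves nothing. The sketch only shows the filter-then-verify structure that becomes worthwhile when the exact EMD is expensive, as it is for general histograms.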
INDEX TERMS
Histograms, Sparks, Upper bound, Transforms, Algorithm design and analysis, Earth, Approximation error
CITATION

J. Huang, R. Zhang, R. Buyya, J. Chen and Y. Wu, "Heads-Join: Efficient Earth Mover's Distance Similarity Joins on Hadoop," in IEEE Transactions on Parallel & Distributed Systems, vol. 27, no. 6, pp. 1660-1673, 2016.
doi:10.1109/TPDS.2015.2462354