2014 IEEE 30th International Conference on Data Engineering (ICDE) (2014)
Chicago, IL, USA
March 31, 2014 to April 4, 2014
ISBN: 978-1-4799-2555-1
pp: 808-819
Jin Huang, Department of Computing and Information Systems, University of Melbourne, Victoria, Australia
Rui Zhang, Department of Computing and Information Systems, University of Melbourne, Victoria, Australia
Rajkumar Buyya, Department of Computing and Information Systems, University of Melbourne, Victoria, Australia
Jian Chen, South China University of Technology, Guangzhou, China
ABSTRACT
The Earth Mover's Distance (EMD) similarity join retrieves pairs of records with an EMD below a given threshold. It has a number of important applications, such as near-duplicate image retrieval and pattern analysis in probabilistic datasets. However, the computational cost of EMD is super-cubic in the number of bins in the histograms used to represent the data objects. Consequently, the EMD similarity join is prohibitively expensive for large datasets. This is the first paper to specifically address the EMD similarity join, and we propose using MapReduce to tackle the problem. The MapReduce algorithms designed for generic metric distance similarity joins are inefficient for the EMD similarity join because they involve a large number of distance computations and produce unbalanced workloads on reducers when dealing with skewed datasets. We propose a novel framework, named Melody-Join, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at low cost, because computing these EMD lower bounds has constant complexity. Furthermore, we address two key problems, limited pruning power and unbalanced workloads, by enhancing each phase of the Melody-Join framework. We conduct extensive experiments on real datasets. The results show that Melody-Join outperforms the state-of-the-art technique by an order of magnitude, scales up better on large datasets, and scales out well on distributed machines.
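The general idea of pruning with an EMD lower bound can be illustrated with a minimal single-machine sketch. This is not the Melody-Join algorithm and does not use MapReduce; it assumes normalized one-dimensional histograms with ground distance |i - j| between bins, for which the exact EMD has a simple closed form, and it uses the well-known centroid lower bound as a stand-in for the lower bounds the paper uses. All function names are hypothetical, and the join keeps pairs with EMD at or below the threshold as an arbitrary choice.

```python
from itertools import combinations


def emd_1d(p, q):
    """Exact EMD between two normalized 1-D histograms.

    Assumption: ground distance between bins i and j is |i - j| and both
    histograms sum to 1, so EMD equals the L1 distance between their CDFs.
    """
    total, carry = 0.0, 0.0
    for a, b in zip(p, q):
        carry += a - b          # running difference of the two CDFs
        total += abs(carry)     # mass that must cross this bin boundary
    return total


def centroid_lower_bound(p, q):
    """Cheap lower bound: distance between the histogram centroids."""
    mean_p = sum(i * m for i, m in enumerate(p))
    mean_q = sum(i * m for i, m in enumerate(q))
    return abs(mean_p - mean_q)


def emd_similarity_join(records, threshold):
    """Filter-and-refine self-join (illustrative only).

    Returns the qualifying index pairs and the number of exact EMD
    computations that the lower-bound filter could not avoid.
    """
    results, exact_calls = [], 0
    for (i, p), (j, q) in combinations(enumerate(records), 2):
        if centroid_lower_bound(p, q) > threshold:
            continue            # lower bound exceeds threshold: safe to prune
        exact_calls += 1
        if emd_1d(p, q) <= threshold:
            results.append((i, j))
    return results, exact_calls


records = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0, 0.0],
]
results, exact_calls = emd_similarity_join(records, threshold=1.0)
```

In this toy input, three of the six candidate pairs are discarded by the lower bound alone, so only three exact EMD computations are needed. For general multi-dimensional histograms the exact EMD requires solving a transportation problem, which is where cheap lower bounds matter most.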
INDEX TERMS
Histograms, Transforms, Approximation error, Aggregates, Vectors, Earth
CITATION

J. Huang, R. Zhang, R. Buyya and J. Chen, "MELODY-JOIN: Efficient Earth Mover's Distance similarity joins using MapReduce," 2014 IEEE 30th International Conference on Data Engineering (ICDE), Chicago, IL, USA, 2014, pp. 808-819.
doi:10.1109/ICDE.2014.6816702