The Community for Technology Leaders
Green Image
<p>A parallel sort-merge-join algorithm which uses a divide-and-conquer approach to address the data skew problem is proposed. The proposed algorithm adds an extra, low-cost scheduling phase to the usual sort, transfer, and join phases. During the schedulingphase, a parallelizable optimization algorithm, using the output of the sort phase,attempts to balance the load across the multiple processors in the subsequent joinphase. The algorithm naturally identifies the largest skew elements, and assigns each ofthem to an optimal number of processors. Assuming a Zipf-like distribution of data skew,the algorithm is demonstrated to achieve very good load balancing for the join phase, andis shown to be very robust relative, among other things, to the degree of data skew andthe total number of processors.</p>
Index Termsdata skew management; transfer phase; sort phase; parallel sort merge join algorithm;divide-and-conquer; scheduling phase; join phases; parallelizable optimization algorithm;multiple processors; Zipf-like distribution; load balancing; distributed databases; merging; parallel algorithms; relational algebra; relational databases; sorting
D.M. Dias, J.L. Wolf, P.S. Yu, "A Parallel Sort Merge Join Algorithm for Managing Data Skew", IEEE Transactions on Parallel & Distributed Systems, vol. 4, no. , pp. 70-86, January 1993, doi:10.1109/71.205654
49 ms
(Ver 3.3 (11022016))