Issue No. 12 - December (1993 vol. 4)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/71.250117
<p>Presents a parallel hash join algorithm that is based on the concept of hierarchicalhashing, to address the problem of data skew. The proposed algorithm splits the usualhash phase into a hash phase and an explicit transfer phase, and adds an extrascheduling phase between these two. During the scheduling phase, a heuristicoptimization algorithm, using the output of the hash phase, attempts to balance the loadacross the multiple processors in the subsequent join phase. The algorithm naturallyidentifies the hash partitions with the largest skew values and splits them as necessary,assigning each of them to an optimal number of processors. Assuming for concreteness aZipf-like distribution of the values in the join column, a join phase which is CPU-bound,and a shared nothing environment, the algorithm is shown to achieve good join phaseload balancing, and to be robust relative to the degree of data skew and the totalnumber of processors. The overall speedup due to this algorithm is compared to someexisting parallel hash join methods. The proposed method does considerably better in high skew situations.</p>
Index Termsparallel hash join algorithm; data skew; hierarchical hashing; scheduling; heuristicoptimization; load balancing; Zipf-like distribution; join column; combinatorial optimization; complex queries; relational databases; hash joins; database theory; parallel algorithms; query processing; relational databases; resource allocation
J. Turek, J. Wolf, D. Dias and P. Yu, "A Parallel Hash Join Algorithm for Managing Data Skew," in IEEE Transactions on Parallel & Distributed Systems, vol. 4, no. , pp. 1355-1371, 1993.