New Algorithms for Parallelizing Relational Database Joins in the Presence of Data Skew
December 1994 (vol. 6 no. 6)
pp. 990-997

Parallel processing is an attractive option for relational database systems. As in any parallel environment, however, load balancing is a critical issue that affects overall performance. For one common database operation in particular, the join of two relations, conventional parallel algorithms can balance load poorly because of a natural phenomenon known as data skew. In a pair of recent papers (J. Wolf et al., 1993; 1993), we described two new join algorithms designed to address the data skew problem. Here we propose significant improvements to both algorithms, increasing their effectiveness while decreasing their execution times. The paper then compares the performance of the improved algorithms with that of their more conventional counterparts. The new algorithms outperform the conventional ones in the presence of almost any skew, and dramatically so when skew is high.
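
To make the load-balancing problem concrete, the following minimal Python sketch shows how a skewed join column can overload one processor under plain hash partitioning. It is an illustration only, not one of the algorithms from the paper: the function names `partition_loads` and `imbalance`, the integer keys, and the use of `key % num_procs` as the hash function are all assumptions made for the example.

```python
from collections import Counter


def partition_loads(r_keys, s_keys, num_procs):
    """Join output size per processor when tuples are routed by key % num_procs.

    Assumption: join keys are non-negative integers and the "hash" is a
    simple modulo. This models a conventional hash-partitioned join only.
    """
    r_counts = Counter(r_keys)
    s_counts = Counter(s_keys)
    loads = [0] * num_procs
    for key, r_n in r_counts.items():
        # All tuples with the same key land on one processor, which must
        # produce r_n * s_n output tuples for that key.
        loads[key % num_procs] += r_n * s_counts.get(key, 0)
    return loads


def imbalance(loads):
    """Ratio of the busiest processor's load to the average load."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Two toy join columns of 32 values each (hypothetical data).
uniform = list(range(8)) * 4             # every key appears 4 times
skewed = [0] * 25 + list(range(1, 8))    # key 0 dominates the column

for name, keys in (("uniform", uniform), ("skewed", skewed)):
    loads = partition_loads(keys, keys, num_procs=4)
    print(name, loads, round(imbalance(loads), 2))
```

With these toy inputs the uniform column places equal join work on every processor, while in the skewed column the processor that owns key 0 does almost all of it, so the whole join finishes only when that one processor does.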

[1] S. Lakshmi and P. Yu, "Effectiveness of parallel joins," IEEE Trans. Knowledge and Data Eng., vol. 2, no. 4, pp. 410-424, 1990.
[2] C. Lynch, "Selectivity estimation and query optimization in large databases with highly skewed distributions of column values," in Proc. 14th Int. Conf. Very Large Databases, 1988, pp. 240-251.
[3] J. Wolf, P. Yu, J. Turek, and D. Dias, "A parallel hash join algorithm for managing data skew," IEEE Trans. Parallel and Distributed Syst., vol. 4, no. 12, pp. 1355-1371, Dec. 1993.
[4] J. Wolf, D. Dias, and P. Yu, "A parallel sort merge join algorithm for managing data skew," IEEE Trans. Parallel and Distributed Syst., vol. 4, no. 1, pp. 70-86, Jan. 1993.
[5] Z. Galil and N. Megiddo, "A fast selection algorithm and the problem of optimum distribution of effort," J. ACM, vol. 26, pp. 58-64, 1979.
[6] D. E. Knuth, The Art of Computer Programming, Vol. 3, Reading, MA: Addison-Wesley, 1973.
[7] J. Wolf, D. Dias, P. Yu, and J. Turek, "Algorithms for parallelizing relational database joins in the presence of data skew," IBM RC19236, Oct. 1993, to appear.
[8] R. Graham, "Bounds on multiprocessing timing anomalies," SIAM J. Appl. Math., vol. 17, pp. 416-429, 1969.
[9] E. Coffman, M. Garey, and D. Johnson, "An application of bin packing to multiprocessor scheduling," SIAM J. Comput., vol. 7, pp. 1-17, 1978.

Index Terms:
relational databases; relational algebra; resource allocation; parallel algorithms; parallel programming; relational database joins; data skew; parallel processing; relational database systems; parallel environment; load balancing; common database operation; join algorithms; comparative performance
Citation:
J.L. Wolf, D.M. Dias, P.S. Yu, J. Turek, "New Algorithms for Parallelizing Relational Database Joins in the Presence of Data Skew," IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 6, pp. 990-997, Dec. 1994, doi:10.1109/69.334888