
This Article  
 
Share  
Bibliographic References  
Add to:  
Digg Furl Spurl Blink Simpy Del.icio.us Y!MyWeb  
Search  
 
ASCII Text  x  
J.L. Wolf, D.M. Dias, P.S. Yu, "A Parallel Sort Merge Join Algorithm for Managing Data Skew," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 1, pp. 7086, January, 1993.  
BibTex  x  
@article{ 10.1109/71.205654, author = {J.L. Wolf and D.M. Dias and P.S. Yu}, title = {A Parallel Sort Merge Join Algorithm for Managing Data Skew}, journal ={IEEE Transactions on Parallel and Distributed Systems}, volume = {4}, number = {1}, issn = {10459219}, year = {1993}, pages = {7086}, doi = {http://doi.ieeecomputersociety.org/10.1109/71.205654}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, }  
RefWorks Procite/RefMan/Endnote  x  
TY  JOUR JO  IEEE Transactions on Parallel and Distributed Systems TI  A Parallel Sort Merge Join Algorithm for Managing Data Skew IS  1 SN  10459219 SP70 EP86 EPD  7086 A1  J.L. Wolf, A1  D.M. Dias, A1  P.S. Yu, PY  1993 KW  Index Termsdata skew management; transfer phase; sort phase; parallel sort merge join algorithm;divideandconquer; scheduling phase; join phases; parallelizable optimization algorithm;multiple processors; Zipflike distribution; load balancing; distributed databases; merging; parallel algorithms; relational algebra; relational databases; sorting VL  4 JA  IEEE Transactions on Parallel and Distributed Systems ER   
A parallel sortmergejoin algorithm which uses a divideandconquer approach to address the data skew problem is proposed. The proposed algorithm adds an extra, lowcost scheduling phase to the usual sort, transfer, and join phases. During the schedulingphase, a parallelizable optimization algorithm, using the output of the sort phase,attempts to balance the load across the multiple processors in the subsequent joinphase. The algorithm naturally identifies the largest skew elements, and assigns each ofthem to an optimal number of processors. Assuming a Zipflike distribution of data skew,the algorithm is demonstrated to achieve very good load balancing for the join phase, andis shown to be very robust relative, among other things, to the degree of data skew andthe total number of processors.
[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman,The Design and Analysis of Computer Algorithms. Menlo Park, CA: AddisonWesley, 1974.
[2] S. G. Akl and N. Santoro, "Optimal parallel merging and sorting without memory conflicts,"IEEE Trans. Comput., vol. C36, pp. 13671369, 1987.
[3] L. Bic and R. L. Hartman, "Hither hundreds of processors in a database machine," inProc. 1985 Int. Workshop Database Machines, SpringerVerlag, 1985.
[4] M. Blasgen and K. Eswaran, "Storage and access in relational databases,"IBM Syst. J., vol. 4, p. 363, 1977.
[5] M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan, "Time bounds for selection,"J. Comput. Syst. Sci., vol. 7, pp. 448461, 1972.
[6] H. Boral, W. Alexander, L. Clay, G. Copeland, S. Danforth, M. Franklin, B. Hart, M. Smith, and P. Valduriez, "Prototyping Bubba, A highly parallel database system,"IEEE Trans. Knowledge Data Eng., vol. 2, no. 1, pp. 424, Mar. 1990.
[7] S. Christodoulakis "Estimating record selectivities,"Inform. Syst., vol. 8, no. 2, pp. 105115, 1983.
[8] E. G. Coffman and P. J. Denning,Operating Systems Theory. Englewood Cliffs, NJ: PrenticeHall, 1973.
[9] E. Coffman, M. Garey, and D. S. Johnson, "An application of bin packing to multiprocessor scheduling,"SIAM J. Comput., vol. 7, pp. 117, 1978.
[10] D. W. Cornell, D. M. Dias, and P. S. Yu, "On multisystem coupling through function request shipping,"IEEE Trans. Software Eng., vol. SE12, no. 10, pp. 10061017, Oct. 1986.
[11] S. A. Demurjian, D. K. Hsiao, D. S. Kerr, J. Menon, P. R. Strawser, R. C. Tekampe, J. Trimble, and R. J. Watson, "Performance evaluation of a database system in multiple backend configurations," inProc. 1985 Int. Workshop Database Machines, SpringerVerlag, 1985.
[12] D. J. Dewitt and R. H. Gerber "Multiprocessor hashbased join algorithms," inProc. 11th Int. Conf. Very Large Databases, 1985.
[13] D. J. Dewitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.I. Hsiao, and R. Rasmussen, "The GAMMA database machine project,"IEEE Trans. Knowledge Data Eng., vol. 2, no. 1, pp. 4462, Mar. 1990.
[14] D. J. DeWitt, M. Smith, and H. Boral, "A singleuser performance evaluation of the Teradata database machine," MCC Tech. Rep. DB 08187, 1987.
[15] G. Frederickson and D. B. Johnson, "The complexity of selection and ranking in X+Y and matrices with sorted columns,"J. Comput. Syst. Sci., vol. 24, pp. 197209, 1982.
[16] G. Frederickson and D. B. Johnson, "Generalized selection and ranking: Sorted matrices,"SIAM J. Comput., vol. 13, pp. 1430, 1984.
[17] Z. Galil and N. Megiddo, "A fast selection algorithm and the problem of optimum distribution of effort,"J. ACM, vol. 26, pp. 5864, 1979.
[18] R. Graham "Bounds on multiprocessing timing anomalies,"SIAM J. Appl. Mathemat., vol. 17, no. 2, pp. 416429, 1969.
[19] D. Hsiao,Advanced Database Machine Architecture. Englewood Cliffs, NJ: PrenticeHall, 1983.
[20] T. Ibaraki and N. Katoh,Resource Allocation Problems. Cambridge, MA: MIT Press, 1988.
[21] B. R. Iyer and D. M. Dias, "System issues in parallel sorting for database systems," inProc. 6th Int. Conf. Data Eng., 1988.
[22] B. R. Iyer, G. R. Ricard, and P. J. Varman, "Percentile finding algorithm for multiple sorted runs," inProc. 15th Int. Conf. Very Large Databases, 1989.
[23] W. Kim, "A new way to compute the product and join of relations," inACM SIGMOD Conf. Proc., Santa Monica, CA, 1980, pp. 179 187.
[24] M. Kitsuregawa, H. Tanaka, and T. Motooka, "Application of hash to data base machine and its architecture,"New Generation Comput., vol. 1, no. 1, 1983.
[25] D. E. Knuth,The Art of Computer Programming, Vol. 3, Reading, MA: AddisonWesley, 1973.
[26] S. Lakshmi and P. Yu, "Analysis of effectiveness of parallel processing in database systems,"Comput. Syst. Sci. Eng., vol. 5, no. 2, pp. 7381, 1990.
[27] S. Lakshmi and P. S. Yu, "Effectiveness of parallel joins,"IEEE Trans. Knowledge Data Eng., vol. 2, no. 4, pp. 410424, 1990.
[28] C. Lynch, "Selectivity estimation and query optimization in large databases with highly skewed distributions of column values," inProc. 14th Int. Conf. Very Large Databases, 1988, pp. 240251.
[29] A. Y. Montgomery, D. J. D'Souza, and S. B. Lee, "The cost of relational algebraic operations on skewed data: Estimates and experiments," inInform. Processing 83, IFIP, 1983.
[30] P. M. Neches and J. E. Shemer, "The genesis of a database computer,"IEEE Comput. Mag., vol. 17, no. 11, pp. 4256, 1984.
[31] E. Ozkarahan,Database Machines and Database Management. Englewood Cliffs, NJ: PrenticeHall, 1986.
[32] G. Z. Qadah, "The equijoin operation on a multiprocessor database machine: Algorithms and the evaluation of their performance," inProc. 1985 Int. Workshop Database Machines, SpringerVerlag, 1985, pp. 3567.
[33] S. Salza, M. Terranova, and P. Velardi, "Performance modeling of the DBMAC architecture," inProc. 1983 Int. Workshop Database Machines, SpringerVerlag, 1983, pp. 7490.
[34] D. Schneider and D. Dewitt, "A performance evaluation of four parallel join algorithms in a sharednothing multiprocessor environment," inProc. ACM SIGMOD Conf.(Portland, OR), MayJune 1989, p. 110.
[35] M. Stonebraker, "The case for shared nothing,"IEEE Database Eng., vol. 9, no. 1, 1986.
[36] A. N. Tantawi, D. Towsley, and J. L. Wolf, "An algorithm for a classconstrained resource allocation problem, " inProc. 1988 ACM Sigmetrics Conf., May 1988, pp. 253260.
[37] P. Valduriez and G. Gardarin, "Join and semijoin algorithms for a multiprocessor database machine,"ACM Trans. Database Syst., vol. 9, no. 1, pp. 133161, 1984.
[38] J. I. Wolf, D. M. Dias, P. S. Yu, and J Turek, "An efficient algorithm for parallelizing hash joins in presence of data skew," inProc. Int. Conf. Data Eng., pp. 200209, Kobe, Japan, Apr. 1991.
[39] J. L. Wolf, B. R. Iyer, K. R. Pattipati, and J. Turek, "Optimal buffer partitioning for the nested block join algorithm," inProc. Seventh Int. Conf. Data Eng., Kobe, Japan, Apr. 1991, pp. 510519.
[40] J. Wolf, D. Dias, P. Yu, and J. Turek, "Comparative performance of parallel join algorithms," inProc. 1st Int. Conf. Parallel Distributed Inform. Syst., pp. 7888.
[41] G. K. Zipf,Human Behavior and the Principle of Least Effort. Reading, MA: AddisonWesley, 1949.