This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Parallel Execution of Hash Joins in Parallel Databases
August 1997 (vol. 8 no. 8)
pp. 872-883

Abstract—In this paper, we explore two important issues, processor allocation and the use of hash filters, to improve the parallel execution of hash joins. To exploit the opportunity of pipelining for hash join execution, a scheme to transform a bushy execution tree to an allocation tree is first devised. In an allocation tree, each node denotes a pipeline. Then, using the concept of synchronous execution time, processors are allocated to the nodes in the allocation tree in such a way that inner relations in a pipeline can be made available at approximately the same time. Also, the approach of hash filtering is investigated to further improve the parallel execution of hash joins. Extensive performance studies are conducted via simulation to demonstrate the importance of processor allocation and to evaluate various schemes using hash filters. It is experimentally shown that processor allocation is, in general ,the dominant factor to performance, and the effect of hash filtering becomes more prominent as the number of relations in a query increases.

[1] C. Baru et al., "An Overview of DB2 Parallel Edition," Proc. ACM SIGMOD, pp. 460-462, May 1995.
[2] D. Bitton and J. Gray, “Disk Shadowing,” Very Large Data Bases, pp. 331–338, 1988.
[3] H. Boral,W. Alexander,L. Clay,G. Copeland,S. Danforth,M. Franklin,B. Hart,M. Smith,, and P. Valduriez,“Prototyping Bubba, a highly parallel database system,” IEEE Trans. on Knowledge and Data Engineering, vol. 2, no. 1, pp. 4-24, Mar. 1990.
[4] M.-S. Chen, H.-I. Hsiao, and P.S. Yu, "Applying Hash Filerts to Improving the Execution of Bushy Trees," Proc. 19th Int'l Conf. Very Large Data Bases, pp. 505-516, Aug. 1993.
[5] M.S. Chen, M.L. Lo, P.S. Yu, and H.E. Young, “Applying Segmented Right-Deep Trees to Pipelining Multiple Hash Joins,” IEEE Trans. Knowledge and Data Eng., vol. 7, no. 4, Aug. 1995.
[6] M.-S. Chen,P.S. Yu,, and K.-L. Wu,“Scheduling and processor allocation for parallel execution of multi-join queries,” Proc. Eighth Int’l Conf. Data Engineering, pp. 58-67, Feb. 1992.
[7] D.J. DeWitt and R. Gerber, "Multiprocessor Hash-Based Join Algorithms," Proc. 11th Int'l Conf. Very Large Data Bases, pp. 151-162, Aug. 1985.
[8] D.J. DeWitt,S. Ghandeharizadeh,D.A. Schneider,A. Bricker,H.I. Hsiao,, and R. Rasmussen,“The gamma database machine project,” IEEE Trans. on Knowledge and Data Engineering, vol. 2, no. 1, pp. 44-62, Mar. 1990.
[9] D. DeWitt and J. Gray, “Parallel Database Systems: The Future of High-Performance Database Systems,” Comm. ACM, Vol. 35, No. 6, June 1992, pp. 85-98.
[10] D. Gardy and C. Puech, “On the Effect of Join Operations on Relation Sizes,” ACM Trans. Database Systems, vol. 14, no. 4, pp. 574-603, Dec. 1989.
[11] B. Gerber, "Informix Online XPS," Proc. ACM SIGMOD, p. 463, May 1995.
[12] R. Gerber, "Dataflow Query Processing Using Multiprocessor Hash-Partitioned Algorithms," Technical Report 672, Computer Science Dept., Univ. of Wisconsin-Madison, Oct. 1986.
[13] W. Hong,“Exploiting interoperator parallelism in XPRS,”inProc. ACM SIGMOD, San Diego, CA, June 1992, pp. 19–28.
[14] W. Hong and M. Stonebraker,“Optimization of parallel query execution plans in XPRS,” Proc. First Conf. Parallel and Distributed Information Systems, pp. 218-225, Dec. 1991.
[15] B. Meyer, Reusable Software: The Base Object-Oriented Component Libraries, Prentice Hall, 1994.
[16] Y.E. Ioannidis and Y.C. Kang,“Left-deep vs. bushy trees: An analysis of strategy spaces and its implication for query optimization,” Proc. ACM-SIGMOD Conf., vol. 20, pp. 168-177, 1991.
[17] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka, "Architecture and Performance of Relational Algebra Machine GRACE," Proc. Int'l Conf. Parallel Processing, pp. 241-250, Aug. 1984.
[18] R. Krishnamurthy, H. Boral, and C. Zaniolo,“Optimization of nonrecursive queries,”inProc. 12th Int. Conf. Very Large Databases, Kyoto, Japan, Aug. 1986, pp. 128–137.
[19] M.-L. Lo, M.-S. Chen, C. V. Ravishankar, and P. S. Yu,“On optimal processor allocation to support pipelined hash joins,”inProc. ACM SIGMOD, May 1993, pp. 69–78.
[20] R.A. Lorie,J.-J. Daudenarde,J.W. Stamos,, and H.C. Young,“Exploiting database parallelism in a message-passing multiprocessor,” IBM J. of Research and Development, vol. 35, nos. 5/6, pp. 681-695, Sept./Nov. 1991.
[21] H. Lu, M.-C. Shan, and K.-L. Tan,“Optimization of multi-way join queries for parallel execution,”inProc. 17th Int. Conf. Very Large Databases, Barcelona, Spain, Sept. 1991, pp. 549–560.
[22] H. Lu,K. Tan,, and M. Shan,“Hash-based join algorithms for multiprocessor computers with shared memory,” Proc. 16th Int’l Conf. Very Large Data Bases, pp. 198-208, 1990.
[23] P. Mishra and M.H. Eich, "Join Processing in Relational Databases," ACM Computing Surveys, vol. 24, no. 1, pp. 64-113, Mar. 1992.
[24] E. Omiecinski and E.T. Lin, “The Adaptive-Hash Join Algorithms for a Hypercube Multicomputer,” IEEE Trans. Parallel and Distributed Systems, 1992.
[25] H. Pirahesh,C. Mohan,J. Cheng,T.S. Liu,, and P. Selinger,“Parallelism in relational data base systems: Architectural issues and design approaches,” Proc. Second Int’l Symp. Databases in Parallel and Distributed Systems, pp. 4-29, July 1990.
[26] N. Roussopoulos and H. Kang, “A Pipeline N-Way Join Algorithm Based on the 2-Way Semijoin Program,” IEEE Trans. Knowledge and Data Eng,m vol. 3, no. 4, pp. 461-473, Dec. 1991.
[27] D. Schneider, "Complex Query Processing in Multiprocessor Database Machines," Technical Report 965, Computer Science Dept., Univ. of Wisconsin-Madison, Sept. 1990.
[28] D. Schneider and D. DeWitt, “A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment,” ACM SIGMOD Record, vol. 18, no. 2, pp. 110-121, June 1989.
[29] D.A. Schneider and D.J. DeWitt, “Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines,” Proc. 16th Int'l Conf. Very Large Data Bases, pp. 469–480, Aug. 1990.
[30] P. Selinger,D. Astrahan,D. Chamberlin,R. Lorie,, and T. Price,“Access path selection in a relational database management system,” Proc. 1979 ACM-SIGMOD Int’l Conf. Management of Data, pp. 23-34,Boston, May 1979.
[31] E. Shekita, H. C. Young, and K. Tan,“Multijoin optimization for symmetric multiprocessors,”inProc. 19th Int. Conf. Very Large Databases, Aug. 1993, pp. 479–492.
[32] M. Stonebraker,R. Katz,D. Patterson,, and J. Ousterhout,“The design of XPRS,” Proc. 14th Int’l Conf. Very Large Data Bases, pp. 318-330, 1988.
[33] A. Swami,“Optimization of large join queries: Combining heuristics with combinatorial techniques,”inProc. ACM SIGMOD, Chicago, IL, June 1989, pp. 367–376.
[34] A. Swami and A. Gupta,“Optimization of large join queries,” Proc. ACM-SIGMOD Conf., pp. 8-17, 1988.
[35] A. Wilschut and P. Apers,“Dataflow query execution in parallel main-memory environment,” Proc. First Conf. Parallel and Distributed Information Systems, pp. 68-77, Dec. 1991.
[36] A.N. Wilschut, J. Flokstra, and P.M.G. Apers, "Parallel Evaluation of Multi-Join Queries," Proc. ACM SIGMOD, pp. 115-126, May 1995.
[37] J.L. Wolf, P.S. Yu, J. Turek, and D.M. Dias, “A Parallel Hash Join Algorithm for Managing Data Skew,” IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 12, 1993.
[38] P.S. Yu,M.-S. Chen,H. Heiss,, and S.H. Lee,“On workload characterization of relational database environments,” IEEE Trans on Software Engineering, vol. 18, no. 4, pp. 347-355, Apr. 1992.

Index Terms:
Hash filters, pipelining, bushy trees, hash joins.
Citation:
Hui-I Hsiao, Ming-Syan Chen, Philip S. Yu, "Parallel Execution of Hash Joins in Parallel Databases," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 8, pp. 872-883, Aug. 1997, doi:10.1109/71.605772
Usage of this product signifies your acceptance of the Terms of Use.