Effectiveness of Parallel Joins
December 1990 (vol. 2 no. 4)
pp. 410-424

The effectiveness of parallel processing of relational join operations is examined. The skew in the distribution of join attribute values and the stochastic nature of the task processing times are identified as the major factors that can affect the effective exploitation of parallelism. Expressions for the execution time of parallel hash join and semijoin are derived and their effectiveness is analyzed. When many small processors are used in the parallel architecture, skew can turn some processors into bottlenecks while others are underutilized. Even in the absence of skew, variations in the processing times of the parallel tasks belonging to a query can lead to high task synchronization delay and limit the maximum speedup achievable through parallel execution. For example, when the task processing time on each processor is exponentially distributed with the same mean, the speedup is proportional to P/ln(P), where P is the number of processors. Other factors, such as memory size and communication bandwidth, can lower the speedup further. These effects are quantified using analytical models.
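
The P/ln(P) figure can be illustrated with a short calculation. The sketch below is not taken from the paper. It assumes a simplified model in which a query is split into P tasks, one per processor, with independent exponentially distributed processing times of equal mean, and in which the query finishes only when the slowest task finishes. Under that assumption the expected parallel time is the expected maximum of P exponentials, which equals H_P times the mean, where H_P = 1 + 1/2 + ... + 1/P is approximately ln(P) + 0.577. The sequential time is P times the mean, so the speedup is P/H_P, which grows like P/ln(P). The function names are hypothetical and chosen only for this illustration.

```python
import math
import random


def harmonic(p: int) -> float:
    """Return the P-th harmonic number H_P = 1 + 1/2 + ... + 1/P."""
    return sum(1.0 / k for k in range(1, p + 1))


def expected_speedup(p: int) -> float:
    """Speedup under the assumed model (one exponential task per processor).

    Sequential time is P * mean; parallel time is the expected maximum of
    P i.i.d. exponentials, H_P * mean. The mean cancels, leaving P / H_P.
    """
    return p / harmonic(p)


def simulated_max(p: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of P unit-mean exponential times]."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(p))
    return total / trials


if __name__ == "__main__":
    # Compare the exact P / H_P with the asymptotic form P / ln(P).
    for p in (4, 16, 64, 256):
        print(p, round(expected_speedup(p), 2), round(p / math.log(p), 2))
```

Running the script prints, for each P, the speedup P/H_P next to P/ln(P). Both fall well short of the ideal speedup of P, which is the synchronization penalty the abstract describes. Skew, memory size, and communication bandwidth are deliberately left out of this sketch.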

Index Terms:
parallel processing; relational join operations; skew; distribution; join attribute values; stochastic nature; task processing times; execution time; parallel hash join; semijoin; small processors; parallel architecture; high task synchronization delay; maximum speedup; parallel execution; database theory; parallel programming; relational databases; storage management
Citation:
M.S. Lakshmi, P.S. Yu, "Effectiveness of Parallel Joins," IEEE Transactions on Knowledge and Data Engineering, vol. 2, no. 4, pp. 410-424, Dec. 1990, doi:10.1109/69.63253