Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers
July 2003 (vol. 14 no. 7)
pp. 625-639
Chun-Yuan Lin, IEEE Computer Society
Yeh-Ching Chung, IEEE Computer Society
Jen-Shiuh Liu

Abstract—Array operations are used in a large number of important scientific codes, such as molecular dynamics, finite-element methods, climate modeling, and atmosphere and ocean sciences. In our previous work, we proposed a scheme called the extended Karnaugh map representation (EKMR) for representing multidimensional arrays, and we showed that sequential multidimensional array operation algorithms based on the EKMR scheme perform better than those based on the traditional matrix representation (TMR) scheme. Since parallel multidimensional array operations have been extensively investigated, in this paper we present efficient data parallel algorithms for multidimensional array operations based on the EKMR scheme for distributed memory multicomputers. In the data parallel programming paradigm, array elements are generally distributed to processors according to some distribution scheme, each processor performs its local computation, and the computed results are then collected from all processors. Based on the row, column, and 2D mesh distribution schemes, we design data parallel algorithms for matrix-matrix addition and matrix-matrix multiplication on multidimensional arrays in both the TMR and EKMR schemes. We also design data parallel algorithms for six Fortran 90 array intrinsic functions: All, Maxval, Merge, Pack, Sum, and Cshift. We compare the times of the data distribution, local computation, and result collection phases of these array operations under the TMR and EKMR schemes. The experimental results show that the algorithms based on the EKMR scheme outperform those based on the TMR scheme for all test cases.
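To make the two ideas in the abstract concrete, the sketch below illustrates (1) folding a three-dimensional array into an EKMR-style two-dimensional layout and (2) the three-phase data parallel pattern (data distribution, local computation, result collection) for matrix-matrix addition under the row distribution scheme. This is a minimal illustration in Python with mpi4py and NumPy, not the authors' implementation: the exact EKMR index interleaving follows the paper and may differ from the fold shown here, and the row count ni is assumed to be divisible by the number of processes.

# ekmr_rowadd.py -- run with: mpiexec -n 4 python ekmr_rowadd.py
# A hypothetical sketch of the three-phase data parallel pattern
# (distribute / compute locally / collect) for matrix-matrix addition
# under the row distribution scheme; not the paper's implementation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

ni, nj, nk = 8, 5, 2        # assumed sizes; ni must be divisible by size

if rank == 0:
    # Two 3D operands A[k][i][j]; fold dimension k into the columns to
    # obtain an EKMR-style 2D layout (rows i, columns (k, j)). The point
    # is that the 3D operation now reduces to an ordinary 2D one.
    A3 = np.arange(nk * ni * nj, dtype='d').reshape(nk, ni, nj)
    B3 = np.ones((nk, ni, nj))
    A2 = np.ascontiguousarray(A3.transpose(1, 0, 2).reshape(ni, nk * nj))
    B2 = np.ascontiguousarray(B3.transpose(1, 0, 2).reshape(ni, nk * nj))
else:
    A2 = B2 = None

rows = ni // size           # contiguous block of rows per processor

# Phase 1: data distribution (row scheme).
local_a = np.empty((rows, nk * nj))
local_b = np.empty((rows, nk * nj))
comm.Scatter(A2, local_a, root=0)
comm.Scatter(B2, local_b, root=0)

# Phase 2: local computation (elementwise addition of the row blocks).
local_c = local_a + local_b

# Phase 3: result collection.
C2 = np.empty((ni, nk * nj)) if rank == 0 else None
comm.Gather(local_c, C2, root=0)

if rank == 0:
    print(C2[0])            # first row of the assembled result

Under the column or 2D mesh distribution schemes, only the blocks handed to each processor in Phases 1 and 3 change, while Phase 2 is unaffected, which is why the paper measures the three phases separately.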


Index Terms:
Data parallel algorithm, array operation, multidimensional array, data distribution, Karnaugh map.
Citation:
Chun-Yuan Lin, Yeh-Ching Chung, Jen-Shiuh Liu, "Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 7, pp. 625-639, July 2003, doi:10.1109/TPDS.2003.1214316