Chun-Yuan Lin, Yeh-Ching Chung, and Jen-Shiuh Liu, "Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 7, pp. 625-639, July 2003. ISSN 1045-9219. doi: 10.1109/TPDS.2003.1214316. Published by IEEE Computer Society, Los Alamitos, CA, USA.

Keywords: Data parallel algorithm, array operation, multidimensional array, data distribution, Karnaugh map.
Abstract—Array operations are useful in a large number of important scientific codes, such as molecular dynamics, finite element methods, climate modeling, and atmosphere and ocean sciences. In our previous work, we have proposed a scheme

