This Article 
 Bibliographic References 
 Add to: 
Algorithmic Redistribution Methods for Block-Cyclic Decompositions
December 1999 (vol. 10 no. 12)
pp. 1201-1216

Abstract—This article presents various data redistribution methods for block-partitioned linear algebra algorithms operating on dense matrices that are distributed in a block-cyclic fashion. Because the algorithmic partitioning unit and the distribution blocking factor are most often chosen to be equal, severe alignment restrictions are induced on the operands, and optimal values with respect to performance are architecture dependent. The techniques presented in this paper redistribute data “on the fly,” so that the user's data distribution blocking factor becomes independent from the architecture dependent algorithmic partitioning. These techniques are applied to the matrix-matrix multiplication operation. A performance analysis along with experimental results shows that alignment restrictions can then be removed and that high performance can be maintained across platforms independently from the user's data distribution blocking factor.

[1] M. Aboelaze, N. Chrisochoides, and E. Houstis, “The Parallelization of Level 2 and 3 BLAS Operations on Distributed-Memory Machines,” Technical Report CSD-TR-91-007, Purdue Univ., West Lafayette, Ind., 1991.
[2] R. Agarwal, F. Gustavson, and M. Zubair, “A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication,” IBM J. Research and Development, vol. 38, no. 6,pp.673–681, 1994.
[3] T. Agerwala, J. Martin, J. Mirza, D. Sadler, D. Dias, and M. Snir, “SP2 System Architecture,” IBM Systems J., vol. 34, no. 2,pp. 153–184, 1995.
[4] C. Ancourt, F. Coelho, F. Irigoin, R. Keryell, “A linear Algebra Framework for Static HPF Code Distribution,” Technical Report A-278-CRI, CRI-Ecole des Mines, Fontainebleau, France, 1995. (Available athttp:/
[5] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1995.
[6] C. Ashcraft, “The Distributed Solution of Linear Systems Using the Torus-Wrap Data Mapping,” Technical Report ECA-TR-147, Boeing Computer Services, Seattle, Wash., 1990.
[7] P. Bangalore, “The Data-Distribution-Independent Approach to Scalable Parallel Libraries,” master's thesis, Mississippi State Univ., 1995.
[8] J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C. Chin, “Optimizing Matrix Multiply using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology,” Technical Report UT CS-96-326, LAPACK Working Note 111, Univ. Tennessee, 1996.
[9] R. Bisseling and J. van der Vorst, “Parallel LU Decomposition on a Transputer Network,” Lecture Notes in Computer Sciences, G. van Zee and J. van der Vorst, eds., vol. 384, pp. 61–77, 1989.
[10] R. Bisseling and J. van der Vorst, “Parallel Triangular System Solving on a Mesh Network of Transputers,” SIAM J. Scientific and Statistical Computing, vol. 12,pp. 787–799, 1991.
[11] L. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley, ScaLAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1997.
[12] R. Brent and P. Strazdins, “Implementation of BLAS Level 3 and LINPACK Benchmark on the AP1000,” Fujitsu Scientific and Technical J., vol. 5, no. 1,pp. 61–70, 1993.
[13] S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Tseng, “Generating Local Adresses and Communication Sets for Data Parallel Programs,” J. Parallel and Distributed Computing, vol. 26,pp. 72–84, 1995.
[14] J. Choi, “A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers,” Technical Report UT CS-97-369, LAPACK Working Note 129, Univ. Tennessee, 1997.
[15] J. Choi, J. Dongarra, and D. Walker, “PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed-Memory Concurrent Computers,” Concurrency: Practice and Experience, vol. 6, no. 7,pp. 543–570, 1994.
[16] J. Choi, J. Dongarra, and D. Walker, “PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subroutines” Concurrency: Practice and Experience, vol. 8, no. 7,pp. 517–535, 1996.
[17] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. van de Geijn, “Parallel Implementation of BLAS: General Techniques for Level 3 BLAS,” Concurrency: Practice and Experience, vol. 9, no. 9,pp. 837–857, 1997.
[18] E. Chu and A. George, “QR Factorization of a Dense Matrix on a Hypercube Multiprocessor,” SIAM J. Scientific and Statistical Computing, vol. 11,pp. 990–1,028, 1990.
[19] M. Dayde, I. Duff, and A. Petitet, “A Parallel Block Implementation of Level 3 BLAS for MIMD Vector Processors,” ACM Trans. Mathematical Software, vol. 20, no. 2,pp. 178–193, 1994.
[20] F. Desprez, J. Dongarra, and A. Petitet, C. Randriamaro, Y. Robert, “Scheduling Block-Cyclic Array Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 2,pp. 192–205 1998.
[21] J. Dongarra, R. van de Geijn, and D. Walker, “Scalability Issues in the Design of a Library for Dense Linear Algebra,” J. Parallel and Distributed Computing, vol. 22, no. 3,pp. 523–537, 1994.
[22] J. Dongarra and D. Walker, “Software Libraries for Linear Algebra Computations on High Performance Computers,” SIAM Review, vol. 37, no. 2,pp. 151–180, 1995.
[23] J. Dongarra and R.C. Whaley, “A User's Guide to the BLACS v1.0,” Technical Report UT CS-95-281, LAPACK Working Note 94, Univ. Tennessee, 1995. ()
[24] G. Fox,M. Johnson,G. Lyzenga,S. Otto,J. Salmon,, and D. Walker,Solving Problems on Concurrent Processors, Vol. I: General Techniques andRegular Problems.Englewood Cliffs, N.J.: Prentice Hall 1988.
[25] G. Fox, S. Otto, and A. Hey, “Matrix Algorithms on a Hyper-cube I: Matrix Multiplication,” Parallel Computing, vol. 3,pp. 17–31, 1987.
[26] G. Geist and C. Romine, “LU Factorization Algorithms on Distributed-Memory Multiprocessor Architectures,” SIAM J. Scientific and Statistical Computing, vol. 9,pp. 639–649, 1988.
[27] M. Heath and C. Romine, “Parallel Solution Triangular Systems on Distributed-Memory Multiprocessors,” SIAM J. Scientific and Statistical Computing, vol. 9,pp. 558–588, 1988.
[28] B. Hendrickson, E. Jessup, and C. Smith, “A Parallel Eigensolver for Dense Symmetric Matrices,” personal communication, 1996.
[29] B. Hendrickson and D. Womble, “The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers,” J. Scientific and Statistical Computing, vol. 15, no. 5,pp. 1,201–1,226, Sept. 1994.
[30] G. Henry and R. van de Geijn, “Parallelizing the QR Algorithm for the Unsymmetric Algebraic Eigenvalue Problem: Myths and Reality,” Technical Report UT CS-94-244, LAPACK Working Note 79, Univ. Tennessee, 1994.
[31] S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, “Matrix Multiplication on the Intel Touchstone DELTA,” Concurrency: Practice and Experience, vol. 6, no. 7,pp. 571–594, 1994.
[32] B. Kågström, P. Ling, and C. van Loan, “GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark,” Technical Report UMINF 95-18, Dept. Computing Science, UmeåUniv., 1995.
[33] E. Kalns and L. Ni, “Processor Mapping Techniques towards Efficient Data Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 6,pp. 1,234–1,247, 1995.
[34] K. Kennedy, N. ${\bf Nedeljkovi\acute c}$, and A. Sethi, “A Linear-Time Algorithm for Computing the Memory Access Sequence in Data Parallel Programs,” Proc. Fifth ACM SIGPLAN, Symp. Principles and Practice of Parallel Programming, 1995.
[35] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel, The High Performance Fortran Handbook. MIT Press, 1994.
[36] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin Cummings, 1994.
[37] G. Li and T. Coleman, “A New Method for Solving Triangular Systems on Distributed-Memory Message-Passing Multiprocessor,” SIAM J. Scientific and Statistical Computing, vol. 10, no. 2,pp. 382–396, 1989.
[38] W. Lichtenstein and S.L. Johnsson, “Block-Cyclic Dense Linear Algebra,” SIAM J. Scientific and Statistical Computing, vol. 14, no. 6,pp. 1,259–1,288 1993.
[39] Y. Lim, P. Bhat, and V. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Technical Report CENG 97-10, Dept. Electrical Engineering-Systems, Univ. Southern California, Los Angeles, Calif., 1997.
[40] K. Mathur, S.L. Johnsson, “Multiplication of Matrices of Arbitrary Shapes on a Data Parallel Computer,” Parallel Computing, vol. 20,pp. 919–951, 1994.
[41] A. Petitet, Algorithmic Redistribution Methods for Block Cyclic Decompositions, doctoral thesis, Univ. Tennessee, K noxville, 1996.
[42] L. Prylli and B. Tourancheau, “Fast Runtime Block Cyclic Data Redistribution on Multiprocessors,” J. Parallel and Distributed Computing, vol. 45, 1997.
[43] P. Strazdins, “Matrix Factorization using Distributed Panels on the Fujitsu AP1000,” Proc. IEEE First Int'l Conf. Algorithms and Architectures for Parallel Processing (ICA3PP-95), 1995.
[44] P. Strazdins and H. Koesmarno, “A High Performance Version of Parallel LAPACK: Preliminary Report,” Proc. Sixth Parallel Computing Workshop, Fujitsu Parallel Computing Center, 1996.
[45] C. Stunkel, D. Shea, B. Abali, M. Atkins, C. Bender, D. Grice, P. Hochshild, D. Joseph, B. Nathanson, R. Swetz, R. Stucke, M. Tsao, and P. Varker, “The SP2 High-Performance Switch,” IBM Systems J., vol. 34, no. 2,pp. 185–204, 1995.
[46] A. Thirumalai and J. Ramanujam, “Fast Address Sequence Generation for Data Parallel Programs Using Integer Lattices,” Languages and Compilers for Parallel Computing: Lecture Notes in Computer Science. P. Sadayappan et al., eds., Springer-Verlag, 1996.
[47] R. van de Geijn and J. Watts, “SUMMA: Scalable Universal Matrix Multiplication Algorithm,” Concurrency: Practice and Experience, vol. 9, no. 4,pp. 255–274, 1997.
[48] E. van de Velde, “Experiments with Multicomputer LU Decomposition,” Concurrency: Practice and Experience, vol. 2,pp. 1–26, 1990.
[49] D. Walker and S. Otto, “Redistribution of Block-Cyclic Data Distributions Using MPI,” Concurrency: Practice and Experience, vol. 8, no. 9,pp. 707–728, 1996.
[50] L. Wang, J. Stichnoth, S. Chatterjee, “Runtime Performance of Par-allel Array Assignment: An Empirical Study,” Proc. Supercomputing, 1996. ().
[51] R. Whaley and J. Dongarra, “Automatically Tuned Linear Algebra Software,” Technical Report UT CS-97-366, LAPACK Working Note 131, Univ. Tennessee, 1997.

Index Terms:
Algorithmic blocking, redistribution, block-cyclic decomposition.
Antoine P. Petitet, Jack J. Dongarra, "Algorithmic Redistribution Methods for Block-Cyclic Decompositions," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1201-1216, Dec. 1999, doi:10.1109/71.819944
Usage of this product signifies your acceptance of the Terms of Use.