
Antoine P. Petitet and Jack J. Dongarra, "Algorithmic Redistribution Methods for Block-Cyclic Decompositions," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1201-1216, December 1999.
Abstract—This article presents various data redistribution methods for block-partitioned linear algebra algorithms operating on dense matrices that are distributed in a block-cyclic fashion. Because the algorithmic partitioning unit and the distribution blocking factor are most often chosen to be equal, severe alignment restrictions are induced on the operands, and the values that are optimal with respect to performance are architecture dependent. The techniques presented in this paper redistribute data "on the fly," so that the user's data distribution blocking factor becomes independent of the architecture-dependent algorithmic partitioning. These techniques are applied to the matrix-matrix multiplication operation. A performance analysis along with experimental results shows that alignment restrictions can then be removed and that high performance can be maintained across platforms, independently of the user's data distribution blocking factor.
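For readers unfamiliar with the block-cyclic decomposition the abstract refers to, the sketch below illustrates the standard one-dimensional block-cyclic mapping used by ScaLAPACK-style libraries: global indices are dealt out in blocks of size nb, round-robin, over P processes. The function names are illustrative, not taken from the paper.

```python
def block_cyclic_owner(k, nb, P):
    """Process that owns global index k when blocks of size nb
    are distributed cyclically over P processes."""
    return (k // nb) % P

def block_cyclic_local(k, nb, P):
    """Local index of global index k on its owning process:
    the number of complete distribution cycles before k, times nb,
    plus the offset of k within its block."""
    return (k // (nb * P)) * nb + (k % nb)
```

With nb = 2 and P = 3, global indices 0..7 land on processes 0,0,1,1,2,2,0,0 with local indices 0,1,0,1,0,1,2,3. The paper's point is that nb is normally fixed by the user's data layout, while the algorithm would prefer an architecture-dependent block size; on-the-fly redistribution decouples the two.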
[1] M. Aboelaze, N. Chrisochoides, and E. Houstis, “The Parallelization of Level 2 and 3 BLAS Operations on Distributed-Memory Machines,” Technical Report CSD-TR-91-007, Purdue Univ., West Lafayette, Ind., 1991.
[2] R. Agarwal, F. Gustavson, and M. Zubair, “A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication,” IBM J. Research and Development, vol. 38, no. 6, pp. 673–681, 1994.
[3] T. Agerwala, J. Martin, J. Mirza, D. Sadler, D. Dias, and M. Snir, “SP2 System Architecture,” IBM Systems J., vol. 34, no. 2, pp. 153–184, 1995.
[4] C. Ancourt, F. Coelho, F. Irigoin, and R. Keryell, “A Linear Algebra Framework for Static HPF Code Distribution,” Technical Report A-278-CRI, CRI, École des Mines, Fontainebleau, France, 1995. (Available at http://www.cri.ensmp.fr.)
[5] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1995.
[6] C. Ashcraft, “The Distributed Solution of Linear Systems Using the Torus-Wrap Data Mapping,” Technical Report ECA-TR-147, Boeing Computer Services, Seattle, Wash., 1990.
[7] P. Bangalore, “The Data-Distribution-Independent Approach to Scalable Parallel Libraries,” master's thesis, Mississippi State Univ., 1995.
[8] J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C. Chin, “Optimizing Matrix Multiply Using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology,” Technical Report UT-CS-96-326, LAPACK Working Note 111, Univ. Tennessee, 1996.
[9] R. Bisseling and J. van der Vorst, “Parallel LU Decomposition on a Transputer Network,” Lecture Notes in Computer Science, G. van Zee and J. van der Vorst, eds., vol. 384, pp. 61–77, 1989.
[10] R. Bisseling and J. van der Vorst, “Parallel Triangular System Solving on a Mesh Network of Transputers,” SIAM J. Scientific and Statistical Computing, vol. 12, pp. 787–799, 1991.
[11] L. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley, ScaLAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1997.
[12] R. Brent and P. Strazdins, “Implementation of BLAS Level 3 and LINPACK Benchmark on the AP1000,” Fujitsu Scientific and Technical J., vol. 5, no. 1, pp. 61–70, 1993.
[13] S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Tseng, “Generating Local Addresses and Communication Sets for Data Parallel Programs,” J. Parallel and Distributed Computing, vol. 26, pp. 72–84, 1995.
[14] J. Choi, “A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers,” Technical Report UT-CS-97-369, LAPACK Working Note 129, Univ. Tennessee, 1997.
[15] J. Choi, J. Dongarra, and D. Walker, “PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed-Memory Concurrent Computers,” Concurrency: Practice and Experience, vol. 6, no. 7, pp. 543–570, 1994.
[16] J. Choi, J. Dongarra, and D. Walker, “PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subprograms,” Concurrency: Practice and Experience, vol. 8, no. 7, pp. 517–535, 1996.
[17] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. van de Geijn, “Parallel Implementation of BLAS: General Techniques for Level 3 BLAS,” Concurrency: Practice and Experience, vol. 9, no. 9, pp. 837–857, 1997.
[18] E. Chu and A. George, “QR Factorization of a Dense Matrix on a Hypercube Multiprocessor,” SIAM J. Scientific and Statistical Computing, vol. 11, pp. 990–1028, 1990.
[19] M. Dayde, I. Duff, and A. Petitet, “A Parallel Block Implementation of Level 3 BLAS for MIMD Vector Processors,” ACM Trans. Mathematical Software, vol. 20, no. 2, pp. 178–193, 1994.
[20] F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, and Y. Robert, “Scheduling Block-Cyclic Array Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 2, pp. 192–205, 1998.
[21] J. Dongarra, R. van de Geijn, and D. Walker, “Scalability Issues in the Design of a Library for Dense Linear Algebra,” J. Parallel and Distributed Computing, vol. 22, no. 3, pp. 523–537, 1994.
[22] J. Dongarra and D. Walker, “Software Libraries for Linear Algebra Computations on High Performance Computers,” SIAM Review, vol. 37, no. 2, pp. 151–180, 1995.
[23] J. Dongarra and R.C. Whaley, “A User's Guide to the BLACS v1.0,” Technical Report UT-CS-95-281, LAPACK Working Note 94, Univ. Tennessee, 1995.
[24] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, Vol. I: General Techniques and Regular Problems. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[25] G. Fox, S. Otto, and A. Hey, “Matrix Algorithms on a Hypercube I: Matrix Multiplication,” Parallel Computing, vol. 3, pp. 17–31, 1987.
[26] G. Geist and C. Romine, “LU Factorization Algorithms on Distributed-Memory Multiprocessor Architectures,” SIAM J. Scientific and Statistical Computing, vol. 9, pp. 639–649, 1988.
[27] M. Heath and C. Romine, “Parallel Solution of Triangular Systems on Distributed-Memory Multiprocessors,” SIAM J. Scientific and Statistical Computing, vol. 9, pp. 558–588, 1988.
[28] B. Hendrickson, E. Jessup, and C. Smith, “A Parallel Eigensolver for Dense Symmetric Matrices,” personal communication, 1996.
[29] B. Hendrickson and D. Womble, “The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers,” SIAM J. Scientific Computing, vol. 15, no. 5, pp. 1201–1226, Sept. 1994.
[30] G. Henry and R. van de Geijn, “Parallelizing the QR Algorithm for the Unsymmetric Algebraic Eigenvalue Problem: Myths and Reality,” Technical Report UT-CS-94-244, LAPACK Working Note 79, Univ. Tennessee, 1994.
[31] S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, “Matrix Multiplication on the Intel Touchstone DELTA,” Concurrency: Practice and Experience, vol. 6, no. 7, pp. 571–594, 1994.
[32] B. Kågström, P. Ling, and C. van Loan, “GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark,” Technical Report UMINF 95.18, Dept. Computing Science, Umeå Univ., 1995.
[33] E. Kalns and L. Ni, “Processor Mapping Techniques Towards Efficient Data Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 12, pp. 1234–1247, 1995.
[34] K. Kennedy, N. Nedeljković, and A. Sethi, “A Linear-Time Algorithm for Computing the Memory Access Sequence in Data Parallel Programs,” Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, 1995.
[35] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel, The High Performance Fortran Handbook. MIT Press, 1994.
[36] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin Cummings, 1994.
[37] G. Li and T. Coleman, “A New Method for Solving Triangular Systems on Distributed-Memory Message-Passing Multiprocessors,” SIAM J. Scientific and Statistical Computing, vol. 10, no. 2, pp. 382–396, 1989.
[38] W. Lichtenstein and S.L. Johnsson, “Block-Cyclic Dense Linear Algebra,” SIAM J. Scientific and Statistical Computing, vol. 14, no. 6, pp. 1259–1288, 1993.
[39] Y. Lim, P. Bhat, and V. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Technical Report CENG 97-10, Dept. Electrical Engineering-Systems, Univ. Southern California, Los Angeles, Calif., 1997.
[40] K. Mathur and S.L. Johnsson, “Multiplication of Matrices of Arbitrary Shapes on a Data Parallel Computer,” Parallel Computing, vol. 20, pp. 919–951, 1994.
[41] A. Petitet, Algorithmic Redistribution Methods for Block Cyclic Decompositions, doctoral thesis, Univ. Tennessee, Knoxville, 1996.
[42] L. Prylli and B. Tourancheau, “Fast Runtime Block Cyclic Data Redistribution on Multiprocessors,” J. Parallel and Distributed Computing, vol. 45, 1997.
[43] P. Strazdins, “Matrix Factorization using Distributed Panels on the Fujitsu AP1000,” Proc. IEEE First Int'l Conf. Algorithms and Architectures for Parallel Processing (ICA3PP95), 1995.
[44] P. Strazdins and H. Koesmarno, “A High Performance Version of Parallel LAPACK: Preliminary Report,” Proc. Sixth Parallel Computing Workshop, Fujitsu Parallel Computing Center, 1996.
[45] C. Stunkel, D. Shea, B. Abali, M. Atkins, C. Bender, D. Grice, P. Hochshild, D. Joseph, B. Nathanson, R. Swetz, R. Stucke, M. Tsao, and P. Varker, “The SP2 High-Performance Switch,” IBM Systems J., vol. 34, no. 2, pp. 185–204, 1995.
[46] A. Thirumalai and J. Ramanujam, “Fast Address Sequence Generation for Data Parallel Programs Using Integer Lattices,” Languages and Compilers for Parallel Computing: Lecture Notes in Computer Science, P. Sadayappan et al., eds., Springer-Verlag, 1996.
[47] R. van de Geijn and J. Watts, “SUMMA: Scalable Universal Matrix Multiplication Algorithm,” Concurrency: Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997.
[48] E. van de Velde, “Experiments with Multicomputer LU Decomposition,” Concurrency: Practice and Experience, vol. 2, pp. 1–26, 1990.
[49] D. Walker and S. Otto, “Redistribution of Block-Cyclic Data Distributions Using MPI,” Concurrency: Practice and Experience, vol. 8, no. 9, pp. 707–728, 1996.
[50] L. Wang, J. Stichnoth, and S. Chatterjee, “Runtime Performance of Parallel Array Assignment: An Empirical Study,” Proc. Supercomputing, 1996.
[51] R. Whaley and J. Dongarra, “Automatically Tuned Linear Algebra Software,” Technical Report UT-CS-97-366, LAPACK Working Note 131, Univ. Tennessee, 1997.