
Siddhartha Chatterjee, Alvin R. Lebeck, Praveen K. Patnala, Mithuna Thottethodi, "Recursive Array Layouts and Fast Matrix Multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 11, pp. 1105–1123, Nov. 2002.
Abstract—The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2–2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder the computation to conserve memory space and improve performance by 10 to 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated, and it is shown that addressing overheads can be kept under control even for the most computationally demanding of these layouts.
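The combination the abstract recommends can be illustrated with a minimal sketch (hypothetical code, not the authors' implementation): elements are stored row-major inside small tiles, the tiles themselves are ordered along a Z-Morton curve, and the standard algorithm recurses on quadrants down to the tile level. The tile size, helper names, and matrix representation below are illustrative assumptions.

```python
TILE = 2  # illustrative tile size; recursion stops here and tiles are row-major


def interleave_bits(i, j, bits):
    """Z-Morton index of tile (i, j): interleave the coordinate bits."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)  # row bits at odd positions
        z |= ((j >> b) & 1) << (2 * b)      # column bits at even positions
    return z


def addr(r, c, n):
    """Flat offset of element (r, c) in a Morton-of-tiles layout of an n x n matrix."""
    bits = (n // TILE - 1).bit_length()
    tile = interleave_bits(r // TILE, c // TILE, bits)
    return tile * TILE * TILE + (r % TILE) * TILE + (c % TILE)


def rmm(A, B, C, ra, ca, rb, cb, rc, cc, m, n):
    """C block += A block * B block, recursing on quadrants down to TILE."""
    if m == TILE:
        for i in range(TILE):          # standard multiply on one tile pair
            for k in range(TILE):
                a = A[addr(ra + i, ca + k, n)]
                for j in range(TILE):
                    C[addr(rc + i, cc + j, n)] += a * B[addr(rb + k, cb + j, n)]
        return
    h = m // 2
    for i in (0, h):                   # eight recursive quadrant products
        for j in (0, h):
            for k in (0, h):
                rmm(A, B, C, ra + i, ca + k, rb + k, cb + j, rc + i, cc + j, h, n)
```

Because consecutive Morton indices name spatially adjacent tiles, each recursive subproblem touches a contiguous region of memory at every level of the recursion, which is the locality property recursive layouts are meant to provide.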
[1] M. Thottethodi, S. Chatterjee, and A.R. Lebeck, “Tuning Strassen's Matrix Multiplication for Memory Efficiency,” Proc. ACM/IEEE SC98 Conf. High Performance Networking and Computing, Nov. 1998.
[2] S. Chatterjee, V.V. Jain, A.R. Lebeck, S. Mundhra, and M. Thottethodi, “Nonlinear Array Layouts for Hierarchical Memory Systems,” Proc. 1999 ACM Int'l Conf. Supercomputing, pp. 444–453, June 1999.
[3] S. Chatterjee, A.R. Lebeck, P.K. Patnala, and M. Thottethodi, “Recursive Array Layouts and Fast Parallel Matrix Multiplication,” Proc. Eleventh Ann. ACM Symp. Parallel Algorithms and Architectures, pp. 222231, June 1999.
[4] J.J. Dongarra, J.D. Croz, S. Hammarling, and I. Duff, “A Set of Level 3 Basic Linear Algebra Subprograms,” ACM Trans. Mathematical Software, vol. 16, no. 1, pp. 1–17, Mar. 1990.
[5] D. Hilbert, “Über Stetige Abbildung Einer Linie Auf Ein Flächenstück,” Mathematische Annalen, vol. 38, pp. 459–460, 1891.
[6] G. Peano, “Sur une Courbe Qui Remplit Toute une Aire Plane,” Mathematische Annalen, vol. 36, pp. 157–160, 1890.
[7] M.S. Warren and J.K. Salmon, “A Parallel Hashed Oct-Tree N-Body Algorithm,” Proc. Supercomputing '93, pp. 12–21, 1993.
[8] I. Banicescu and S.F. Hummel, “Balancing Processor Loads and Exploiting Data Locality in N-Body Simulations,” Proc. 1995 ACM/IEEE Supercomputing Conf., Dec. 1995.
[9] S.F. Hummel, I. Banicescu, C.T. Wang, and J. Wein, “Load Balancing and Data Locality via Fractiling: An Experimental Study,” Languages, Compilers, and Run-Time Systems for Scalable Computers, 1995.
[10] Y.C. Hu, S.L. Johnsson, and S.H. Teng, “High Performance Fortran for Highly Irregular Problems,” Proc. Sixth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 13–24, June 1997.
[11] J.R. Pilkington and S.B. Baden, “Dynamic Partitioning of Non-Uniform Structured Workloads with Space-Filling Curves,” IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 3, pp. 288–300, 1996.
[12] J.P. Singh, T. Joe, A. Gupta, and J. Hennessy, “An Empirical Comparison of the KSR and DASH Multiprocessors,” Proc. Supercomputing '93, pp. 214–225, Nov. 1993.
[13] T. Bially, “Space-Filling Curves: Their Generation and Their Application to Bandwidth Reduction,” IEEE Trans. Information Theory, vol. 15, no. 6, pp. 658–664, 1969.
[14] M.F. Goodchild and A.W. Grandfield, “Optimizing Raster Storage: An Examination of Four Alternatives,” Proc. AutoCarto 6, vol. 1, pp. 400–407, Oct. 1983.
[15] R. Laurini, “Graphical Data Bases Built on Peano SpaceFilling Curves,” Proc. EUROGRAPHICS '85 Conf., C.E. Vandoni, ed., pp. 327–338, 1985.
[16] H.V. Jagadish, “Linear Clustering of Objects with Multiple Attributes,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 332–342, 1990.
[17] J.D. Frens and D.S. Wise, “Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code,” Proc. Sixth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, June 1997.
[18] D.S. Wise, “Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free,” Proc. Euro-Par 2000: Parallel Processing, pp. 774–784, Aug. 2000.
[19] D.S. Wise, J.D. Frens, Y. Gu, and G.A. Alexander, “Language Support for Morton-Order Matrices,” Proc. Eighth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 24–33, June 2001.
[20] B. Moon, H.V. Jagadish, C. Faloutsos, and J.H. Saltz, “Analysis of the Clustering Properties of the Hilbert Space-Filling Curve,” Technical Report CS-TR-3590, Univ. of Maryland Dept. of Computer Science, Mar. 1996.
[21] M. Lam, E. Rothberg, and M. Wolf, “The Cache Performance and Optimizations of Blocked Algorithms,” Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '91), 1991.
[22] Basic Linear Algebra Subroutine Technical (BLAST) Forum, “Basic Linear Algebra Subroutine Technical (BLAST) Forum Standard,” http://www.netlib.org/blasblastforum/, Aug. 2001.
[23] C.E. Leiserson, “Personal Communication,” Aug. 1998.
[24] V. Strassen, “Gaussian Elimination is Not Optimal,” Numerische Mathematik, vol. 13, pp. 354–356, 1969.
[25] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou, “Cilk: An Efficient Multithreaded Runtime System,” Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 207–216, July 1995.
[26] P.C. Fischer and R.L. Probert, “Efficient Procedures for Using Matrix Algorithms,” Automata, Languages, and Programming, 1974.
[27] M. Hill and A. Smith, “Evaluating Associativity in CPU Caches,” IEEE Trans. Computers, vol. 38, no. 12, pp. 1612–1630, Dec. 1989.
[28] N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996.
[29] M. Cierniak and W. Li, “Unifying Data and Control Transformations for Distributed Shared Memory Machines,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, June 1995.
[30] D. Culler and J.P. Singh, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.
[31] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel, The High Performance Fortran Handbook. MIT Press, 1994.
[32] H. Sagan, Space-Filling Curves. Springer-Verlag, 1994.
[33] M. Mano, Digital Design. Prentice-Hall, 1984.
[34] S. Huss-Lederman, E.M. Jacobson, J.R. Johnson, A. Tsao, and T. Turnbull, “Implementation of Strassen's Algorithm for Matrix Multiplication,” Proc. Supercomputing '96, 1996.
[35] C. Douglas, M. Heroux, G. Slishman, and R.M. Smith, “GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm,” J. Computational Physics, vol. 110, pp. 1–10, 1994.
[36] J. Bilmes, K. Asanovic, C.W. Chin, and J. Demmel, “Optimizing Matrix-Multiply Using PHiPAC: A Portable, High-Performance ANSI C Coding Methodology,” Proc. Int'l Conf. Supercomputing, pp. 340–347, July 1997.
[37] R.C. Whaley and J. Dongarra, “Automatically Tuned Linear Algebra Software (ATLAS),” Proc. Supercomputing, Nov. 1998.
[38] M. Frigo and S.G. Johnson, “FFTW: An Adaptive Software Architecture for the FFT,” Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 3, p. 1381, 1998.
[39] S. Toledo, “Locality of Reference in LU Decomposition with Partial Pivoting,” SIAM J. Matrix Analysis and Applications, vol. 18, no. 4, pp. 1065–1081, Oct. 1997.
[40] F.G. Gustavson, “Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms,” IBM J. Research and Development, vol. 41, no. 6, pp. 737–755, Nov. 1997.
[41] E. Elmroth and F. Gustavson, “Applying Recursion to Serial and Parallel QR Factorization Leads to Better Performance,” IBM J. Research and Development, vol. 44, no. 4, pp. 605–624, July 2000.
[42] J. Wasniewski, B.S. Andersen, and F. Gustavson, “Recursive Formulation of Cholesky Algorithm in Fortran 90,” Proc. Fourth Int'l Workshop, Applied Parallel Computing, Large Scale Scientific and Industrial Problems, PARA '98, B. Kågström, J. Dongarra, E. Elmroth, and J. Wasniewski, eds., June 1998.
[43] B.S. Andersen, F. Gustavson, J. Wasniewski, and P.Y. Yalamov, “Recursive Formulation of Some Dense Linear Algebra Algorithms,” Proc. Ninth SIAM Conf. Parallel Processing for Scientific Computing (PPSC '99), B. Hendrickson, K.A. Yelick, C.H. Bischof, I.S. Duff, A.S. Edelman, G.A. Geist, M.T. Heath, M.A. Heroux, C. Koelbel, R.S. Schreiber, R.F. Sincovec, and M.F. Wheeler, eds., Mar. 1999.
[44] IBM, Engineering and Scientific Subroutine Library Version 2 Release 2, Guide and Reference. vols. 1–3, 1994.
[45] F.G. Gustavson, “New Generalized Data Structures for Matrices Lead to a Variety of HighPerformance Algorithms,” Simulation and Visualization on the Grid, B. Engquist, L. Johnsson, M. Hammill, and F. Short, eds., 2000.
[46] L. Stals and U. Rüde, “Techniques for Improving the Data Locality of Iterative Methods,” Technical Report MRR97038, Institut für Mathematik, Universität Augsburg, Germany, Oct. 1997.
[47] G. Gibson, J.S. Vitter, and J. Wilkes, “Report of the Working Group on Storage I/O for LargeScale Computing,” ACM Computing Surveys, Dec. 1996.
[48] C.E. Leiserson, S. Rao, and S. Toledo, “Efficient OutofCore Algorithms for Linear Relaxation Using Blocking Covers,” J. Computer and System Sciences, vol. 54, no. 2, pp. 332–344, 1997.
[49] S. Sen and S. Chatterjee, “Towards a Theory of Cache-Efficient Algorithms,” Proc. 11th Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 829–838, Jan. 2000.
[50] M. Wolfe, “More Iteration Space Tiling,” Proc. Supercomputing '89, pp. 655–664, Nov. 1989.
[51] M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 30–44, June 1991.
[52] S. Carr, K.S. McKinley, and C.W. Tseng, “Compiler Optimizations for Improving Data Locality,” Proc. Sixth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 252–262, Oct. 1994.
[53] L. Carter, J. Ferrante, and S.F. Hummel, “Hierarchical Tiling for Improved Superscalar Performance,” Proc. Ninth Int'l Symp. Parallel Processing, pp. 239–245, Apr. 1995.
[54] M. Mace, Memory Storage Patterns in Parallel Processing. Boston: Kluwer Academic, 1987.
[55] K. Knobe, J.D. Lukas, and G.L. Steele Jr., “Data Optimization: Allocation of Arrays to Reduce Communication on SIMD Machines,” J. Parallel and Distributed Computing, vol. 8, no. 2, pp. 102–118, Feb. 1990.
[56] M. Gupta, “Automatic Data Partitioning on Distributed Memory Multicomputers,” PhD thesis CRHC-92-19/UILU-ENG-92-2237, Dept. of Computer Science, Univ. of Illinois, Urbana, Sept. 1992.
[57] S. Chatterjee, J. Gilbert, R. Schreiber, and S. Teng, “Optimal Evaluation of Array Expressions on Massively Parallel Machines,” ACM Trans. Programming Languages and Systems, vol. 17, no. 1, pp. 123–156, Jan. 1995.
[58] K. Kennedy and U. Kremer, “Automatic Data Layout for Distributed Memory Machines,” ACM Trans. Programming Languages and Systems, 1998.