Recursive Array Layouts and Fast Matrix Multiplication
November 2002 (vol. 13 no. 11)
pp. 1105-1123

Abstract: The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. It extends previous work on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen and Winograd. Recursive layouts significantly outperform traditional layouts for the standard algorithm (reducing execution times by a factor of 1.2-2.5), but they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, computation can be reordered to conserve memory space and improve performance by 10 to 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; stopping the recursion at matrix tiles that are themselves stored in canonical order yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated, and addressing overheads are shown to remain under control even for the most computationally demanding of these layouts.
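
To make the idea of a recursive layout over canonically ordered tiles concrete, the following is a minimal sketch of address computation for one such layout. It assumes Morton (Z) ordering of the tiles and row-major order inside each tile. The abstract does not name its five layouts, so this choice, along with the function names `interleave_bits` and `tiled_morton_offset` and the `tile` parameter, is illustrative and not taken from the paper.

```python
def interleave_bits(row: int, col: int) -> int:
    """Morton (Z-order) index of a tile: the bits of `row` and `col`
    interleaved, with column bits in even positions and row bits in odd ones."""
    z = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        z |= ((col >> bit) & 1) << (2 * bit)
        z |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return z


def tiled_morton_offset(i: int, j: int, tile: int) -> int:
    """Linear offset of element (i, j) when tiles follow Morton order and
    the elements inside each tile are stored row-major (an assumption)."""
    tile_index = interleave_bits(i // tile, j // tile)
    return tile_index * tile * tile + (i % tile) * tile + (j % tile)


if __name__ == "__main__":
    # Print the offsets of a 4 x 4 matrix stored as 2 x 2 tiles.
    for i in range(4):
        print([tiled_morton_offset(i, j, tile=2) for j in range(4)])
```

In this sketch the bit interleaving runs once per tile rather than once per element, which is one way the cost of address computation can be amortized; within a tile, addressing is ordinary row-major arithmetic.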

Index Terms:
Data layout, matrix multiplication.
Siddhartha Chatterjee, Alvin R. Lebeck, Praveen K. Patnala, Mithuna Thottethodi, "Recursive Array Layouts and Fast Matrix Multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 11, pp. 1105-1123, Nov. 2002, doi:10.1109/TPDS.2002.1058095