A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts
February 1999 (vol. 10 no. 2)
pp. 115-135

Abstract—This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
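The hyperplane view of memory layouts described in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's actual algorithm: a layout for a two-dimensional array is modeled by a single layout hyperplane vector g, two elements are stored close together when g is orthogonal to their subscript difference, and the function names and the example reference A[j][i] are hypothetical.

```python
# Sketch: a layout hyperplane g defines which array elements are stored
# contiguously; g = (1, 0) models row-major storage (elements sharing the
# first subscript are adjacent), g = (0, 1) models column-major storage.

def consecutive_access_stride(access_matrix, innermost):
    """Subscript difference between the array elements touched by two
    consecutive iterations of the innermost loop (a unit step in that
    loop, mapped through the reference's access matrix)."""
    return [row[innermost] for row in access_matrix]

def has_spatial_locality(g, access_matrix, innermost):
    """A layout hyperplane g yields spatial locality for a reference if
    consecutive accesses stay in the same layout hyperplane, i.e., the
    stride vector d satisfies g . d == 0."""
    d = consecutive_access_stride(access_matrix, innermost)
    return sum(gi * di for gi, di in zip(g, d)) == 0

# Hypothetical reference A[j][i] inside loops (i outer, j inner): the
# access matrix maps the iteration vector (i, j) to the subscripts (j, i).
A_access = [[0, 1],
            [1, 0]]
innermost = 1  # loop j is innermost

row_major = [1, 0]
col_major = [0, 1]

print(has_spatial_locality(row_major, A_access, innermost))  # False
print(has_spatial_locality(col_major, A_access, innermost))  # True
```

Here the innermost loop walks down a column of A, so the row-major hyperplane fails the orthogonality test while the column-major one passes: a data transformation that transposes A's layout would give the reference unit-stride access without touching the loop structure, which is the trade-off between data and loop transformations that the paper analyzes.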

[1] B. Appelbe and B. Lakshmanan, "Optimizing Parallel Programs Using Affinity Regions," Proc. 1993 Int'l Conf. Parallel Processing, pp. 246-249, St. Charles, Ill., Aug. 1993.
[2] J. Anderson, S. Amarasinghe, and M. Lam, “Data and Computation Transformations for Multiprocessors,” Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, July 1995.
[3] J. Anderson and M. Lam, "Global Optimizations for Parallelism and Locality on Scalable Parallel Machines," Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 112-125, Albuquerque, N.M., June 1993.
[4] U. Banerjee, Dependence Analysis for Supercomputing. Norwell, Mass.: Kluwer, 1988.
[5] U. Banerjee, "Unimodular Transformations of Double Loops," Advances in Languages and Compilers for Parallel Processing, A. Nicolau et al., eds. MIT Press, 1991.
[6] R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson, "Data-Distribution Support on Distributed-Shared Memory Multiprocessors," Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 334-345, Las Vegas, Nev., 1997.
[7] S. Chatterjee, J. Gilbert, R. Schreiber, and S. Teng, "Optimal Evaluation of Array Expressions on Massively Parallel Machines," ACM Trans. Programming Languages and Systems, vol. 17, no. 1, pp. 123-156, Jan. 1995.
[8] M. Cierniak and W. Li, “Unifying Data and Control Transformations for Distributed Shared Memory Machines,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, June 1995.
[9] J.J. Dongarra, J.D. Croz, S. Hammarling, and I. Duff, "A Set of Level 3 Basic Linear Algebra Subprograms," ACM Trans. Mathematical Software, vol. 16, no. 1, pp. 1-17, Mar. 1990.
[10] D. Gannon, W. Jalby, and K. Gallivan, "Strategies for Cache and Local Memory Management by Global Program Transformations," J. Parallel and Distributed Computing, vol. 5, no. 5, pp. 587-616, Oct. 1988.
[11] J. Garcia, E. Ayguade, and J. Labarta, "A Novel Approach Towards Automatic Data Distribution," Proc. Supercomputing '95, San Diego, Calif., Dec. 1995.
[12] J. Garcia, E. Ayguade, and J. Labarta, "Dynamic Data Distribution with Control Flow Analysis," Proc. Supercomputing '96, Pittsburgh, Pa., Nov. 1996.
[13] M. Gupta and P. Banerjee, “Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers,” IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp. 179-193, Mar. 1992.
[14] M. Hill and A. Smith, "Evaluating Associativity in CPU Caches," IEEE Trans. Computers, vol. 38, no. 12, pp. 1,612-1,630, Dec. 1989.
[15] "NWChem: A Computational Chemistry Package for Parallel Computers," version 1.1, High Performance Computational Chemistry Group, Pacific Northwest Laboratory, Richland, Wash., 1995.
[16] C.-H. Huang and P. Sadayappan, “Communication-Free Partitioning of Nested Loops,” J. Parallel and Distributed Computing, vol. 19, pp. 90-102, 1993.
[17] T. Jeremiassen and S. Eggers, “Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations,” Proc. SIGPLAN Symp. Principles and Practices of Parallel Programming, pp. 179-188, July 1995.
[18] Y. Ju and H. Dietz, “Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation,” Languages and Compilers for Parallel Computing, U. Banerjee et al., eds., pp. 344-358, Springer, 1992.
[19] M. Kandemir, A. Choudhary, J. Ramanujam, and M. Kandaswamy, "Locality Optimization Algorithms for Compilation of Out-of-Core Codes," J. Information Science and Eng., vol. 14, no. 1, pp. 107-138, Mar. 1998.
[20] M. Kandemir, A. Choudhary, J. Ramanujam, and R. Bordawekar, "Compilation Techniques for Out-of-Core Parallel Computations," Parallel Computing, vol. 24, nos. 3-4, pp. 597-628, June 1998.
[21] M. Kandemir, J. Ramanujam, and A. Choudhary, “A Compiler Algorithm for Optimizing Locality in Loop Nests,” Proc. 1997 ACM Int'l Conf. Supercomputing, pp. 269-276, July 1997.
[22] M. Kandemir, J. Ramanujam, and A. Choudhary, “Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed Memory Machines,” Proc. Int'l Conf. Parallel Architecture and Compiler Techniques (PACT '97), pp. 236-247, Nov. 1997.
[23] K. Kennedy and U. Kremer, “Automatic Data Layout for High Performance Fortran,” Proc. Supercomputing '95, Dec. 1995.
[24] J. Laudon and D. Lenoski, "The SGI Origin: A cc-NUMA Highly Scalable Server," Proc. 24th Ann. Int'l Symp. Computer Architecture, May 1997.
[25] S.-T. Leung and J. Zahorjan, "Optimizing Data Locality by Array Restructuring," Technical Report TR 95-09-01, Dept. of Computer Science and Eng., Univ. of Washington, Sept. 1995.
[26] W. Li, “Compiling for NUMA Parallel Machines,” PhD thesis, Dept. of Computer Science, Cornell Univ., 1993.
[27] J. Li and M. Chen, “Compiling Communication Efficient Programs for Massively Parallel Machines,” J. Parallel and Distributed Computing, vol. 2, no. 3, pp. 361-376, 1991.
[28] M. Mace, Memory Storage Patterns in Parallel Processing. Boston: Kluwer Academic, 1987.
[29] V. Maslov, "Delinearization: An Efficient Way to Break Multi-Loop Dependence Equations," Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 152-161, San Francisco, June 1992.
[30] K. McKinley, S. Carr, and C.W. Tseng, “Improving Data Locality with Loop Transformations,” ACM Trans. Programming Languages and Systems, vol. 18, no. 4, pp. 424-453, July 1996.
[31] M. O'Boyle and P. Knijnenburg, "Non-Singular Data Transformations: Definition, Validity, Applications," Proc. Sixth Workshop Compilers for Parallel Computers, pp. 287-297, Aachen, Germany, 1996.
[32] D. Palermo and P. Banerjee, "Automatic Selection of Dynamic Data Partitioning Schemes for Distributed-Memory Multicomputers," Proc. Eighth Workshop Languages and Compilers for Parallel Computing, Columbus, Ohio, pp. 392-406, 1995.
[33] J. Ramanujam, "Compile-Time Techniques for Parallel Execution of Loops on Distributed Memory Multiprocessors," PhD thesis, The Ohio State Univ., Columbus, Ohio, 1990. Also available from University Microfilms Inc. as Document 91-11789.
[34] J. Ramanujam, “Non-Unimodular Transformations of Nested Loops,” Proc. Supercomputing '92, pp. 214-223, Nov. 1992.
[35] J. Ramanujam and A. Narayan, "Integrating Data Distribution and Loop Transformations for Distributed Memory Machines," Proc. Seventh SIAM Conf. Parallel Processing for Scientific Computing, D. Bailey et al., eds., pp. 668-673, Feb. 1995.
[36] J. Ramanujam and P. Sadayappan, “Compile-Time Techniques for Data Distribution in Distributed Memory Machines,” IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 4, pp. 472-482, Oct. 1991.
[37] A. Schrijver, Theory of Linear and Integer Programming. John Wiley, 1986.
[38] S. Tandri and T. Abdelrahman, “Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors,” Proc. 1997 Int'l Conf. Parallel Processing (ICPP '97), pp. 64-73, Aug. 1997.
[39] J. Torrellas, M. Lam, and J. Hennessy, "False Sharing and Spatial Locality in Multiprocessor Caches," IEEE Trans. Computers, vol. 43, no. 6, pp. 651-663, June 1994.
[40] E. Torrie, C. Tseng, M. Martonosi, and M. Hall, "Evaluating the Impact of Advanced Memory Systems on Compiler-Parallelized Codes," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June 1995.
[41] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, "Operating System Support for Improving Data Locality on cc-NUMA Compute Servers," Proc. Seventh Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 279-289, Cambridge, Mass., Oct. 1996.
[42] R. Wilson, R. French, C. Wilson, S. Amarasinghe, J. Anderson, S. Tjiang, S. Liao, C. Tseng, M. Hall, M. Lam, and J. Hennessy, "SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers," ACM SIGPLAN Notices, vol. 29, no. 12, pp. 31-37, Dec. 1994.
[43] M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 30-44, June 1991.
[44] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.

Index Terms:
Data reuse, locality optimizations, spatial locality, memory performance, parallelism, array restructuring.
Mahmut Kandemir, Alok Choudhary, Nagaraj Shenoy, Prithviraj Banerjee, J. Ramanujam, "A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 2, pp. 115-135, Feb. 1999, doi:10.1109/71.752779