A Layout-Conscious Iteration Space Transformation Technique
December 2001 (vol. 50 no. 12)
pp. 1321-1336

Exploiting locality of references has become extremely important in realizing the potential performance of modern machines with deep memory hierarchies. The data access patterns of programs and the memory layouts of the accessed data sets play a critical role in determining the performance of applications running on these machines. This paper presents a cache locality optimization technique that can optimize a loop nest even if the arrays referenced have different layouts in memory. Such a capability is required for a global locality optimization framework that applies both loop and data transformations to a sequence of loop nests for optimizing locality. Our method uses a single linear algebra framework to represent both data layouts and loop transformations. It computes a nonsingular loop transformation matrix such that, in a given loop nest, data locality is exploited in the innermost loops, where it is most useful. The inverse of the nonsingular transformation matrix is built column by column, starting from the rightmost column. In addition, our approach can work in those cases where the data layouts of a subset of the referenced arrays are unknown; this is a key step in optimizing a sequence of loop nests and whole programs for locality. Experimental results on an SGI/Cray Origin 2000 nonuniform memory access multiprocessor machine show that our technique reduces execution times by as much as 70 percent.
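To make the central idea concrete, the sketch below shows how a nonsingular (here, unimodular) loop transformation matrix remaps an iteration space so that the innermost loop index matches the fastest-varying array dimension. This is only an illustration of the general mechanism, under the assumption of a two-deep nest and a row-major layout; the paper's actual algorithm for deriving the matrix (building its inverse column by column, rightmost first) is not reproduced here, and all names in the code are hypothetical.

```python
# Sketch: applying a loop transformation matrix T to a 2-deep iteration
# space. Each iteration vector (i, j) is mapped to T @ (i, j).

def transform_iterations(points, T):
    """Map each iteration vector (i, j) to (T[0].(i,j), T[1].(i,j))."""
    return [(T[0][0] * i + T[0][1] * j,
             T[1][0] * i + T[1][1] * j) for i, j in points]

# Original nest: for i in 0..N-1: for j in 0..N-1: access A[j][i].
# With a row-major layout, the innermost j-loop walks A column-wise
# (stride N), which has poor spatial locality.
N = 3
original = [(i, j) for i in range(N) for j in range(N)]

# Loop interchange is the simplest nonsingular transformation:
# T swaps the two loop indices (T is its own inverse here).
T = [[0, 1],
     [1, 0]]
transformed = transform_iterations(original, T)

# After the interchange, consecutive innermost iterations touch
# consecutive elements of each row of A, i.e., unit-stride accesses.
print(transformed[:4])  # [(0, 0), (1, 0), (2, 0), (0, 1)]
```

When array layouts differ across the references in a nest, no single interchange fixes all of them; the paper's contribution is choosing a general nonsingular T that balances locality across layouts, rather than the fixed permutation shown here.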


Index Terms:
Data reuse, cache locality, memory layouts, loop transformations, program optimization
Citation:
M. Kandemir, J. Ramanujam, A. Choudhary, P. Banerjee, "A Layout-Conscious Iteration Space Transformation Technique," IEEE Transactions on Computers, vol. 50, no. 12, pp. 1321-1336, Dec. 2001, doi:10.1109/TC.2001.970571