M. Kandemir, J. Ramanujam, A. Choudhary, P. Banerjee, "A Layout-Conscious Iteration Space Transformation Technique," IEEE Transactions on Computers, vol. 50, no. 12, pp. 1321-1336, December 2001.
Exploiting locality of references has become extremely important in realizing the potential performance of modern machines with deep memory hierarchies. The data access patterns of programs and the memory layouts of the accessed data sets play a critical role in determining the performance of applications running on these machines. This paper presents a cache locality optimization technique that can optimize a loop nest even if the arrays referenced have different layouts in memory. Such a capability is required for a global locality optimization framework that applies both loop and data transformations to a sequence of loop nests for optimizing locality. Our method uses a single linear algebra framework to represent both data layouts and loop transformations. It computes a nonsingular loop transformation matrix such that, in a given loop nest, data locality is exploited in the innermost loops, where it is most useful. The inverse of a nonsingular transformation matrix is built column-by-column, starting from the rightmost column. In addition, our approach can work in those cases where the data layouts of a subset of the referenced arrays are unknown; this is a key step in optimizing a sequence of loop nests and whole programs for locality. Experimental results on an SGI/Cray Origin 2000 nonuniform memory access multiprocessor machine show that our technique reduces execution times by as much as 70 percent.
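The core idea of the abstract — picking the rightmost column of the inverse transformation matrix so that the new innermost loop achieves locality for references with possibly different memory layouts — can be illustrated with a small sketch. This is not the paper's exact algorithm; it is a hypothetical brute-force search over small integer direction vectors, assuming two-valued layouts (row-major vs. column-major) and linear access matrices, just to show the locality test that such a column must satisfy.

```python
# Sketch (assumption, not the paper's algorithm): choose the rightmost
# column q of the inverse loop-transformation matrix T^{-1}. After the
# transformation, the new innermost iterator moves along direction q in
# the original iteration space, so a reference with access matrix A
# steps through its array with per-dimension strides A @ q. Spatial
# locality holds when A @ q is nonzero only in the array's
# fastest-varying dimension (last axis for row-major, first for
# column-major); temporal locality holds when A @ q is all zero.

from itertools import product

def has_locality(access, layout, q):
    """True if innermost direction q yields temporal or spatial
    locality for one reference. `access` has one row per array
    dimension, one column per loop iterator; `layout` is 'row'/'col'."""
    stride = [sum(a * b for a, b in zip(row, q)) for row in access]
    fast = len(stride) - 1 if layout == 'row' else 0
    if all(s == 0 for s in stride):          # temporal locality
        return True
    # spatial locality: movement only along the fastest-varying dim
    return all(s == 0 for i, s in enumerate(stride) if i != fast)

def best_last_column(refs, n, bound=1):
    """Search small integer vectors for the q giving locality to the
    most references (a stand-in for the paper's constructive step)."""
    best, best_score = None, -1
    for q in product(range(-bound, bound + 1), repeat=n):
        if all(v == 0 for v in q):
            continue
        score = sum(has_locality(a, l, q) for a, l in refs)
        if score > best_score:
            best, best_score = list(q), score
    return best, best_score

# Example: a 2-deep nest over (i, j) touching two arrays with
# *different* layouts, as the abstract emphasizes:
#   A[i][j], stored row-major; B[j][i], stored column-major.
A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0]]
refs = [(A, 'row'), (B, 'col')]
q, score = best_last_column(refs, 2)
print(q, score)  # a unit vector along j: both references get locality
```

Note that with these mixed layouts a single innermost direction (varying `j` fastest) satisfies both references, which is exactly the situation a layout-oblivious loop transformation could not handle.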