Static and Dynamic Locality Optimizations Using Integer Linear Programming
September 2001 (vol. 12 no. 9)
pp. 922-941

Abstract—The delivered performance on modern processors that employ deep memory hierarchies is closely tied to the performance of the memory subsystem. Compiler optimizations aimed at improving cache locality are therefore critical to realizing the performance potential of powerful processors. For scientific applications, several loop transformations have been shown to improve both temporal and spatial locality. Recently, there has also been work on data layout optimizations, i.e., changing the memory layouts of multidimensional arrays from the language-defined default, such as column-major storage in Fortran. Such memory layout decisions affect the spatial locality characteristics of loop nests. While data layout transformations are not constrained by data dependences, they have no effect on temporal locality. Loop transformations, on the other hand, are constrained by data dependences and are not readily applicable to imperfect loop nests. More importantly, loop transformations affect the memory access patterns of all the arrays accessed in a loop nest and, as a result, may worsen the locality characteristics of some of those arrays. This paper presents a technique based on integer linear programming (ILP) that attempts to derive the best combination of loop and data layout transformations. Prior attempts to unify loop and data layout transformations for programs consisting of a sequence of loop nests have relied on heuristics, not only for transforming a single loop nest but also for the order in which loop nests are considered. The ILP formulation presented here obviates the need for such heuristics and provides a baseline against which heuristic algorithms can be compared. More importantly, our approach is able to transform memory layouts dynamically during program execution, which is particularly useful in applications whose disjoint code segments demand different layouts for a given array. In addition, we show how this formulation can be extended to address the false sharing problem in a multiprocessor environment. The key data structure we introduce is the memory layout graph (MLG), which allows us to formulate these problems as path problems. The paper discusses the relationship of this MLG-based ILP approach to other work in the area, including our previous work. Experimental results on a MIPS R10000-based system demonstrate the benefits of the approach and show that use of the ILP formulation does not increase compilation time significantly.
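As a concrete illustration of the static selection problem the abstract describes, the sketch below picks one memory layout per array so as to minimize an estimated total miss cost across all loop nests. The cost table, array names, and two-layout alphabet (row-major vs. column-major) are invented for illustration; the paper's formulation encodes this choice as a 0-1 integer linear program over the memory layout graph, whereas this sketch simply enumerates the assignments.

```python
from itertools import product

# Invented per-nest miss-cost estimates for each static layout choice of
# two arrays; "rm" = row-major, "cm" = column-major.
COST = {
    "nest1": {"A": {"rm": 10, "cm": 40}, "B": {"rm": 25, "cm": 5}},
    "nest2": {"A": {"rm": 30, "cm": 15}, "B": {"rm": 8, "cm": 20}},
}

def best_static_layouts(cost):
    """Pick one layout per array minimizing total cost over all nests.

    Enumerates the 0-1 choices directly; the paper expresses the same
    selection as an integer linear program and hands it to a solver.
    """
    arrays = sorted({a for nest in cost.values() for a in nest})
    best = None
    for choice in product(["rm", "cm"], repeat=len(arrays)):
        assign = dict(zip(arrays, choice))
        total = sum(nest[a][assign[a]]
                    for nest in cost.values() for a in nest)
        if best is None or total < best[0]:
            best = (total, assign)
    return best

total, assign = best_static_layouts(COST)
print(total, assign)   # 65 {'A': 'rm', 'B': 'cm'}
```

In this toy the per-array costs happen to be separable; the paper's joint formulation couples them through candidate loop transformations, which change every array's access pattern in a nest simultaneously.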
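The dynamic-layout side of the formulation can be viewed as a shortest-path problem over a layered graph: one layer per loop nest, one node per candidate layout, and edge weights combining the access cost of a layout in the next nest with a remapping cost whenever the layout changes between nests. A minimal dynamic-programming sketch of that idea, with invented costs rather than the paper's cache-miss estimates, might look like:

```python
# Shortest-path sketch of the memory layout graph (MLG) idea for a
# single array: one layer per loop nest, one node per candidate layout.
ACCESS = [           # ACCESS[i][layout]: access cost of `layout` in nest i
    {"rm": 5, "cm": 30},
    {"rm": 40, "cm": 10},
    {"rm": 6, "cm": 25},
]
REMAP = 12           # cost of converting the layout between two nests

def dynamic_layouts(access, remap):
    """Cheapest layout per nest, allowing remapping between nests."""
    layouts = list(access[0])
    dist = {l: access[0][l] for l in layouts}  # best cost ending in layout l
    back = []                                  # back-pointers per layer
    for nest in access[1:]:
        new, prev = {}, {}
        for l in layouts:
            # Reach layout l either by keeping it or by remapping into it.
            cands = {p: dist[p] + (0 if p == l else remap) for p in layouts}
            best_p = min(cands, key=cands.get)
            new[l] = cands[best_p] + nest[l]
            prev[l] = best_p
        dist = new
        back.append(prev)
    end = min(dist, key=dist.get)
    path = [end]
    for prev in reversed(back):                # recover the layout sequence
        path.append(prev[path[-1]])
    path.reverse()
    return dist[end], path

cost, path = dynamic_layouts(ACCESS, REMAP)
print(cost, path)   # 45 ['rm', 'cm', 'rm']
```

Remapping between nests pays off exactly when the access savings exceed the conversion cost, which is the dynamic-layout trade-off the abstract mentions.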

[1] W. Abu-Sufah, “Improving the Performance of Virtual Memory Computers,” PhD thesis, Univ. of Illinois at Urbana-Champaign, Nov. 1978.
[2] J. Anderson, S. Amarasinghe, and M. Lam, “Data and Computation Transformations for Multiprocessors,” Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '95), pp. 166-178, July 1995.
[3] E. Ayguadé and J. Torres, “Partitioning the Statement Per Iteration Space Using Non-Singular Matrices,” Proc. 1993 Int'l Conf. Supercomputing (ICS '93), July 1993.
[4] M. Berkelaar, “lp_solve version 2.1,” Available from ftp://ftp.es.ele.tue.nl/pub/lp_solve, 2001.
[5] S. Carr, “Combining Optimization for Cache and Instruction-Level Parallelism,” Proc. 1996 Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '96), Oct. 1996.
[6] L. Carter, J. Ferrante, S. Hummel, B. Alpern, and K. Gatlin, “Hierarchical Tiling: A Methodology for High Performance,” Technical Report CS 96-508, Univ. of California, Santa Barbara, Nov. 1996.
[7] R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson, “Data-Distribution Support on Distributed-Shared Memory Multiprocessors,” Proc. SIGPLAN Conf. Programming Language Design and Implementation (PLDI '97), pp. 334-345, 1997.
[8] F.T. Chong, B.-H. Lim, R. Bianchini, J. Kubiatowicz, and A. Agarwal, “Application Performance on the MIT Alewife Machine,” Computer, vol. 29, no. 12, pp. 57-64, Dec. 1996.
[9] M. Cierniak and W. Li, “Unifying Data and Control Transformations for Distributed Shared Memory Machines,” Proc. SIGPLAN Conf. Programming Language Design and Implementation (PLDI '95), pp. 205-217, June 1995.
[10] S. Coleman and K. McKinley, “Tile Size Selection Using Cache Organization and Data Layout,” Proc. SIGPLAN '95 Conf. Programming Language Design and Implementation (PLDI'95), 1995.
[11] K.M. Dixit, “New CPU Benchmark Suites from SPEC,” Proc. COMPCON '92, 37th IEEE Computer Soc. Int'l Conf., Feb. 1992.
[12] A.A. Dubrulle, “A Version of EISPACK for the IBM 3090VF,” Technical Report TR G320-3510, IBM Scientific Center, Palo Alto, Calif., 1988.
[13] S. Eggers and T. Jeremiassen, “Eliminating False Sharing,” Proc. Int'l Conf. Parallel Processing (ICPP '91), vol. I, pp. 377-381, Aug. 1991.
[14] J. Ferrante, V. Sarkar, and W. Thrash, “On Estimating and Enhancing Cache Effectiveness,” Proc. Languages and Compilers for Parallel Computing (LCPC '91), pp. 328-343, 1991.
[15] D. Gannon, W. Jalby, and K. Gallivan, “Strategies for Cache and Local Memory Management by Global Program Transformations,” J. Parallel and Distributed Computing, vol. 5, no. 5, pp. 587-616, Oct. 1988.
[16] J. Garcia, E. Ayguadé, and J. Labarta, “A Novel Approach Towards Automatic Data Distribution,” Proc. Supercomputing '95, Dec. 1995.
[17] J. Garcia, E. Ayguadé, and J. Labarta, “Dynamic Data Distribution with Control Flow Analysis,” Proc. Supercomputing '96, Nov. 1996.
[18] M. Gupta and P. Banerjee, “Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers,” IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp. 179-193, Mar. 1992.
[19] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, second ed. San Mateo, Calif.: Morgan Kaufmann, 1995.
[20] High Performance Computational Chemistry Group. NWChem: A Computational Chemistry Package for Parallel Computers, Version 1.1. Richland, Wash.: Pacific Northwest Laboratory, 1995.
[21] T. Jeremiassen and S. Eggers, “Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations,” Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '95), pp. 179-188, July 1995.
[22] Y. Ju and H. Dietz, “Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation,” Proc. Languages and Compilers for Parallel Computing (LCPC '92), U. Banerjee et al., eds., pp. 344-358, 1992.
[23] M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and E. Ayguadé, “An Integer Linear Programming Approach for Optimizing Cache Locality,” Proc. 1999 ACM Int'l Conf. Supercomputing (ICS '99), pp. 500-509, June 1999.
[24] M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam, “A Hyperplane Based Approach for Optimizing Spatial Locality in Loop Nests,” Proc. 1998 ACM Int'l Conf. Supercomputing (ICS '98), pp. 69-76, July 1998.
[25] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, “A Matrix-Based Approach to the Global Locality Optimization Problem,” Proc. 1998 Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '98), Oct. 1998.
[26] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, “Improving Locality Using Loop and Data Transformations in an Integrated Approach,” Proc. 31st Ann. ACM/IEEE Int'l Symp. Microarchitecture, MICRO-31, Dec. 1998.
[27] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, “A Graph Based Framework to Detect Optimal Memory Layouts for Improving Data Locality,” Proc. Int'l Parallel Processing Symp. 99, Apr. 1999.
[28] M. Kandemir, A. Choudhary, J. Ramanujam, and M. Kandaswamy, “Locality Optimization Algorithms for Compilation of Out-of-Core Codes,” J. Information Science and Eng., vol. 14, no. 1, pp. 107-138, Mar. 1998.
[29] M. Kandemir, J. Ramanujam, and A. Choudhary, “A Compiler Algorithm for Optimizing Locality in Loop Nests,” Proc. 11th ACM Int'l Conf. Supercomputing (ICS '97), pp. 269-276, July 1997.
[30] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott, “The Omega Library Interface Guide,” Technical Report CS-TR-3445, Computer Science Dept., Univ. of Maryland, Mar. 1995.
[31] K. Kennedy and U. Kremer, “Automatic Data Layout for High Performance Fortran,” Proc. Supercomputing '95, Dec. 1995.
[32] I. Kodukula and K. Pingali, “Transformations of Imperfectly Nested Loops,” Proc. Supercomputing '96, Nov. 1996.
[33] I. Kodukula, N. Ahmed, and K. Pingali, “Data-Centric Multi-Level Blocking,” Proc. Programming Language Design and Implementation (PLDI '97), June 1997.
[34] M. Lam, E. Rothberg, and M. Wolf, “The Cache Performance and Optimizations of Blocked Algorithms,” Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '91), 1991.
[35] S.-T. Leung and J. Zahorjan, “Optimizing Data Locality by Array Restructuring,” Technical Report TR 95-09-01, Dept. of Computer Science and Eng., Univ. of Washington, Sept. 1995.
[36] W. Li, “Compiling for NUMA Parallel Machines,” PhD thesis, Cornell Univ., Ithaca, NY, 1993.
[37] J. Li and M. Chen, “Compiling Communication Efficient Programs for Massively Parallel Machines,” J. Parallel and Distributed Computing, vol. 2, no. 3, pp. 361-376, 1991.
[38] K. McKinley, S. Carr, and C. Tseng, “Improving Data Locality with Loop Transformations,” ACM Trans. Programming Languages and Systems, vol. 18, no. 4, pp. 424-453, July 1996.
[39] K.S. McKinley and O. Temam, “A Quantitative Analysis of Loop Nest Locality,” Proc. Seventh Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '96), 1996.
[40] F. McMahon, “The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range,” Technical Report UCRL-53745, Lawrence Livermore Nat'l Laboratory, Livermore, Calif., 1986.
[41] S.S. Muchnick, Advanced Compiler Design Implementation. San Francisco: Morgan Kaufmann, 1997.
[42] G. Nemhauser and L. Wolsey, Integer and Combinatorial Optimization, New York: John Wiley and Sons, 1988.
[43] M. O'Boyle and P. Knijnenburg, “Non-Singular Data Transformations: Definition, Validity, Applications,” Proc. Sixth Workshop Compilers for Parallel Computers (CPC '96), pp. 287-297, 1996.
[44] M. O'Boyle and P. Knijnenburg, “Integrating Loop and Data Transformations for Global Optimisation,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '98), pp. 14-17, Oct. 1998.
[45] D. Palermo and P. Banerjee, “Automatic Selection of Dynamic Data Partitioning Schemes for Distributed-Memory Multicomputers,” Proc. Eighth Workshop Languages and Compilers for Parallel Computing (LCPC '95), pp. 392-406, 1995.
[46] Perfect Club, “The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers,” Int'l J. Supercomputing Applications, vol. 3, no. 3, pp. 5-40, 1989.
[47] C. Polychronopoulos, M.B. Girkar, M.R. Haghighat, C.L. Lee, B.P. Leung, and D.A. Schouten, “Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors,” Proc. Int'l Conf. Parallel Processing (ICPP '89), vol. II, pp. 39-48, Aug. 1989.
[48] A.K. Porterfield, “Software Methods for Improving Cache Performance on Supercomputer Applications,” PhD thesis, Dept. Computer Science, Rice Univ., Houston, Texas, May 1989.
[49] G. Rivera and C. Tseng, “Data Transformations for Eliminating Conflict Misses,” Proc. 1998 ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '98), pp. 38-49, June 1998.
[50] R.H. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon, “The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching,” Proc. 10th Int'l Parallel Processing Symp. (IPPS '96), pp. 39-46, Apr. 1996.
[51] V. Sarkar, G. Gao, and S. Han, “Locality Analysis for Distributed Shared-Memory Multiprocessors,” Proc. Ninth Int'l Workshop Languages and Compilers for Parallel Computing (LCPC '96), Aug. 1996.
[52] N. Shenoy, M. Kandemir, D. Chakrabarti, A. Choudhary, and P. Banerjee, “Estimating Memory Access Costs on Distributed Shared Memory Multiprocessors,” technical report, Northwestern Univ., Evanston, Ill., 1998.
[53] S. Tandri and T. Abdelrahman, “Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors,” Proc. 1997 Int'l Conf. Parallel Processing (ICPP '97), pp. 64-73, Aug. 1997.
[54] O. Temam, C. Fricker, and W. Jalby, “Impact of Cache Interferences on Usual Numerical Dense Loop Nests,” Proc. IEEE, vol. 81, no. 8, pp. 1103-1115, 1993.
[55] J. Torrellas, M. Lam, and J. Hennessy, “False Sharing and Spatial Locality in Multiprocessor Caches,” IEEE Trans. Computers, vol. 43, no. 6, pp. 651-663, June 1994.
[56] M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm,” Proc. SIGPLAN Conf. Programming Language Design and Implementation (PLDI '91), pp. 30-44, June 1991.
[57] M. Wolf, D. Maydan, and D. Chen, “Combining Loop Transformations Considering Caches and Scheduling,” Proc. 29th Ann. ACM/IEEE Int'l Symp. Microarchitecture, MICRO-29, pp. 274-286, Dec. 1996.
[58] M. Wolfe, “More Iteration Space Tiling,” Proc. Supercomputing '89, pp. 655-664, Nov. 1989.
[59] M. Wolfe, High Performance Compilers for Parallel Computing. Calif.: Addison-Wesley, 1996.
[60] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz, “Performance Analysis Using the MIPS R10000 Performance Counters,” Proc. Supercomputing '96, Nov. 1996.
[61] H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers. New York: ACM Press, 1991.

Index Terms:
Data reuse, cache locality, memory layouts, compiler optimizations, cache miss estimation, integer linear programming.
Citation:
Mahmut Kandemir, Prithviraj Banerjee, Alok Choudhary, J. Ramanujam, Eduard Ayguadé, "Static and Dynamic Locality Optimizations Using Integer Linear Programming," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 9, pp. 922-941, Sept. 2001, doi:10.1109/TPDS.2001.1184186