loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Quasidynamic Layout Optimizations for Improving Data Locality
November 2004 (vol. 15 no. 11)
pp. 996-1011

Abstract—Compiler-directed locality optimization techniques are effective in reducing the number of cycles spent in off-chip memory accesses. Recently, methods have been developed that transform memory layouts of data structures at compile-time to improve spatial locality of nested loops beyond current control-centric (loop nest-based) optimizations. Most of these data-centric transformations use a single static (program-wide) memory layout for each array. A disadvantage of these static layout-based locality enhancement strategies is that they might fail to optimize codes that manipulate arrays which demand different layouts in different parts of the code. In this paper, we introduce a new approach which extends current static layout optimization techniques by associating different memory layouts with the same array in different parts of the code. We call this strategy "quasidynamic layout optimization.” In this strategy, the compiler determines memory layouts (for different parts of the code) at compile time, but layout conversions occur at runtime. We show that the possibility of dynamically changing memory layouts during the course of execution adds a new dimension to the data locality optimization problem. Our strategy employs a static layout optimizer module as a building block and, by repeatedly invoking it for different parts of the code, it checks whether runtime layout modifications bring additional benefits beyond static optimization. Our experiments indicate significant improvements in execution time over static layout-based locality enhancing techniques.

[1] 996 I. Al-Furaih and S. Ranka, “Memory Hierarchy Management for Iterative Graph Structures,” Proc. 12th Int'l Parallel Processing Symp., Apr. 1998.[2] J. Anderson, S. Amarasinghe, and M. Lam, “Data and Computation Transformations for Multiprocessors,” Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, July 1995.[3] U. Banerjee, “Unimodular Transformations of Double Loops,” Advances in Languages and Compilers for Parallel Processing, A. Nicolau et al., eds., MIT Press, 1991.[4] P. Banerjee, J. Chandy, M. Gupta, J.G. Holm, A. Lain, D.J. Palermo, S. Ramaswamy, and E. Su, “Overview of the PARADIGM Compiler for Distributed Memory Message-Passing Multicomputers,” Computer, pp. 37-47, Oct. 1995.[5] D.R. Chakrabarti, N. Shenoy, A. Choudhary, and P. Banerjee, “An Efficient Uniform Run-Time Scheme for Mixed Regular-Irregular Applications,” Proc. Int'l Conf. Supercomputing, July 1998.[6] F.T. Chong, B.-H. Lim, R. Bianchini, J. Kubiatowicz, and A. Agarwal, “Application Performance on the MIT Alewife Machine,” Computer, vol. 29, no. 12, pp. 57-64, Dec. 1996.[7] M. Cierniak and W. Li, “Unifying Data and Control Transformations for Distributed Shared Memory Machines,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, June 1995.[8] S. Coleman and K. McKinley, “Tile Size Selection Using Cache Organization and Data Layout,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, June 1995.[9] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang, “Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures,” J. Parallel and Distributed Computing, vol. 22, no. 3, pp. 462-479, Sept. 1994.[10] C. Ding and K. Kennedy, “Inter-Array Data Regrouping,” Proc. 12th Workshop Languages and Compilers for Parallel Computing, Aug. 1999.[11] C. Ding and K. Kennedy, “Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Runtime,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, May 1999.[12] K.M. Dixit, “New CPU Benchmark Suites from SPEC,” Proc. COMPCON '92— 37th IEEE Computer Soc. Int'l Conf., Feb. 1992.[13] P. Feautrier, “Dataflow Analysis of Array and Scalar References,” Int'l J. Parallel Programming, vol. 20, no. 1, pp. 23-51, 1991.[14] J. Garcia, E. Ayguade, and J. Labarta, “A Novel Approach towards Automatic Data Distribution,” Proc. Supercomputing Conf. '95, Dec. 1995.[15] J. Garcia, E. Ayguade, and J. Labarta, “Dynamic Data Distribution with Control Flow Analysis,” Proc. Supercomputing Conf. '96, Nov. 1996.[16] S. Ghosh, M. Martonosi, and S. Malik, “Precise Miss Analysis for Program Transformations with Caches of Arbitrary Associativity,” Proc. Eighth Int'l Symp. Architectural Support for Programming Languages and Operating Systems, Oct. 1998.[17] M. Gupta and P. Banerjee, “Demonstration of Automated Data Partitioning Techniques in Parallelizing Compilers for Distributed Memory Multiprocessors,” IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp. 179-193, Mar. 1992.[18] M. Gupta and P. Banerjee, “Compile-Time Estimation of Communication Costs in Programs,” J. Programming Languages, vol. 2, pp. 191-225, 1994.[19] H. Han and C.-W. Tseng, “Improving Locality for Adaptive Irregular Scientific Codes,” Proc. 13th Int'l Workshop Languages and Compilers for Parallel Computing, Aug. 2000.[20] F. Irigoin and R. Triolet, “Supernode Partitioning,” Proc. 15th Ann. ACM Symp. Principles of Programming Languages, pp. 319-329, Jan. 1988.[21] Y. Ju and H. Dietz, “Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation,” Proc. Conf. Languages and Compiler for Parallel Computing, pp. 344-358, 1992.[22] M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and E. Ayguade, “An ILP Approach to Optimizing Cache Locality,” Proc. ACM Int'l Conf. Supercomputing, June 1999.[23] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, “A Matrix-Based Approach to the Global Locality Optimization Problem,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, Oct. 1998.[24] M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam, “A Hyperplane Based Approach for Optimizing Spatial Locality in Loop Nests,” Proc. 1998 ACM Int'l Conf. Supercomputing, July 1998.[25] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, “Improving Locality Using Loop and Data Transformations in an Integrated Framework,” Proc. Int'l Symp. Microarchitecture, pp. 285-296, Dec. 1998.[26] M. Kandemir and I. Kadayif, “Compiler-Directed Selection of Dynamic Memory Layouts,” Proc. ACM SIGDA/SIGSOFT Ninth Int'l Symp. Hardware/Software Codesign, Apr. 2001.[27] K. Kennedy and U. Kremer, “Automatic Data Layout for High Performance Fortran,” Proc. Supercomputing Conf. '95, Dec. 1995.[28] K. Kennedy and U. Kremer, “Automatic Data Layout for Distributed Memory Machines,” ACM Trans. Programming Languages and Systems, vol. 20, no. 4, pp. 869-916, ACM Press, July 1998.[29] K. Kennedy, K. McKinley, “Optimizing for Parallelism and Data Locality,” Proc. ACM Int'l Conf. Supercomputing, 1992.[30] I. Kodukula, N. Ahmed, and K. Pingali, “Data-Centric Multi-Level Blocking,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, June 1997.[31] M. Lam, E. Rothberg, and M. Wolf, “The Cache Performance of Blocked Algorithms,” Proc. Fourth ACM Int'l Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1991.[32] S.-T. Leung and J. Zahorjan, “Optimizing Data Locality by Array Restructuring,” Technical Report TR 95-09-01, Dept. of Computer Science and Eng., Univ. of Washington, Seattle, Wash., Sept. 1995.[33] W. Li, “Compiling for NUMA Parallel Machines,” PhD thesis, Computer Science Dept., Cornell Univ., Ithaca, New York, 1993.[34] J. Li and M. Chen, “Compiling Communication Efficient Programs for Massively Parallel Machines,” J. Parallel and Distributed Computing, vol. 2, no. 3, pp. 361-376, 1991.[35] LP_SOLVE, ftp://ftp.es.ele.tue.nl/publp_solve/, 2004.[36] K.S. McKinley and O. Temam, “Quantifying Loop Nest Locality Using SPEC '95 and the Perfect Benchmarks,” ACM Trans. Computer Systems, vol. 17, no. 4, pp. 288-336, Nov. 1999.[37] K. McKinley, S. Carr, and C.W. Tseng, “Improving Data Locality with Loop Transformations,” ACM Trans. Programming Languages and Systems, 1996.[38] J. Mellor-Crummey, D. Whalley, and K. Kennedy, “Improving Memory Hierarchy Performance for Irregular Applications,” Proc. ACM Int'l Conf. Supercomputing, June 1999.[39] N. Mitchell, L. Carter, and J. Ferrante, “Localizing Non-Affine Array References,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, Oct. 1999.[40] S.S. Muchnick, Advanced Compiler Design Implementation. San Francisco, Calif.: Morgan Kaufmann, 1997.[41] M. O'Boyle and P. Knijnenburg, “Non-Singular Data Transformations: Definition, Validity, Applications,” Proc. Sixth Workshop Compilers for Parallel Computers, pp. 287-297, 1996.[42] M. O'Boyle and P. Knijnenburg, “Integrating Loop and Data Transformations for Global Optimization,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, Oct. 1998.[43] D. Palermo and P. Banerjee, “Automatic Selection of Dynamic Data Partitioning Schemes for Distributed-Memory Multicomputers,” Proc. Eighth Workshop Languages and Compilers for Parallel Computing, pp. 392-406, 1995.[44] D. Palermo, E.W. Hodges, and P. Banerjee, “Techniques for Selecting and Analyzing Data Distributions,” Proc. Workshop Challenges in Compiling for Scalable Parallel Systems, Oct. 1996.[45] “Perfect Club: The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers,” Int'l J. Supercomputing Applications, vol. 3, no. 3, pp 5-40, 1989.[46] C. Polychronopoulos, M.B. Girkar, M.R. Haghighat, C.L. Lee, B.P. Leung, and D.A. Schouten, “Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors,” Proc. Int'l Conf. Parallel Processing, pp. 39-48, Aug. 1989.[47] G. Rivera and C.-W. Tseng, “Data Transformations for Eliminating Conflict Misses,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, June 1998.[48] S. Tandri and T. Abdelrahman, “Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors,” Proc. 1997 Int'l Conf. Parallel Processing, pp. 64-73, Aug. 1997.[49] O. Temam, C. Fricker, and W. Jalby, “Cache Interference Phenomena,” Proc. ACM SIGMETRICS Conf., 1994.[50] O. Temam, C. Fricker, and W. Jalby, “Impact of Cache Interferences on Usual Numerical Dense Loop Nests,” Proc. IEEE, special issue on computer performance evaluation, 1993.[51] O. Temam, E.D. Granston, and W. Jalby, “To Copy or Not to Copy: A Compile-Time Technique for Assessing When Data Copying Should be Used to Eliminate Cache Conflicts,” Proc. Supercomputing Conf., pp. 410-419, 1993.[52] R. Wilson et al., “SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers,” ACM SIGPLAN Notices, vol. 29, no. 12, pp. 31-37, Dec. 1994.[53] M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 30-44, June 1991.[54] M. Wolfe, “Optimizing Supercompilers for Supercomputers,” PhD thesis, Univ. of Illinois at Urbana-Champaign, Oct. 1982.[55] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley Publishing Company, 1996.

Index Terms:
Optimizing compilers, data locality, dynamic optimization, array-intensive computations.
Citation:
Ismail Kadayif, Mahmut Kandemir, "Quasidynamic Layout Optimizations for Improving Data Locality," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 11, pp. 996-1011, Nov. 2004, doi:10.1109/TPDS.2004.70
Usage of this product signifies your acceptance of the Terms of Use.