Data Relation Vectors: A New Abstraction for Data Optimizations
August 2001 (vol. 50 no. 8)
pp. 798-810

Abstract—We present an abstraction, called data relation vectors, to improve the data access characteristics and memory layouts in regular computations. The key idea is to define a relation between the data elements accessed by close-by iterations and use this relation to guide a number of optimizations for array-based computations. The specific optimizations studied in this paper include enhancing group-spatial and self-spatial reuses and improving intratile and intertile data reuses. In addition, this abstraction works well with other known abstractions such as data reuse vectors. We also present a unified scheme for optimizing the memory performance of programs using this new abstraction in conjunction with reuse vectors. The data relation vector abstraction has been implemented in the SUIF compilation framework and has been tested using a set of 12 benchmarks from image processing and scientific computation domains. Preliminary results on a superscalar processor show that it is successful in reducing compilation time and outperforms two previously proposed techniques, one that uses only loop transformations and one that uses both loop and data transformations. Our experiments also show that the proposed abstraction helps one to select good data tile shapes which can subsequently be used to determine iteration space tiles.
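To make the abstract's key idea concrete, the following is a minimal sketch (not the authors' implementation) of how a data relation vector can be computed for an affine array reference: the vector of element-index differences between two close-by iterations is the reference's access matrix applied to the iteration-space step. The function name and matrix encoding here are hypothetical, for illustration only.

```python
def data_relation_vector(access_matrix, iter_step):
    # For an affine reference whose subscripts are access_matrix times the
    # iteration vector, the difference between the elements touched by two
    # close-by iterations is the matrix-vector product L @ step.
    return [sum(a * s for a, s in zip(row, iter_step)) for row in access_matrix]

# Reference A[i][j] inside a 2-deep (i, j) loop nest: identity access matrix.
L_A = [[1, 0], [0, 1]]
step = [0, 1]          # two successive innermost (j) iterations
print(data_relation_vector(L_A, step))   # [0, 1]

# Transposed reference A[j][i] under the same loop nest.
L_At = [[0, 1], [1, 0]]
print(data_relation_vector(L_At, step))  # [1, 0]
```

Under this reading, a relation vector of [0, 1] says that successive innermost iterations touch neighboring elements along the rightmost array dimension, so a row-major layout yields self-spatial reuse; a vector of [1, 0] instead suggests a column-major layout (or a loop interchange) to obtain the same locality.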

References:
[1] A. Agarwal, D. Kranz, and V. Natarajan, “Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared Memory Multiprocessors,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 9, pp. 943-962, Sept. 1995.
[2] I. Al-Furaih and S. Ranka, “Memory Hierarchy Management for Iterative Graph Structures,” Proc. 12th Int'l Parallel Processing Symp., Apr. 1998.
[3] S.P. Amarasinghe, J.M. Anderson, M.S. Lam, and C.W. Tseng, “The SUIF Compiler for Scalable Parallel Machines,” Proc. Seventh SIAM Conf. Parallel Processing for Scientific Computing, Feb. 1995.
[4] J. Anderson, “Automatic Computation and Data Decomposition for Multiprocessors,” PhD dissertation, Stanford Univ., Mar. 1997. Also available as Technical Report CSL-TR-97-179, Computer Systems Laboratory, Stanford Univ.
[5] R. Bhargava, L.K. John, B.L. Evans, and R. Radhakrishnan, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. IEEE Symp. Microarchitecture, pp. 37-46, Dec. 1998.
[6] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures. Boston: Kluwer Academic, 1996.
[7] R. Bordawekar, A. Choudhary, and J. Ramanujam, “Automatic Optimization of Communication in Out-of-Core Stencil Codes,” technical report, Scalable I/O Initiative, Center for Advanced Computing Research, Caltech, Nov. 1995. A short version appeared in Proc. ACM Int'l Conf. Supercomputing, 1996.
[8] R. Brickner, K. Holian, B. Thiagarajan, and S.L. Johnsson, “Designing a Stencil Compiler for the Connection Machine Model CM-5,” Technical Report LA-UR-94-3152, Los Alamos Nat'l Laboratory, 1994.
[9] M. Bromley, S. Heller, T. McNerney, and G. Steele Jr., “Fortran at Ten Gigaflops: The Connection Machine Convolution Compiler,” Proc. ACM SIGPLAN '91 Conf. Programming Language Design and Implementation, pp. 145-156, June 1991.
[10] L. Chen, W. Chen, Y. Jehng, and C. Church, “An Efficient Parallel Motion Estimation Algorithm for Digital Image Processing,” IEEE Trans. Circuits and Systems for Video Technology, vol. 1, no. 4, pp. 378-385, Dec. 1991.
[11] M. Cierniak and W. Li, “Unifying Data and Control Transformations for Distributed Shared Memory Machines,” Proc. SIGPLAN '95 Conf. Programming Language Design and Implementation, pp. 205-217, 1995.
[12] S. Coleman and K. McKinley, “Tile Size Selection Using Cache Organization and Data Layout,” Proc. SIGPLAN '95 Conf. Programming Language Design and Implementation, June 1995.
[13] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang, “Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures,” J. Parallel and Distributed Computing, vol. 22, no. 3, pp. 462-479, Sept. 1994.
[14] C. Ding and K. Kennedy, “Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Runtime,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, May 1999.
[15] C. Ding and K. Kennedy, “Inter-Array Data Regrouping,” Proc. 12th Workshop Languages and Compilers for Parallel Computing, Aug. 1999.
[16] J. Dongarra and R. Schreiber, “Automatic Blocking of Nested Loops,” Technical Report UT-CS-90-108, Dept. of Computer Science, Univ. of Tennessee, May 1990.
[17] K. Esseghir, “Improving Data Locality for Caches,” master's thesis, Dept. of Computer Science, Rice Univ., Sept. 1993.
[18] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, Volume 1: General Techniques and Regular Problems. Prentice Hall, 1988.
[19] D. Gannon, W. Jalby, and K. Gallivan, “Strategies for Cache and Local Memory Management by Global Program Transformations,” J. Parallel and Distributed Computing, vol. 5, no. 5, pp. 587-616, Oct. 1988.
[20] J. Garcia, E. Ayguade, and J. Labarta, “A Novel Approach towards Automatic Data Distribution,” Proc. Supercomputing '95, Dec. 1995.
[21] J. Garcia, E. Ayguade, and J. Labarta, “Dynamic Data Distribution with Control Flow Analysis,” Proc. Supercomputing '96, Nov. 1996.
[22] M.H. Gerndt, “Automatic Parallelization for Distributed-Memory Multiprocessing Systems,” PhD thesis, Univ. of Bonn, Dec. 1989.
[23] M. Gupta and P. Banerjee, “Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers,” IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp. 179-193, Mar. 1992.
[24] H. Han and C.-W. Tseng, “Improving Locality for Adaptive Irregular Scientific Codes,” Proc. 13th Int'l Workshop Languages and Compilers for High-Performance Computing, Aug. 2000.
[25] F. Irigoin and R. Triolet, “Supernode Partitioning,” Proc. 15th Ann. ACM Symp. Principles of Programming Languages, pp. 319-329, Jan. 1988.
[26] M. Kandemir, J. Ramanujam, and A. Choudhary, “Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed Memory Machines,” J. Parallel and Distributed Computing, vol. 60, no. 8, pp. 924-965, Aug. 2000.
[27] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, “Improving Locality Using Loop and Data Transformations in an Integrated Framework,” Proc. Int'l Symp. Microarchitecture (MICRO), Dec. 1998.
[28] M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam, “A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts,” IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 2, pp. 115-135, Feb. 1999.
[29] M. Kandemir, J. Ramanujam, and A. Choudhary, “Improving Cache Locality by a Combination of Loop and Data Transformations,” IEEE Trans. Computers, vol. 48, no. 2, pp. 159-167, Feb. 1999.
[30] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott, “The Omega Library Interface Guide,” Technical Report CS-TR-3445, Computer Science Dept., Univ. of Maryland, College Park, Mar. 1995.
[31] K. Kennedy and U. Kremer, “Automatic Data Layout for High Performance Fortran,” Proc. Supercomputing '95, Dec. 1995.
[32] K. Kennedy and K. McKinley, “Optimizing for Parallelism and Data Locality,” Proc. Int'l Conf. Supercomputing, 1992.
[33] I. Kodukula, N. Ahmed, and K. Pingali, “Data-Centric Multi-Level Blocking,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, June 1997.
[34] I. Kodukula, K. Pingali, R. Cox, and D. Maydan, “Imperfectly Nested Loop Transformations for Memory Hierarchy Management,” Proc. Int'l Conf. Supercomputing, June 1999.
[35] M. Lam, E. Rothberg, and M. Wolf, “The Cache Performance of Blocked Algorithms,” Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1991.
[36] S.-T. Leung and J. Zahorjan, “Optimizing Data Locality by Array Restructuring,” Technical Report TR 95-09-01, Dept. of Computer Science and Eng., Univ. of Washington, Sept. 1995.
[37] W. Li, “Compiling for NUMA Parallel Machines,” PhD thesis, Computer Science Dept., Cornell Univ., Ithaca, N.Y., 1993.
[38] J. Li and M. Chen, “Compiling Communication Efficient Programs for Massively Parallel Machines,” J. Parallel and Distributed Computing, vol. 2, no. 3, pp. 361-376, 1991.
[39] K. McKinley, S. Carr, and C.W. Tseng, “Improving Data Locality with Loop Transformations,” ACM Trans. Programming Languages and Systems, 1996.
[40] J. Mellor-Crummey, D. Whalley, and K. Kennedy, “Improving Memory Hierarchy Performance for Irregular Applications,” Proc. ACM Int'l Conf. Supercomputing, June 1999.
[41] N. Mitchell, L. Carter, and J. Ferrante, “Localizing Non-Affine Array References,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, Oct. 1999.
[42] M. O'Boyle and P. Knijnenburg, “Non-Singular Data Transformations: Definition, Validity, Applications,” Proc. Sixth Workshop Compilers for Parallel Computers, pp. 287-297, 1996.
[43] M. O'Boyle and P. Knijnenburg, “Integrating Loop and Data Transformations for Global Optimisation,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, Oct. 1998.
[44] W. Pugh and E. Rosser, “Iteration Space Slicing for Locality,” Proc. Int'l Workshop Languages and Compilers for Parallel Computing, Aug. 1999.
[45] J. Ramanujam and P. Sadayappan, “A Methodology for Parallelizing Programs for Multicomputers and Complex Memory Multiprocessors,” Proc. Supercomputing '89, pp. 637-646, Nov. 1989.
[46] J. Ramanujam and P. Sadayappan, “Tiling Multi-Dimensional Iteration Spaces for Multi-Computers,” J. Parallel and Distributed Computing, vol. 16, no. 2, pp. 108-120, Oct. 1992.
[47] D. Reed, L. Adams, and M.L. Patrick, “Stencils and Problem Partitionings: Their Influence on the Performance of Multiple Processor Systems,” IEEE Trans. Computers, vol. 36, no. 7, pp. 845-858, July 1987.
[48] G. Rivera and C.-W. Tseng, “Data Transformations for Eliminating Conflict Misses,” Proc. SIGPLAN '98 Conf. Programming Language Design and Implementation, June 1998.
[49] S. Tandri and T. Abdelrahman, “Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors,” Proc. Int'l Conf. Parallel Processing, pp. 64-73, Aug. 1997.
[50] M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm,” Proc. ACM SIGPLAN '91 Conf. Programming Language Design and Implementation, pp. 30-44, June 1991.
[51] M. Wolf, D. Maydan, and D. Chen, “Combining Loop Transformations Considering Caches and Scheduling,” Proc. Int'l Symp. Microarchitecture, pp. 274-286, Dec. 1996.
[52] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[53] J. Xue and C.-H. Huang, “Reuse-Driven Tiling for Data Locality,” Languages and Compilers for Parallel Computing, Z. Li et al., eds., Springer-Verlag, 1998.
[54] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz, “Performance Analysis Using the MIPS R10000 Performance Counters,” Proc. Supercomputing '96, Nov. 1996.

Index Terms:
Data reuse, cache locality, compiler optimizations for memory hierarchy, reuse vectors, data relation vectors, loop transformations, memory layouts.
Citation:
M. Kandemir, J. Ramanujam, "Data Relation Vectors: A New Abstraction for Data Optimizations," IEEE Transactions on Computers, vol. 50, no. 8, pp. 798-810, Aug. 2001, doi:10.1109/12.947000