Improving Data Locality by Array Contraction
September 2004 (vol. 53 no. 9)
pp. 1073-1084
Yonghong Song, Rong Xu, Cheng Wang, Zhiyuan Li, IEEE Computer Society
Array contraction is a program transformation that reduces array sizes while preserving the program's output. In this paper, we present an aggressive array-contraction technique and study its impact on memory system performance. This technique, called controlled SFC, combines loop shifting and controlled loop fusion to maximize opportunities for array contraction within a given loop nest. The controlled fusion scheme prevents overfusing loops, which would place excessive pressure on the cache and the registers. Reducing array sizes increases data reuse because, on average, more memory operations are performed on the same memory addresses. Furthermore, if the data accessed by a loop nest fits in the cache after array contraction, then repeated references to the same variables in the loop nest will generate cache hits, assuming set conflicts are successfully eliminated.


Index Terms:
Compiler, memory, optimization, performance, array contraction, data locality, loop shifting, optimizing compilers.
Citation:
Yonghong Song, Rong Xu, Cheng Wang, Zhiyuan Li, "Improving Data Locality by Array Contraction," IEEE Transactions on Computers, vol. 53, no. 9, pp. 1073-1084, Sept. 2004, doi:10.1109/TC.2004.62