Loop Restructuring for Data I/O Minimization on Limited On-Chip Memory Embedded Processors
October 2002 (vol. 51, no. 10)
pp. 1269-1280

Abstract—In this paper, we propose a framework for analyzing the flow of values and their reuse in loop nests to minimize data traffic under the constraints of limited on-chip memory capacity and data dependences. Our analysis first fuses loop nests intraprocedurally where possible and then performs loop distribution. The analysis discovers the closeness factor of two statements, a quantitative measure of the data traffic saved per unit of memory occupied when the statements are placed under the same loop nest rather than under different loop nests. We then develop a greedy algorithm that traverses the program dependence graph (PDG) to legally group statements under the same loop nest, promoting maximal reuse per unit of memory occupied. We implemented our framework in Petit, a tool for dependence analysis and loop transformations, and compared our method with one based on tiling of the fused loop nest and one based on a greedy strategy that purely maximizes reuse. Our method outperforms both of these strategies in most cases for processors such as the TMS320Cxx, which have a very limited amount of on-chip memory. The improvements in data I/O range from 10 to 30 percent over tiling and from 10 to 40 percent over maximal reuse for JPEG loops.
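The closeness-factor-guided grouping described in the abstract can be sketched as a toy greedy procedure. This is an illustrative sketch only, not the authors' implementation: the statement names, traffic and memory numbers, and the simplified legality set are all invented for the example, and real legality checking would consult the PDG for dependence violations.

```python
# Illustrative sketch of greedy fusion guided by the "closeness factor":
# data traffic saved per unit of on-chip memory occupied when two
# statements share a loop nest. All values below are made-up examples.

def closeness(traffic_saved, extra_memory):
    """Traffic saved per unit of memory occupied (higher is better)."""
    return traffic_saved / extra_memory

def greedy_fuse(pairs, capacity, legal):
    """Greedily choose statement pairs to place under shared loop nests.

    pairs:    {(s1, s2): (traffic_saved, extra_memory)}
    capacity: on-chip memory budget
    legal:    pairs whose fusion preserves dependences (stand-in for a
              real PDG-based legality check)
    Returns the chosen pairs and the total data traffic saved.
    """
    # Rank candidate pairs by closeness factor, best first.
    ranked = sorted(pairs, key=lambda p: closeness(*pairs[p]), reverse=True)
    used_mem, saved, chosen = 0, 0, []
    for p in ranked:
        traffic, mem = pairs[p]
        # Accept a pair only if fusion is legal and memory still fits.
        if p in legal and used_mem + mem <= capacity:
            chosen.append(p)
            used_mem += mem
            saved += traffic
    return chosen, saved

# Toy example: three candidate statement pairs.
pairs = {("S1", "S2"): (40, 8),   # closeness 5.0
         ("S2", "S3"): (30, 10),  # closeness 3.0
         ("S1", "S3"): (20, 2)}   # closeness 10.0, but fusion is illegal
legal = {("S1", "S2"), ("S2", "S3")}
chosen, saved = greedy_fuse(pairs, capacity=16, legal=legal)
```

With a budget of 16 units, the highest-closeness pair is rejected as illegal, (S1, S2) is accepted, and (S2, S3) no longer fits in memory, so only (S1, S2) is fused, saving 40 units of traffic.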

[1] L. Wang, W. Tembe, and S. Pande, “Framework for Loop Distribution on Limited Memory Processors,” Proc. Int'l Conf. Compiler Construction (CC), pp. 141-156, Mar. 2000.
[2] K. Kennedy, “Fast Greedy Weighted Fusion,” Proc. Int'l Conf. Supercomputing, May 2000.
[3] J. Eyre and J. Bier, “DSP Processors Hit the Mainstream,” Computer, vol. 31, no. 8, pp. 51-59, Aug. 1998.
[4] Omega net.
[5] Texas Instruments, TMS320C5x User's Guide.
[6] Embedded Java.
[7] A. Sundaram and S. Pande, “Compiler Optimizations for Real Time Execution of Loops on Limited Memory Embedded Systems,” Proc. IEEE Int'l Real-Time Systems Symp., pp. 154-164.
[8] A. Sundaram and S. Pande, “An Efficient Data Partitioning Method for Limited Memory Embedded Systems,” Proc. ACM Special Interest Group Programming Languages (SIGPLAN) Workshop Languages, Compilers, and Tools for Embedded Systems (in conjunction with PLDI), pp. 205-218, 1998.
[9] F. Irigoin and R. Triolet, “Supernode Partitioning,” Proc. 15th ACM Symp. Principles of Programming Languages, pp. 319-329, Jan. 1988.
[10] J. Ramanujam and P. Sadayappan, “Tiling Multidimensional Iteration Spaces for Multicomputers,” J. Parallel and Distributed Computing, vol. 16, pp. 108-120, 1992.
[11] U. Banerjee, Loop Transformations for Restructuring Compilers, Kluwer Academic Publishers, Boston, Mass., 1997.
[12] W. Li, “Compiling for NUMA Parallel Machines,” PhD thesis, Dept. of Computer Science, Cornell Univ., 1993.
[13] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, 1996.
[14] M. Wolfe, “Iteration Space Tiling for Memory Hierarchies,” Proc. Third SIAM Conf. Parallel Processing for Scientific Computing, Dec. 1987.
[15] R. Schreiber and J. Dongarra, “Automatic Blocking of Nested Loops,” technical report, RIACS, NASA Ames Research Center, and Oak Ridge Nat'l Laboratory, May 1990.
[16] I. Kodukula, N. Ahmed, and K. Pingali, “Data-Centric Multi-Level Blocking,” Proc. Programming Language Design and Implementation (PLDI '97), June 1997.
[17] P. Panda, A. Nicolau, and N. Dutt, “Memory Organization for Improved Data Cache Performance in Embedded Processors,” Proc. Int'l Symp. System Synthesis, 1996.
[18] N. Mitchell, K. Hogstedt, L. Carter, and J. Ferrante, “Quantifying the Multi-Level Nature of Tiling Interactions,” Int'l J. Parallel Programming, vol. 26, no. 6, pp. 641-670, 1998.
[19] K. Cooper and T. Harvey, “Compiler Controlled Memory,” Proc. Eighth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1998.
[20] K. McKinley, S. Carr, and C.W. Tseng, “Improving Data Locality with Loop Transformations,” ACM Trans. Programming Languages and Systems, vol. 18, no. 4, pp. 424-453, July 1996.
[21] M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 30-44, June 1991.
[22] K. Kennedy and K.S. McKinley, “Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution,” Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds., pp. 301-321, Portland, Ore., Aug. 1993.
[23] G. Gao, R. Olsen, V. Sarkar, and R. Thekkath, “Collective Loop Fusion for Array Contraction,” Proc. Languages and Compilers for Parallel Computing (LCPC), 1992.
[24] J. Ferrante, K.J. Ottenstein, and J.D. Warren, “The Program Dependence Graph and Its Use in Optimization,” ACM Trans. Programming Languages and Systems, vol. 9, no. 3, pp. 319-349, July 1987.
[25] E. Dusterwald, R. Gupta, and M. Soffa, “A Practical Data-Flow Framework for Array Reference Analysis and Its Application in Optimization,” Proc. ACM Programming Language Design and Implementation (PLDI), pp. 68-77, 1993.
[26] R. Gupta and R. Bodik, “Array Data-Flow Analysis for Load-Store Optimizations in Super-Scalar Architectures,” Eighth Annual Workshop Languages and Compilers for Parallel Computing (LCPC), 1995. Also published in Int'l J. Parallel Computing, vol. 24, no. 6, pp. 481-512, 1996.

Index Terms:
Loop fusion, limited memory, embedded processors, data locality, program dependence graph.
Waibhav Tembe, Santosh Pande, "Loop Restructuring for Data I/O Minimization on Limited On-Chip Memory Embedded Processors," IEEE Transactions on Computers, vol. 51, no. 10, pp. 1269-1280, Oct. 2002, doi:10.1109/TC.2002.1039852