An Efficient Code Generation Technique for Tiled Iteration Spaces
October 2003 (vol. 14 no. 10)
pp. 1021-1034
Nectarios Koziris, IEEE Computer Society

Abstract: This paper presents a novel approach to generating code for nested for-loops that have been transformed by a tiling transformation. Tiling, or supernode transformation, has been widely used to improve locality in multilevel memory hierarchies, as well as to execute loops efficiently on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when nonrectangular tile shapes and iteration space bounds are involved. Our method considerably enhances previous work on rewriting tiled loops by considering parallelepiped tiles and arbitrary iteration space shapes. To generate tiled code, we first enumerate all tiles containing points within the iteration space and then sweep all points within each tile. For the first subproblem, we refine previous results on computing the new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, so that efficient code can be generated with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.
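
The two-step structure described in the abstract (enumerate the tiles, then sweep the points inside each tile) can be illustrated with a brute-force reference in Python. This is only a sketch under stated assumptions, not the paper's technique: the paper derives loop bounds symbolically, whereas this code simply classifies every point. The names tile_of and tiled_order, the matrix H, and the triangular example space are hypothetical and chosen for illustration. It assumes the common formulation in which a point j belongs to the tile with coordinates floor(H j), where the inverse of H holds the tile's side vectors.

```python
from fractions import Fraction
from math import floor

def tile_of(point, H):
    """Tile coordinates of a point: floor(H * point), taken componentwise."""
    return tuple(floor(sum(h * x for h, x in zip(row, point))) for row in H)

def tiled_order(space, H):
    """Yield (tile, point) pairs, visiting the iteration space tile by tile."""
    tiles = {}
    for p in space:
        tiles.setdefault(tile_of(p, H), []).append(p)
    for t in sorted(tiles):      # step 1: enumerate the non-empty tiles
        for p in tiles[t]:       # step 2: sweep the points inside the tile
            yield t, p

# Assumed example: parallelepiped tiles with side vectors (2, 1) and (0, 2),
# i.e. P = [[2, 0], [1, 2]] and H = P^-1 (exact rationals avoid rounding).
H = [[Fraction(1, 2), Fraction(0)],
     [Fraction(-1, 4), Fraction(1, 2)]]

# Assumed non-rectangular (triangular) iteration space: 0 <= j <= i < 6.
space = [(i, j) for i in range(6) for j in range(i + 1)]

order = list(tiled_order(space, H))
```

Because P is an integer matrix with determinant 4, no tile in this example holds more than four points; tiles cut by the boundary of the triangle hold fewer. Such an exhaustive version can serve as a check against generated loop nests, which must visit the same points in each tile.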

Index Terms:
Loop tiling, supernodes, nonunimodular transformations, Fourier-Motzkin elimination, code generation.
Citation:
Georgios Goumas, Maria Athanasaki, Nectarios Koziris, "An Efficient Code Generation Technique for Tiled Iteration Spaces," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 10, pp. 1021-1034, Oct. 2003, doi:10.1109/TPDS.2003.1239870