
Georgios Goumas, Maria Athanasaki, Nectarios Koziris, "An Efficient Code Generation Technique for Tiled Iteration Spaces," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 10, pp. 1021-1034, Oct. 2003.
BibTeX
@article{10.1109/TPDS.2003.1239870,
  author    = {Georgios Goumas and Maria Athanasaki and Nectarios Koziris},
  title     = {An Efficient Code Generation Technique for Tiled Iteration Spaces},
  journal   = {IEEE Transactions on Parallel and Distributed Systems},
  volume    = {14},
  number    = {10},
  issn      = {1045-9219},
  year      = {2003},
  pages     = {1021-1034},
  doi       = {http://doi.ieeecomputersociety.org/10.1109/TPDS.2003.1239870},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA}
}
Index Terms: Loop tiling, supernodes, nonunimodular transformations, Fourier-Motzkin elimination, code generation.
Abstract: This paper presents a novel approach to generating tiled code for nested for-loops that have been transformed by a tiling transformation. Tiling, or supernode transformation, has been widely used to improve locality in multilevel memory hierarchies and to execute loops efficiently on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when nonrectangular tile shapes and nonrectangular iteration space bounds are involved. Our method considerably extends previous work on rewriting tiled loops by handling parallelepiped tiles and arbitrary iteration space shapes. To generate tiled code, we first enumerate all tiles that contain points of the iteration space and then sweep all points within each tile. For the first subproblem, we refine previous results on computing the new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, so that efficient code can be generated with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.
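The two-step structure described in the abstract (enumerate the tiles, then sweep the points of each tile) can be illustrated with a small Python sketch. This is not the method of the paper: it is a naive bounding-box scan with a membership test, shown only to make the two subproblems concrete. It assumes a 2D rectangular iteration space and an integer matrix `P` whose columns are the tile side vectors; the function names `tile_of` and `tiled_order` are hypothetical.

```python
from itertools import product

def tile_of(point, P):
    """Tile index floor(H j) with H = P^{-1}, computed in exact integers."""
    (a, b), (c, d) = P
    det = a * d - b * c                      # must be nonzero
    j1, j2 = point
    # H = (1/det) * [[d, -b], [-c, a]]; Python's // floors for either sign of det.
    return ((d * j1 - b * j2) // det, (-c * j1 + a * j2) // det)

def tiled_order(n1, n2, P):
    """List (tile, point) pairs covering [0, n1) x [0, n2), one tile at a time."""
    (a, b), (c, d) = P
    # Step 1: enumerate candidate tiles. H j is linear in j, so the extreme
    # tile indices over the rectangle are reached at its corners.
    corners = [tile_of(p, P) for p in product((0, n1 - 1), (0, n2 - 1))]
    lo = [min(t[k] for t in corners) for k in range(2)]
    hi = [max(t[k] for t in corners) for k in range(2)]

    order = []
    for t1 in range(lo[0], hi[0] + 1):
        for t2 in range(lo[1], hi[1] + 1):
            # Step 2: sweep the points of this tile. The tile lies inside the
            # parallelogram with vertices P (t + e), e in {0, 1}^2, so scan its
            # bounding box (clipped to the iteration space) and filter.
            verts = [(a * (t1 + e1) + b * (t2 + e2),
                      c * (t1 + e1) + d * (t2 + e2))
                     for e1, e2 in product((0, 1), repeat=2)]
            x_lo = max(0, min(v[0] for v in verts))
            x_hi = min(n1 - 1, max(v[0] for v in verts))
            y_lo = max(0, min(v[1] for v in verts))
            y_hi = min(n2 - 1, max(v[1] for v in verts))
            for j1 in range(x_lo, x_hi + 1):
                for j2 in range(y_lo, y_hi + 1):
                    if tile_of((j1, j2), P) == (t1, t2):
                        order.append(((t1, t2), (j1, j2)))
    return order

if __name__ == "__main__":
    # Tile side vectors (4, 0) and (2, 3) are the columns of P.
    P = ((4, 2), (0, 3))
    for tile, point in tiled_order(10, 7, P)[:8]:
        print(tile, point)
```

The membership test in the innermost loop and the empty candidate tiles near the boundary are the kind of overhead that more refined bound computations aim to avoid; the sketch trades efficiency for brevity.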