On the Parallel Execution Time of Tiled Loops
March 2003 (vol. 14, no. 3)
pp. 307-321

Abstract—Many computationally intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multileveled hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: it handles only loop nests of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which a compiler can use to automatically determine the tiling parameters that minimize the execution time. We also introduce the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool for predicting the execution time of a tiled loop: it allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of convex iteration spaces, and show how to choose the optimal tile shape.
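To make the tiling transformation concrete, the following minimal C sketch shows a conventional rectangular tiling of a depth-two stencil loop. It illustrates the general technique only, not the polyhedral tiling developed in the paper, and the array, bounds, and tile sizes (a, N, TI, TJ) are hypothetical.

#include <stdio.h>

#define N  1024   /* iteration-space extent (hypothetical) */
#define TI 64     /* tile size in the i dimension (hypothetical) */
#define TJ 64     /* tile size in the j dimension (hypothetical) */

static double a[N][N];

/* Original loop nest: a regular stencil with dependences (1,0) and (0,1). */
static void stencil(void)
{
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
}

/* Rectangularly tiled version: the outer loops enumerate TI x TJ tiles,
 * the inner loops sweep the iterations inside one tile. Sequential
 * lexicographic order over (ii, jj) respects both dependences; a parallel
 * schedule would execute anti-diagonal wavefronts of tiles. */
static void stencil_tiled(void)
{
    for (int ii = 1; ii < N; ii += TI)
        for (int jj = 1; jj < N; jj += TJ)
            for (int i = ii; i < ii + TI && i < N; i++)
                for (int j = jj; j < jj + TJ && j < N; j++)
                    a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
}

int main(void)
{
    /* Run the untiled version and record a reference value. */
    for (int k = 0; k < N; k++)
        a[0][k] = a[k][0] = 1.0;   /* nonzero boundary values */
    stencil();
    double ref = a[N - 1][N - 1];

    /* Reset the interior and run the tiled version; since the tiled
     * execution order respects all dependences, the results must agree. */
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = 0.0;
    stencil_tiled();
    printf("untiled %.6f, tiled %.6f\n", ref, a[N - 1][N - 1]);
    return 0;
}

Under such an orthogonal tiling, each tile depends on its left and lower neighbors, so the longest path of dependent tiles runs diagonally across the roughly (N/TI) x (N/TJ) grid of tiles and has length about N/TI + N/TJ - 1. The paper generalizes this kind of critical-path reasoning, via the rise, to polyhedral tiles and loop nests of arbitrary depth.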


Index Terms:
Tiling, blocking, compiler optimization, parallel compilers.
Citation:
Karin Högstedt, Larry Carter, Jeanne Ferrante, "On the Parallel Execution Time of Tiled Loops," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 3, pp. 307-321, March 2003, doi:10.1109/TPDS.2003.1189587