
Karin Högstedt, Larry Carter, Jeanne Ferrante, "On the Parallel Execution Time of Tiled Loops," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 3, pp. 307-321, Mar. 2003.
Abstract—Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multi-leveled hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: it handles only nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimize the execution time. We also explain the notion of