
Marta Jiménez, José M. Llabería, Agustín Fernández, "A Cost-Effective Implementation of Multilevel Tiling," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 10, pp. 1006-1020, October 2003.
DOI: 10.1109/TPDS.2003.1239869
Index Terms: Compilers, multilevel tiling, loop transformations, memory hierarchy.
Abstract: This paper presents a new cost-effective algorithm to compute exact loop bounds when multilevel tiling is applied to a loop nest whose bounds are affine functions (a nonrectangular loop nest). Traditionally, exact loop bounds have not been computed because the complexity of the computation is doubly exponential in the number of loops in the multilevel tiled code and, therefore, for certain classes of loops (i.e., nonrectangular loop nests), it can be extremely time consuming. Although computing exact loop bounds is not very important when tiling only for cache levels, it is critical when tiling includes the register level. This paper presents an efficient implementation of multilevel tiling that computes exact loop bounds and has a much lower complexity than conventional techniques. To achieve this lower complexity, our technique deals with all levels to be tiled simultaneously, rather than applying tiling level by level as is usually done. For loop nests whose bounds are very simple affine functions, results show that our method is between 1.5 and 2.8 times faster than conventional techniques. For loop nests with more complex bounds, we have measured speedups as high as 2,300. Additionally, our technique allows redundant bounds to be eliminated efficiently. Results show that eliminating redundant bounds with our method is between 2.2 and 11 times faster than with conventional techniques for typical linear algebra programs.
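To make the notion of exact loop bounds concrete, the following minimal sketch tiles a triangular loop nest (0 <= j <= i < n) at a single level, once with hand-derived exact tile bounds and once with bounding-box bounds. It is an illustration only and not the algorithm of the paper: the bounds were derived by hand for this one nest, and the function names, the tile size `b`, and the choice of loop nest are all assumptions made for the example.

```python
def original(n):
    """Iteration space of the untiled triangular nest: 0 <= j <= i < n."""
    return [(i, j) for i in range(n) for j in range(i + 1)]


def tiled_exact(n, b):
    """One-level tiling with exact bounds (hand-derived, hypothetical example).

    The tile loop jj stops at ii, so every visited tile holds at least
    one iteration of the original nest.
    """
    points, tiles = [], 0
    for ii in range(0, n, b):
        i_hi = min(ii + b - 1, n - 1)
        for jj in range(0, ii + 1, b):          # exact bound: jj <= ii
            tiles += 1
            for i in range(ii, i_hi + 1):
                for j in range(jj, min(jj + b - 1, i) + 1):
                    points.append((i, j))
    return points, tiles


def tiled_bounding_box(n, b):
    """One-level tiling with bounding-box bounds on the tile loops.

    jj scans the whole range 0..n-1, so some visited tiles are empty.
    """
    points, tiles, empty = [], 0, 0
    for ii in range(0, n, b):
        i_hi = min(ii + b - 1, n - 1)
        for jj in range(0, n, b):               # inexact bound
            tiles += 1
            before = len(points)
            for i in range(ii, i_hi + 1):
                for j in range(jj, min(jj + b - 1, i) + 1):
                    points.append((i, j))
            if len(points) == before:
                empty += 1
    return points, tiles, empty


if __name__ == "__main__":
    n, b = 10, 4
    exact_pts, exact_tiles = tiled_exact(n, b)
    box_pts, box_tiles, box_empty = tiled_bounding_box(n, b)
    # Both versions execute the same iterations as the original nest,
    # but the bounding-box version also visits tiles with no work.
    assert sorted(exact_pts) == sorted(box_pts) == original(n)
    print(exact_tiles, box_tiles, box_empty)
```

Both variants cover the same iterations; the difference is only in how many tiles the outer loops enumerate. When the innermost tiles are meant to be fully unrolled, as in register tiling, visiting empty or partial tiles is what makes inexact bounds a problem, which matches the abstract's remark that exact bounds matter most at the register level.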