A Cost-Effective Implementation of Multilevel Tiling
October 2003 (vol. 14 no. 10)
pp. 1006-1020

Abstract: This paper presents a new cost-effective algorithm to compute exact loop bounds when multilevel tiling is applied to a loop nest whose bounds are affine functions (a nonrectangular loop nest). Traditionally, exact loop bounds have not been computed because the complexity of doing so is doubly exponential in the number of loops in the multilevel tiled code and can therefore be extremely time-consuming for certain classes of loops (i.e., nonrectangular loop nests). Although computing exact loop bounds is not very important when tiling only for cache levels, it is critical when tiling includes the register level. This paper presents an efficient implementation of multilevel tiling that computes exact loop bounds and has a much lower complexity than conventional techniques. To achieve this lower complexity, our technique deals with all levels to be tiled simultaneously, rather than applying tiling level by level as is usually done. For loop nests with very simple affine functions as bounds, results show that our method is between 1.5 and 2.8 times faster than conventional techniques. For loop nests with more complex bounds, we have measured speedups as high as 2,300. Additionally, our technique allows redundant bounds to be eliminated efficiently. Results show that eliminating redundant bounds with our method is between 2.2 and 11 times faster than with conventional techniques for typical linear algebra programs.
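To make the notion of exact loop bounds concrete, the following is a minimal single-level sketch in Python. It is not the algorithm of the paper: the triangular nest (0 <= j <= i < n), the tile size `b`, and the function names are assumptions chosen for illustration. With exact bounds, the tile loops enumerate only tiles that contain iterations; with bounding-box bounds, they also enumerate empty tiles.

```python
def original(n):
    """Triangular (nonrectangular) nest: 0 <= j <= i < n."""
    return [(i, j) for i in range(n) for j in range(i + 1)]


def tiled_exact(n, b):
    """Tile both loops with size b, using exact tile-loop bounds.

    A tile starting at (ii, jj) holds an iteration only if jj <= ii,
    so the jj loop stops at ii instead of scanning up to n.
    """
    visited, tiles = [], 0
    for ii in range(0, n, b):
        for jj in range(0, ii + 1, b):          # exact bound: jj <= ii
            tiles += 1
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, i + 1)):
                    visited.append((i, j))
    return visited, tiles


def tiled_bounding_box(n, b):
    """Same tiling, but the tile loops scan the full n x n bounding box.

    Tiles above the diagonal are entered even though they run no iterations.
    """
    visited, tiles = [], 0
    for ii in range(0, n, b):
        for jj in range(0, n, b):               # inexact bound: may be empty
            tiles += 1
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, i + 1)):
                    visited.append((i, j))
    return visited, tiles
```

Both versions visit the same iterations as the original nest, but for `n = 10` and `b = 4` the exact version enters 6 tiles while the bounding-box version enters 9. At the cache level a few empty tiles cost little; the abstract's point is that at the register level, where the tile body is typically fully unrolled, the bounds need to be exact.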

Index Terms:
Compilers, multilevel tiling, loop transformations, memory hierarchy.
Marta Jiménez, José M. Llabería, Agustín Fernández, "A Cost-Effective Implementation of Multilevel Tiling," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 10, pp. 1006-1020, Oct. 2003, doi:10.1109/TPDS.2003.1239869