Optimal Semi-Oblique Tiling
September 2003 (vol. 14 no. 9)
pp. 944-960

Abstract: For 2D iteration space tiling, we address the problem of determining the tile parameters that minimize the total execution time on a parallel machine. We consider uniform dependency computations tiled so that at least one of the tile boundaries is parallel to the domain boundaries. We determine the optimal tile size as a closed-form solution. In addition, we determine the optimal number of processors and the optimal slope of the oblique tile boundary. Our results are based on the BSP model, which assures their portability. We validate our predictions on a global sequence alignment problem specialized to similar sequences using Fickett's k-band algorithm, for which our optimal semi-oblique tiling yields an improvement by a factor of 2.5 over orthogonal tiling. Our optimal solution requires a block-cyclic distribution of tiles to processors. The best one can obtain with only a block distribution (as many authors require) is three times slower. Furthermore, our best running time is within 10 percent of the "predicted theoretical peak" performance of the machine.
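
The contrast the abstract draws between block and block-cyclic distribution of tiles can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the choice of a single distributed tile dimension, and the sizes (12 tiles, 3 processors, blocks of 2 tiles) are hypothetical and chosen only for illustration.

```python
def block_owner(tile: int, num_tiles: int, num_procs: int) -> int:
    """Block distribution: each processor owns one contiguous run of tiles."""
    run = -(-num_tiles // num_procs)  # ceiling division: tiles per processor
    return tile // run


def block_cyclic_owner(tile: int, block: int, num_procs: int) -> int:
    """Block-cyclic distribution: blocks of `block` consecutive tiles
    are dealt to the processors in round-robin order."""
    return (tile // block) % num_procs


# Hypothetical sizes, for illustration only.
num_tiles, num_procs, block = 12, 3, 2

block_map = [block_owner(t, num_tiles, num_procs) for t in range(num_tiles)]
cyclic_map = [block_cyclic_owner(t, block, num_procs) for t in range(num_tiles)]

print("block:       ", block_map)
print("block-cyclic:", cyclic_map)
```

Under the block mapping each processor touches the distributed dimension once, whereas under the block-cyclic mapping every processor returns to it several times. The sketch shows only the ownership pattern; it does not model the execution time or the optimal parameters derived in the paper.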

Index Terms:
2D uniform recurrences, biological sequence alignment, BSP model, communication-computation granularity, distributed memory machines, locality, loop blocking, MPI, perfect loop nests, SPMD.
Citation:
Rumen Andonov, Stefan Balev, Sanjay Rajopadhye, Nicola Yanev, "Optimal Semi-Oblique Tiling," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 9, pp. 944-960, Sept. 2003, doi:10.1109/TPDS.2003.1233716