Tiling, Block Data Layout, and Memory Hierarchy Performance
July 2003 (vol. 14 no. 7)
pp. 640-654

Abstract: Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze the cache and TLB performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for the optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling combined with block data layout, with a block size given by our block size selection algorithm, reduces TLB misses by up to 93 percent compared with other techniques (copying, padding, etc.). The total miss cost is reduced considerably. Experiments on several platforms (UltraSparc II and III, Alpha, and Pentium III) show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than matrix multiplication using Morton data layout.
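The abstract names the two layouts without defining them, so the following is only an illustrative sketch under commonly used definitions, not the paper's own code. It assumes an n x n matrix split into b x b blocks, with each block stored contiguously in row-major order; block data layout orders the blocks row by row, while Morton layout orders them along a Z-order curve obtained by interleaving the bits of the block coordinates. All function names, the matrix size, and the page size (1024 elements per page) are hypothetical choices made for the example.

```python
def row_major_offset(i, j, n):
    """Offset of element (i, j) in a conventional row-major n x n matrix."""
    return i * n + j


def block_layout_offset(i, j, n, b):
    """Offset of (i, j) in block data layout (assumed: blocks in row-major order)."""
    assert n % b == 0, "sketch assumes the block size divides the matrix size"
    blocks_per_row = n // b
    bi, bj = i // b, j // b      # which block the element falls in
    ii, jj = i % b, j % b        # position inside that block
    return (bi * blocks_per_row + bj) * b * b + ii * b + jj


def morton_index(bi, bj):
    """Interleave the bits of block row bi (odd bits) and block column bj (even bits)."""
    z, bit = 0, 0
    while (bi >> bit) or (bj >> bit):
        z |= ((bj >> bit) & 1) << (2 * bit)
        z |= ((bi >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return z


def morton_layout_offset(i, j, b):
    """Offset of (i, j) in Morton layout (assumed: Z-order over b x b blocks)."""
    bi, bj = i // b, j // b
    ii, jj = i % b, j % b
    return morton_index(bi, bj) * b * b + ii * b + jj


def pages_touched(offsets, page_elems):
    """Number of distinct pages covered by a collection of element offsets."""
    return len({off // page_elems for off in offsets})


# Hypothetical parameters: 1024 x 1024 matrix, 32 x 32 tile, 1024 elements per page.
N, B, PAGE = 1024, 32, 1024
tile = [(i, j) for i in range(B) for j in range(B)]
row_major_pages = pages_touched([row_major_offset(i, j, N) for i, j in tile], PAGE)
block_pages = pages_touched([block_layout_offset(i, j, N, B) for i, j in tile], PAGE)
```

With these assumed parameters, one tile in the row-major layout spans a separate page for each of its rows, whereas the same tile in block data layout occupies a single contiguous page-sized region. This is the intuition behind the reduction in TLB misses that the abstract reports; the actual figures depend on the platform and on the block size chosen by the selection algorithm.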

Index Terms:
Block data layout, tiling, TLB misses, cache misses, memory hierarchy.
Citation:
Neungsoo Park, Bo Hong, Viktor K. Prasanna, "Tiling, Block Data Layout, and Memory Hierarchy Performance," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 7, pp. 640-654, July 2003, doi:10.1109/TPDS.2003.1214317