Neungsoo Park, Bo Hong, and Viktor K. Prasanna, "Tiling, Block Data Layout, and Memory Hierarchy Performance," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 7, pp. 640-654, July 2003.
DOI: http://doi.ieeecomputersociety.org/10.1109/TPDS.2003.1214317. ISSN: 1045-9219. Publisher: IEEE Computer Society, Los Alamitos, CA, USA.
Keywords: block data layout, tiling, TLB misses, cache misses, memory hierarchy.
Abstract: Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze the cache and TLB performance of such alternative layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for the optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques (copying, padding, etc.). The total miss cost is reduced considerably. Experiments on several platforms (UltraSparc II and III, Alpha, and Pentium III) show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout.
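To make the layouts named in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a square n x n matrix, a block size b that divides n, row-major order inside each block, and blocks stored in row-major order. The function names (`block_index`, `morton_index`, `to_block_layout`, `tiled_matmul`) are hypothetical, and the bit-interleaving order used for the Morton index is one common convention that may differ from the paper's definition.

```python
def block_index(i, j, n, b):
    """Offset of element (i, j) of an n x n matrix in block data layout.

    Assumes b x b blocks, b divides n, blocks stored in row-major order,
    and elements inside each block stored in row-major order.
    """
    blocks_per_row = n // b
    block_row, in_row = divmod(i, b)
    block_col, in_col = divmod(j, b)
    block_id = block_row * blocks_per_row + block_col
    return block_id * b * b + in_row * b + in_col


def morton_index(row, col):
    """Z-order (Morton) index: interleave the bits of row and col.

    Column bits go to even positions, row bits to odd positions.
    Applied to block coordinates, this orders blocks along a Z curve.
    """
    z = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        z |= ((col >> bit) & 1) << (2 * bit)
        z |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return z


def to_block_layout(a, n, b):
    """Copy a row-major list-of-lists matrix into a flat block-layout list."""
    if n % b != 0:
        raise ValueError("block size must divide the matrix size")
    flat = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[block_index(i, j, n, b)] = a[i][j]
    return flat


def tiled_matmul(a_blk, b_blk, n, b):
    """Tiled C = A * B where A, B, and C are all in block data layout.

    Each tile of the computation touches exactly one contiguous b x b block
    of each operand, which is the property that tiling with block data
    layout relies on.
    """
    nb = n // b          # number of blocks per dimension
    size = b * b         # elements per block
    c = [0] * (n * n)
    for bi in range(nb):
        for bj in range(nb):
            c_base = (bi * nb + bj) * size
            for bk in range(nb):
                a_base = (bi * nb + bk) * size
                b_base = (bk * nb + bj) * size
                for i in range(b):
                    for k in range(b):
                        a_ik = a_blk[a_base + i * b + k]
                        for j in range(b):
                            c[c_base + i * b + j] += a_ik * b_blk[b_base + k * b + j]
    return c
```

A caller would convert both operands once with `to_block_layout` and then pass the flat lists to `tiled_matmul`; the result is also in block data layout. The block size b is left as a free parameter here, since the paper's block size selection algorithm is not described in the abstract.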

