Issue No.01 - January (2011 vol.22)
pp: 105-118
Byunghyun Jang, Northeastern University, Boston
Dana Schaa, Northeastern University, Boston
Perhaad Mistry, Northeastern University, Boston
David Kaeli, Northeastern University, Boston
ABSTRACT
The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
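To make the idea of a data transformation driven by a loop's memory access pattern more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation and says nothing about the GPU-specific details of the paper. The function names (`deinterleave`, `sum_strided`, `sum_contiguous`) and the interleaved sample data are assumptions made for illustration. The sketch regroups an array that a loop reads with a fixed stride, so that each access stream becomes contiguous, which is the kind of layout that vector loads generally favor.

```python
# Illustrative sketch only: hypothetical helper names, not taken from the paper.

def deinterleave(data, stride):
    """Regroup a strided layout so that each access stream is contiguous.

    Element i of stream k is data[k + i * stride].
    """
    return [data[offset::stride] for offset in range(stride)]


def sum_strided(data, stride, offset):
    """Original loop: touches every `stride`-th element (non-contiguous access)."""
    total = 0
    for i in range(offset, len(data), stride):
        total += data[i]
    return total


def sum_contiguous(stream):
    """Transformed loop: unit-stride access over one regrouped stream."""
    total = 0
    for value in stream:
        total += value
    return total


# Assumed sample layout: x0, y0, x1, y1, ...
interleaved = [1, 10, 2, 20, 3, 30, 4, 40]
xs, ys = deinterleave(interleaved, 2)

# Both loops compute the same result; only the memory access pattern differs.
assert sum_strided(interleaved, 2, 0) == sum_contiguous(xs)
assert sum_strided(interleaved, 2, 1) == sum_contiguous(ys)
```

The transformation has a one-time cost for rearranging the data, so in this simplified picture it would pay off only when the regrouped streams are read often enough to amortize that cost.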
INDEX TERMS
General-purpose computation on GPUs (GPGPUs), GPU computing, memory optimization, memory access pattern, vectorization, memory selection, memory coalescing, data parallelism, data-parallel architectures.
CITATION
Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli, "Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures", IEEE Transactions on Parallel & Distributed Systems, vol.22, no. 1, pp. 105-118, January 2011, doi:10.1109/TPDS.2010.107