Issue No.03 - March (2013 vol.24)
pp: 417-427
Yongpeng Zhang, North Carolina State University, Raleigh
Frank Mueller, North Carolina State University, Raleigh
ABSTRACT
This paper develops and evaluates search and optimization techniques for autotuning 3D stencil (nearest neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a most concise specification of stencil behavior from the user as a single formula, autogenerates tunable code from it, systematically searches for the best configuration and generates the code with optimal parameter configurations for different GPUs. This autotuning approach guarantees adaptive performance for different generations of GPUs while greatly enhancing programmer productivity. Experimental results show that the delivered floating point performance is very close to previous handcrafted work and outperforms other autotuned stencil codes by a large margin. Furthermore, heterogeneous GPU clusters are shown to exhibit the highest performance for dissimilar tuning parameters leveraging proportional partitioning relative to single-GPU performance.
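The abstract mentions proportional partitioning relative to single-GPU performance but does not spell out the scheme. As a minimal sketch, assuming the grid is cut into planes along one axis and each GPU receives a share proportional to its measured single-GPU throughput, the arithmetic might look like the following. The function name, the plane-wise decomposition, and the largest-remainder rounding are illustrative assumptions, not the authors' implementation.

```python
def proportional_partition(num_planes, throughputs):
    """Split num_planes grid planes among devices in proportion to throughput.

    Hypothetical helper: throughputs holds one measured single-GPU rate per
    device (any consistent unit, e.g. GFLOPS). Returns one plane count per device.
    """
    if num_planes < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need a non-negative plane count and positive throughputs")

    total = sum(throughputs)
    # Ideal (fractional) share of planes for each device.
    ideal = [num_planes * t / total for t in throughputs]
    # Start from the rounded-down shares.
    shares = [int(x) for x in ideal]

    # Give the leftover planes to the devices with the largest fractional parts.
    leftover = num_planes - sum(shares)
    by_remainder = sorted(
        range(len(throughputs)),
        key=lambda i: ideal[i] - shares[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# A device three times faster than its peer receives three times the planes.
print(proportional_partition(256, [3.0, 1.0]))  # [192, 64]
```

With such a split, faster and slower GPUs would finish a time step at roughly the same moment, which is the usual motivation for weighting the partition by measured performance rather than dividing the grid evenly.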
INDEX TERMS
Graphics processing unit, Arrays, Instruction sets, Kernel, Tuning, Three dimensional displays, Optimization, GPU clusters, Accelerators, GPGPU programming, stencil codes
CITATION
Yongpeng Zhang, Frank Mueller, "Autogeneration and Autotuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 3, pp. 417-427, March 2013, doi:10.1109/TPDS.2012.160