Issue No. 11 - Nov. 2012 (vol. 23), pp. 2045-2057
Jakub Kurzak , University of Tennessee, Knoxville
Stanimire Tomov , University of Tennessee, Knoxville
Jack Dongarra , University of Tennessee, Knoxville and University of Manchester
In recent years, graphics chips have been recognized as a viable means of accelerating scientific and engineering applications, even more so since NVIDIA introduced the Fermi architecture, which added features essential to numerical computing, such as fast double-precision arithmetic and memory protected by error-correcting codes. As a crucial component of numerical software packages such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the most important workloads to implement on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture through a canonical process of heuristic autotuning: generating multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is the method for generating the search space and, specifically, for pruning it to a manageable size. The resulting performance matches or exceeds that of other available implementations.
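The autotuning process the abstract describes (generate candidate kernel variants, prune the search space with hardware constraints, benchmark the survivors) can be sketched in miniature. The tile sizes, the Fermi-era resource limits, and the analytic cost model below are illustrative assumptions standing in for the paper's actual parameter space and on-GPU timing runs, not the authors' real numbers.

```python
# Hypothetical sketch of a GEMM autotuning loop: enumerate tilings,
# prune infeasible ones, rank the rest. All constants are assumptions.
from itertools import product

MAX_SHARED_MEM = 48 * 1024   # bytes of shared memory per SM (Fermi-era figure)
MAX_THREADS = 1024           # maximum threads per thread block


def generate_variants():
    """Enumerate (tile_m, tile_n, tile_k) candidates for C = A * B."""
    mn_sizes = [16, 32, 64, 96, 128]
    k_sizes = [8, 16, 32]
    return list(product(mn_sizes, mn_sizes, k_sizes))


def is_feasible(tm, tn, tk):
    """Prune variants exceeding shared-memory or thread-count limits."""
    # Tiles of A (tm x tk) and B (tk x tn) staged in shared memory, fp64.
    shared_bytes = (tm * tk + tk * tn) * 8
    # Assume each thread computes a 4x4 sub-block of the output tile.
    threads = (tm // 4) * (tn // 4)
    return shared_bytes <= MAX_SHARED_MEM and 0 < threads <= MAX_THREADS


def score(tm, tn, tk):
    """Stand-in cost model: flops per byte of shared-memory traffic.
    A real autotuner times each generated kernel on the GPU instead."""
    flops = 2 * tm * tn * tk
    traffic = (tm * tk + tk * tn) * 8
    return flops / traffic


feasible = [v for v in generate_variants() if is_feasible(*v)]
best = max(feasible, key=lambda v: score(*v))
print(f"{len(feasible)} of {len(generate_variants())} variants survive "
      f"pruning; best tiling: {best}")
```

The point of the pruning step, as in the paper, is that resource constraints eliminate many candidates before any benchmarking happens, keeping the empirical search tractable.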
Graphics processing unit, instruction sets, kernel, computer architecture, registers, hardware, CUDA, matrix multiplication, code generation, automatic tuning, GEMM, BLAS
Jakub Kurzak, Stanimire Tomov, Jack Dongarra, "Autotuning GEMM Kernels for the Fermi GPU", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 11, pp. 2045-2057, Nov. 2012, doi:10.1109/TPDS.2011.311