Autotuning GEMM Kernels for the Fermi GPU
Nov. 2012 (vol. 23 no. 11)
pp. 2045-2057
Jakub Kurzak, University of Tennessee, Knoxville
Stanimire Tomov, University of Tennessee, Knoxville
Jack Dongarra, University of Tennessee, Knoxville and University of Manchester
In recent years, graphics chips have been recognized as a viable way of accelerating scientific and engineering applications, even more so since NVIDIA introduced the Fermi architecture, which offers features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. As a crucial component of numerical software packages such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to implement on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture through a canonical process of heuristic autotuning: generating multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is the method for generating the search space, specifically for pruning it to a manageable size. Performance numbers match or exceed those of other available implementations.
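The generate, prune, and benchmark loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' generator: the parameter names (`blk_m`, `blk_n`, `blk_k`, `dim_x`, `dim_y`), the candidate value lists, and the pruning rules are assumptions chosen for illustration. Only the hardware limits are taken from the published Fermi (compute capability 2.x) specifications. The `benchmark` argument is a hypothetical callable that would compile and time a real kernel variant.

```python
import itertools

# Published limits for Fermi (compute capability 2.x).
MAX_THREADS_PER_BLOCK = 1024
MAX_SHARED_BYTES = 48 * 1024
MAX_REGS_PER_THREAD = 63


def generate_candidates(elem_bytes=8):
    """Enumerate hypothetical GEMM tiling variants and prune infeasible ones.

    Returns (raw_count, kept), where kept is a list of
    (blk_m, blk_n, blk_k, dim_x, dim_y) tuples.
    """
    tile_sizes = [16, 32, 64, 96, 128]   # assumed values of blk_m and blk_n
    k_sizes = [8, 16, 32]                # assumed values of blk_k
    thread_dims = [8, 16, 32]            # assumed values of dim_x and dim_y

    raw_count = 0
    kept = []
    for blk_m, blk_n, blk_k, dim_x, dim_y in itertools.product(
        tile_sizes, tile_sizes, k_sizes, thread_dims, thread_dims
    ):
        raw_count += 1
        # The thread block must fit the hardware limit.
        if dim_x * dim_y > MAX_THREADS_PER_BLOCK:
            continue
        # Each thread must own a whole number of output elements.
        if blk_m % dim_x or blk_n % dim_y:
            continue
        # One tile of A and one tile of B must fit in shared memory.
        shared_bytes = (blk_m * blk_k + blk_k * blk_n) * elem_bytes
        if shared_bytes > MAX_SHARED_BYTES:
            continue
        # Accumulators alone must fit in the register budget
        # (an 8-byte double occupies two 32-bit registers).
        acc_regs = (blk_m // dim_x) * (blk_n // dim_y) * (elem_bytes // 4)
        if acc_regs > MAX_REGS_PER_THREAD:
            continue
        kept.append((blk_m, blk_n, blk_k, dim_x, dim_y))
    return raw_count, kept


def select_fastest(candidates, benchmark):
    """Return the candidate with the smallest time reported by benchmark.

    benchmark is a user-supplied callable taking a candidate tuple and
    returning elapsed seconds; here it is a hypothetical placeholder.
    """
    return min(candidates, key=benchmark)
```

In a real autotuner, `benchmark` would generate the kernel source for a candidate, compile it, and time it on the device. The pruning step matters because only the variants that pass it need to be compiled and run.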
Index Terms:
Graphics processing unit, Instruction sets, Kernel, Computer architecture, Registers, Hardware, CUDA, matrix multiplication, code generation, automatic tuning, GEMM, BLAS
Citation:
Jakub Kurzak, Stanimire Tomov, Jack Dongarra, "Autotuning GEMM Kernels for the Fermi GPU," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, pp. 2045-2057, Nov. 2012, doi:10.1109/TPDS.2011.311