2014 23rd International Conference on Parallel Architecture and Compilation (PACT) (2014)
Aug. 23, 2014 to Aug. 27, 2014
Alberto Magni, School of Informatics, University of Edinburgh, United Kingdom
Christophe Dubach, School of Informatics, University of Edinburgh, United Kingdom
Michael O'Boyle, School of Informatics, University of Edinburgh, United Kingdom
OpenCL has been designed to achieve functional portability across multi-core devices from different vendors. However, the lack of a single cross-target optimizing compiler severely limits the performance portability of OpenCL programs. Programmers need to manually tune applications for each specific device, preventing effective portability. We target a compiler transformation specific to data-parallel languages, thread coarsening, and show that it can improve performance across different GPU devices. We then address the problem of selecting the best value for the coarsening factor parameter, i.e., deciding how many threads to merge together. We experimentally show that this is a hard problem to solve: good configurations are difficult to find, and naive coarsening in fact leads to substantial slowdowns. We propose a solution based on a machine-learning model that predicts the best coarsening factor using static features of the kernel function. The model automatically specializes to the different architectures considered. We evaluate our approach on 17 benchmarks on four devices: two Nvidia GPUs and two different generations of AMD GPUs. Using our technique, we achieve speedups between 1.11× and 1.33× on average.
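To make the coarsening factor concrete, the following minimal Python sketch emulates a one-dimensional data-parallel launch and compares a kernel with its coarsened counterpart. It is an illustration only, not the paper's compiler pass: the names `run_kernel`, `square`, and `square_coarsened` are hypothetical, the factor of 4 is arbitrary, and the strided indexing is an assumed design choice (neighbouring work-items touch neighbouring elements, which is commonly preferred on GPUs).

```python
def run_kernel(kernel, global_size, *args):
    """Emulate a 1-D launch: call the kernel once per work-item id."""
    for gid in range(global_size):
        kernel(gid, *args)


def square(gid, src, dst):
    # Original kernel: each work-item squares exactly one element.
    dst[gid] = src[gid] * src[gid]


def square_coarsened(gid, src, dst, factor):
    # Coarsened kernel: each work-item does the work of `factor` original
    # work-items. Assumes `factor` divides len(src) evenly.
    n_threads = len(src) // factor
    for k in range(factor):
        # Strided access (assumption): work-item gid handles elements
        # gid, gid + n_threads, gid + 2 * n_threads, ...
        i = gid + k * n_threads
        dst[i] = src[i] * src[i]


n = 16
src = list(range(n))

# Baseline launch: n work-items, one element each.
baseline = [0] * n
run_kernel(square, n, src, baseline)

# Coarsened launch: n / factor work-items, `factor` elements each.
factor = 4  # hypothetical coarsening factor
coarse = [0] * n
run_kernel(square_coarsened, n // factor, src, coarse, factor)
```

Both launches compute the same result; only the number of work-items and the amount of work per work-item change. Choosing `factor` is the tuning problem the abstract describes.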
Benchmark testing, Instruction sets, Graphics processing units, Kernel, Performance evaluation, Hardware
Alberto Magni, Christophe Dubach, Michael O'Boyle, "Automatic optimization of thread-coarsening for graphics processors", 2014 23rd International Conference on Parallel Architecture and Compilation (PACT), pp. 455-466, 2014, doi:10.1145/2628071.2628087