2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) (2014)
Hsinchu, Taiwan
Dec. 16, 2014 to Dec. 19, 2014
ISBN: 978-1-4799-7615-7
pp: 534-541
Nick Chaimov, Department of Computer and Information Science, University of Oregon, Eugene, 97403, USA
Boyana Norris, Department of Computer and Information Science, University of Oregon, Eugene, 97403, USA
Allen Malony, Department of Computer and Information Science, University of Oregon, Eugene, 97403, USA
ABSTRACT
Producing high-performance implementations from simple, portable computation specifications is a challenge that compilers have tried to address for several decades. More recently, a relatively stable architectural landscape has evolved into a set of increasingly diverging and rapidly changing CPU and accelerator designs, with the main common factor being dramatic increases in the levels of parallelism available. The growth of architectural heterogeneity and parallelism, combined with the very slow development cycles of traditional compilers, has motivated the development of autotuning tools that can quickly respond to changes in architectures and programming models, and enable very specialized optimizations that are not possible or likely to be provided by mainstream compilers. In this paper we describe the new OpenCL code generator and autotuner OrCL and the introduction of detailed performance measurement into the autotuning process. OrCL is implemented within the Orio autotuning framework, which enables the rapid development of experimental languages and code optimization strategies aimed at achieving good performance on new platforms without rewriting or hand-optimizing critical kernels. The combination of the new OpenCL autotuning and TAU measurement capabilities enables users to consistently evaluate autotuning effectiveness across a range of architectures, including several NVIDIA and AMD accelerators and Intel Xeon Phi processors, and to compare the OpenCL and CUDA code generation capabilities. We present results of autotuning several numerical kernels that typically dominate the execution time of iterative sparse linear system solution and key computations from a 3-D parallel simulation of solid fuel ignition.
INDEX TERMS
Kernel, Graphics processing units, Performance evaluation, Optimization, Computer architecture, Generators, Hardware
CITATION

N. Chaimov, B. Norris and A. Malony, "Toward multi-target autotuning for accelerators," 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Hsinchu, Taiwan, 2014, pp. 534-541.
doi:10.1109/PADSW.2014.7097851