Cluster Computing and the Grid, IEEE International Symposium on (2013)
Delft, Netherlands
May 13, 2013 to May 16, 2013
OpenACC is a new accelerator programming interface that provides a set of OpenMP-like loop directives for programming accelerators in an implicit and portable way. It allows the programmer to express the offloading of data and computations to accelerators, so that porting legacy CPU-based applications can be significantly simplified. This paper focuses on the performance aspects of OpenACC using two microbenchmarks and one real-world computational fluid dynamics application. Both evaluations show that, in general, OpenACC performance is approximately 50% lower than that of CUDA. However, for some applications it can reach up to 98% of CUDA performance with careful manual optimizations. The results also indicate several limitations of the OpenACC specification that hamper full use of the GPU hardware resources, resulting in a significant performance gap when compared to fully tuned CUDA code. In particular, the lack of a programming interface for shared memory results in as much as three times lower performance.
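To illustrate what an "OpenMP-like loop directive" looks like in practice, the following is a minimal sketch of a SAXPY loop annotated with OpenACC. It is not taken from the paper: the function name `saxpy_acc` and the choice of SAXPY are assumptions made for illustration only. The `copyin` and `copy` clauses express the data offloading and the `parallel loop` directive expresses the computation offloading. A compiler without OpenACC support ignores the pragma and runs the loop on the CPU, which reflects the portability the abstract describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example (not from the paper): y = a * x + y.
 * With an OpenACC-enabled compiler the loop is offloaded to the accelerator;
 * otherwise the pragma is ignored and the loop runs sequentially on the CPU. */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    assert(x != NULL && y != NULL);

    /* copyin: x is transferred to the device only.
     * copy:   y is transferred to the device and back to the host. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

A caller would invoke it as `saxpy_acc(n, 2.0f, x, y)` on ordinary host arrays. An equivalent CUDA version would require an explicit kernel, explicit device allocations, and explicit memory copies, but would also allow hand-managed `__shared__` buffers, the kind of control the abstract reports as missing from the OpenACC interface.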
CUDA, GPU, OpenACC
T. Hoshino, N. Maruyama, S. Matsuoka and R. Takaki, "CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application," 2013 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Delft, 2013, pp. 136-143.