2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW) (2015)
May 25, 2015 to May 29, 2015
Scientific applications need to be moved among supercomputers, such as Tianhe-2 and TSUBAME 2.5. OpenACC provides a directive-based approach that allows a single source code base to be functionally portable across the different accelerators used in these supercomputers. However, performance portability is not guaranteed by the OpenACC standard. Therefore, we propose a systematic optimization method, instead of auto-tuning by compilers, to achieve reasonable portable performance with minor code modifications. With this method, we evaluate four kernels from the Rodinia benchmark suite and one mini-application, Hydro, on our hybrid "CPU+GPU+MIC" supercomputer π with the CAPS and PGI compilers. We analyze the Parallel Thread Execution (PTX) code to further understand performance portability, and find that CAPS adopts a different strategy from PGI in thread distribution. The evaluation results show that the optimized OpenACC versions can achieve a better performance portability ratio than the OpenCL version in some cases. This understanding and the method are valuable for OpenACC application developers to use the available OpenACC compilers efficiently and correctly.
Graphics processing units, Microwave integrated circuits, Optimization, Kernel, Supercomputers, Instruction sets, Standards
S. Sawadsitang, J. Lin, S. See, F. Bodin and S. Matsuoka, "Understanding Performance Portability of OpenACC for Supercomputers," 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), Hyderabad, India, 2015, pp. 699-707.