2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS) (2015)
Dec. 14, 2015 to Dec. 17, 2015
Toshio Endo, Tokyo Inst. of Technol., Tokyo, Japan
Yuki Takasaki, Tokyo Inst. of Technol., Tokyo, Japan
Satoshi Matsuoka, Tokyo Inst. of Technol., Tokyo, Japan
The deepening memory hierarchy on the road to exascale is becoming a serious problem for applications such as those based on stencil kernels, because it is difficult to satisfy high memory bandwidth and high capacity requirements simultaneously. This is evident even today: problem sizes of stencil-based applications on GPU supercomputers are limited by the aggregate capacity of GPU device memory. Locality improvement techniques such as temporal blocking are known to preserve performance, but integrating them into existing stencil applications substantially raises programming cost, especially for complex applications, and as a result they are not typically used. We alleviate this problem with a run-time GPU-MPI process virtualization library we call HHRT, which automates data movement across the memory hierarchy, and a systematic methodology for converting and optimizing code to accommodate temporal blocking. The proposed methodology is shown to significantly ease the adaptation of real applications, such as a whole-city airflow simulator comprising more than 12,000 lines of code; with careful tuning, we maintain up to 85% performance even for problems whose footprint is four times larger than GPU device memory capacity, and scale to hundreds of GPUs on the TSUBAME2.5 supercomputer.
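To make the idea of temporal blocking concrete, the following is a minimal sketch in Python of the general technique on a 1D three-point stencil. It is not taken from the paper and does not use HHRT; the function names (`step`, `naive`, `temporal_blocked`) and the parameters `block` and `bt` are hypothetical, chosen only for illustration. The assumption is that each block, together with a halo of `bt` cells on each side, is small enough to fit in fast memory, so that `bt` time steps can be applied to it before the next block is touched.

```python
def step(u):
    """One Jacobi step of a 3-point averaging stencil; the two end cells stay fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def naive(u, steps):
    """Reference version: sweep the whole domain once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def temporal_blocked(u, steps, block, bt):
    """Advance `steps` time steps, processing `bt` steps per block visit.

    `block` is the number of cells written back per block and `bt` is the
    temporal block size; both are illustrative (hypothetical) parameters.
    """
    n = len(u)
    u = list(u)
    done = 0
    while done < steps:
        t = min(bt, steps - done)      # time steps in this pass
        result = list(u)
        for start in range(0, n, block):
            end = min(start + block, n)
            # Load the block plus a halo of t cells per side (clipped at the
            # domain edges). In a GPU setting this would be the chunk copied
            # to device memory.
            lo = max(0, start - t)
            hi = min(n, end + t)
            local = u[lo:hi]
            # Apply t steps locally. Cells within t of an artificial cut
            # become stale, but those lie in the halo and are discarded.
            for _ in range(t):
                local = step(local)
            # Write back only the cells this block owns.
            result[start:end] = local[start - lo:end - lo]
        u = result
        done += t
    return u
```

Calling `temporal_blocked(u0, 10, 8, 3)` on a list `u0` should give the same values as `naive(u0, 10)`, while visiting each block once per three time steps instead of once per step. The cost is the redundant computation in the halo regions, which grows with `bt`.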
Atmospheric modeling, Computational modeling, Solid modeling, Conferences, Graphics processing units
T. Endo, Y. Takasaki and S. Matsuoka, "Realizing Extremely Large-Scale Stencil Applications on GPU Supercomputers," 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), Melbourne, Australia, 2016, pp. 625-632.