2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2018)
Vancouver, British Columbia, Canada
May 21, 2018 to May 25, 2018
GPUs can meet high computational demands, but only when code is developed carefully: the algorithms must suit the GPU architecture, and appropriate optimizations must be applied. This article presents a GPU-suitable algorithm and a tuning strategy for performing the scan primitive over large problem sizes in CUDA. The tuning strategy defines a set of performance premises that are used to find the GPU execution parameters that maximize performance. With these premises in mind, we develop the kernels from CUDA skeletons, which ensures efficiency and portability. Building on this, we describe an optimal proposal and analyze it on several multi-GPU environments; to the best of our knowledge, it is the first multi-GPU batch scan proposal. In most cases, the resulting implementations outperform other well-known libraries such as CUDPP, ModernGPU, Thrust, CUB and LightScan.
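For readers unfamiliar with the scan primitive, the following host-side C++ sketch shows what an inclusive scan (prefix sum) computes, together with a generic three-phase chunked decomposition of the kind commonly used to split a scan across independent workers. It is only an illustration under stated assumptions and is not the algorithm or the tuning strategy of the paper: the function names and the `chunk_size` parameter are hypothetical, and addition over `long long` stands in for an arbitrary associative operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: out[i] = in[0] + in[1] + ... + in[i].
std::vector<long long> inclusive_scan_ref(const std::vector<long long>& in) {
    std::vector<long long> out(in.size());
    long long acc = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        acc += in[i];
        out[i] = acc;
    }
    return out;
}

// Chunked scan (hypothetical helper). Each chunk could be handled by a
// separate worker, e.g. a thread block or a device; here the chunks are
// simply processed one after another on the host.
std::vector<long long> inclusive_scan_chunked(const std::vector<long long>& in,
                                              std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<long long> out(in.size());
    std::vector<long long> chunk_totals;

    // Phase 1: scan every chunk independently and record its total.
    for (std::size_t start = 0; start < in.size(); start += chunk_size) {
        const std::size_t end = std::min(start + chunk_size, in.size());
        long long acc = 0;
        for (std::size_t i = start; i < end; ++i) {
            acc += in[i];
            out[i] = acc;
        }
        chunk_totals.push_back(acc);
    }

    // Phase 2: exclusive scan of the chunk totals, turning each total
    // into the offset contributed by all preceding chunks.
    long long offset = 0;
    for (std::size_t c = 0; c < chunk_totals.size(); ++c) {
        const long long total = chunk_totals[c];
        chunk_totals[c] = offset;
        offset += total;
    }

    // Phase 3: add each chunk's offset to its local results.
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] += chunk_totals[i / chunk_size];
    }
    return out;
}
```

For the input 1, 2, 3, 4, 5 and a chunk size of 2, both functions return 1, 3, 6, 10, 15. Phases 1 and 3 are independent per chunk, so only the short scan over chunk totals in phase 2 requires coordination between workers.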
boundary scan testing, graphics processing units, parallel architectures
A. Perez Dieguez, M. Amor, R. Doallo, A. Nukada and S. Matsuoka, "Efficient Solving of Scan Primitive on Multi-GPU Systems," 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, British Columbia, Canada, 2018, pp. 794-803.