2015 IEEE 22nd International Conference on High Performance Computing (HiPC) (2015)
Bengaluru, India
Dec. 16, 2015 to Dec. 19, 2015
ISBN: 978-1-4673-8487-2
pp: 234-243
GPGPUs are becoming ubiquitous in high performance computing systems owing to their large compute capacity at a low power footprint. Together with high performance interconnects such as InfiniBand (IB), GPGPUs are paving the way for highly capable, energy-efficient distributed computing systems for scientific applications. GPGPUs are throughput devices that benefit immensely from latency hiding techniques. Thus, long-latency communication operations, such as MPI collectives that originate or finish in GPU memory, must be well hidden. Although popular message passing libraries offer non-blocking collective operations on host memory, no known work has explored them for GPU memory. In this work we propose, for the first time, non-blocking MPI collective operations from GPU memory and realize an efficient implementation by coupling CORE-Direct offload mechanisms with NVIDIA CUDA capabilities. In addition, we propose a novel way to avoid peer-to-peer limitations without explicit host processor intervention, maximizing overlap while keeping latency good. Our micro-benchmark evaluations show that the proposed designs can yield close to maximum overlap percentages for dense collective operations. Results with as many as 64 GPU nodes also show that, in the medium and large message ranges, the designs achieve latency comparable to that of the blocking variants of the collectives.
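The abstract reports "overlap percentages" without defining them. As a rough illustration only, the sketch below uses one common micro-benchmark convention: the share of pure communication time that is hidden behind independent computation. The function name, its parameters, and the formula are assumptions made for this example, not the authors' stated methodology.

```python
def overlap_percentage(t_comm_only, t_compute_only, t_overlapped):
    """Estimate communication/computation overlap as a percentage.

    Assumed (hypothetical) definition, not taken from the paper:
      t_comm_only    - time of the non-blocking collective with no compute
      t_compute_only - time of the compute phase alone
      t_overlapped   - time when the collective is started, the compute
                       phase runs, and the collective is then waited on
    The communication time still visible after overlapping is
    t_overlapped - t_compute_only; the rest counts as hidden.
    """
    if t_comm_only <= 0:
        raise ValueError("t_comm_only must be positive")
    # Communication time that could not be hidden behind computation.
    exposed = max(0.0, t_overlapped - t_compute_only)
    overlap = 100.0 * (1.0 - exposed / t_comm_only)
    # Clamp to [0, 100] to absorb timing noise.
    return max(0.0, min(100.0, overlap))
```

Under this assumed definition, a run in which the overlapped time equals the compute-only time gives 100% overlap, and a run in which it equals the sum of the two phases gives 0%.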
Graphics processing units, Peer-to-peer computing, Performance evaluation, Libraries, Programming, Throughput, Message passing
A. Venkatesh, K. Hamidouche, H. Subramoni, D. K. Panda, "Offloaded GPU Collectives Using CORE-Direct and CUDA Capabilities on InfiniBand Clusters", 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), pp. 234-243, 2015, doi:10.1109/HiPC.2015.50