2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) (2018)
Washington, DC, USA
May 1, 2018 to May 4, 2018
Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on GPU network topology and the communication library used. We present results from the OSU-micro benchmark as well as conduct a case study for sparse tensor factorization, one application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying number of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in the tensor data sets produce trends that contradict those in the OSU micro-benchmark, as well as trends that are absent from the benchmark.
application program interfaces, Big Data, data analysis, graphics processing units, message passing, multiprocessing systems, parallel architectures, performance evaluation, tensors
T. B. Rolinger, T. A. Simon and C. D. Krieger, "An Empirical Evaluation of Allgatherv on Multi-GPU Systems," 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, DC, USA, 2018, pp. 123-132.