Issue No. 01 - March (2016 vol. 2)
ISSN: 2332-7790
pp: 57-69
Hideyuki Shamoto, Department of Computing and Mathematical Sciences, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
Koichi Shirahata, Department of Computing and Mathematical Sciences, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
Aleksandr Drozd, Global Scientific Information and Computing Center, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
Hitoshi Sato, Global Scientific Information and Computing Center, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
ABSTRACT
Splitter-based parallel sorting algorithms are known to be highly efficient for distributed sorting because of their low communication complexity. Although GPU accelerators can reduce computation cost in general, their effectiveness in distributed sorting algorithms remains unclear. We investigate the applicability of GPU devices to splitter-based algorithms and extend HykSort, an existing splitter-based algorithm, by offloading its costly computation phases to GPUs. To cope with data volumes that exceed GPU memory capacity, we use an out-of-core local sort, which incurs a small overhead of about 7.5 percent when the data size is tripled. We evaluate the performance of our implementation on the TSUBAME2.5 supercomputer, which comprises over 4,000 NVIDIA K20x GPUs. Weak scaling analysis shows a 389-fold speedup, with 0.25 TB/s throughput, when sorting 4 TB of 64-bit integer values on 1,024 nodes compared to running on one node; this is 1.40 times faster than the reference CPU implementation. Detailed analysis, however, reveals that performance is mostly bottlenecked by the CPU-GPU host-to-device bandwidth. With orders-of-magnitude improvements announced for next-generation GPUs, we expect the performance gain to grow substantially, in line with other successful GPU accelerations.
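The abstract does not spell out how the out-of-core local sort works. As a rough illustration only, the sketch below assumes a common chunk-and-merge scheme: the input is cut into chunks small enough to fit in device memory, each chunk is sorted independently, and the sorted chunks are combined with a k-way merge. The function name `out_of_core_sort` and the parameter `device_capacity` are hypothetical, and Python's built-in `sorted` stands in for the GPU sort; none of this is taken from the paper's implementation.

```python
import heapq
from typing import List


def out_of_core_sort(keys: List[int], device_capacity: int) -> List[int]:
    """Sort `keys` while never sorting more than `device_capacity` items at once.

    Assumption: a chunk-and-merge scheme. The paper may use a different one.
    """
    if device_capacity <= 0:
        raise ValueError("device_capacity must be positive")

    sorted_chunks = []
    for start in range(0, len(keys), device_capacity):
        # One chunk is what would be copied to the device in a real system.
        chunk = keys[start:start + device_capacity]
        # Stand-in for the on-device sort; here it is an ordinary CPU sort.
        sorted_chunks.append(sorted(chunk))

    # k-way merge of the sorted chunks, done on the host.
    return list(heapq.merge(*sorted_chunks))


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2, 8]
    print(out_of_core_sort(data, device_capacity=3))
```

In a scheme like this, every chunk crosses the host-device link in both directions, which is consistent with the abstract's remark that host-to-device bandwidth dominates the cost.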
INDEX TERMS
Sorting, Graphics processing units, Data transfer, Performance evaluation, Bandwidth, Acceleration, Hypercubes
CITATION

H. Shamoto, K. Shirahata, A. Drozd, H. Sato and S. Matsuoka, "GPU-Accelerated Large-Scale Distributed Sorting Coping with Device Memory Capacity," in IEEE Transactions on Big Data, vol. 2, no. 1, pp. 57-69, 2016.
doi:10.1109/TBDATA.2015.2511001