2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) (2010)
Vienna, Austria
Sept. 11, 2010 to Sept. 15, 2010
ISBN: 978-1-5090-5032-1
pp: 545-546
Duane G. Merrill, Department of Computer Science, University of Virginia, USA
Andrew S. Grimshaw, Department of Computer Science, University of Virginia, USA
ABSTRACT
This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state of the art, our radix sorting methods exhibit speedups of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
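The link between radix sorting and prefix scans can be illustrated with a minimal sequential sketch. It is not the authors' GPU implementation: the 4-bit digit width and the names `exclusive_scan`, `radix_pass`, and `radix_sort_pairs` are assumptions chosen for illustration. Each pass runs one exclusive scan per digit value over 0/1 flags (a sequential stand-in for the concurrent "multi-scan"), and the scan results give every key-value pair its scatter destination.

```python
RADIX_BITS = 4               # assumed digit width, for illustration only
RADIX = 1 << RADIX_BITS      # number of distinct digit values per pass


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def radix_pass(keys, values, shift):
    """One stable distribution pass on the digit located at bit offset `shift`."""
    digits = [(k >> shift) & (RADIX - 1) for k in keys]

    # "Multi-scan": one exclusive scan per digit value over 0/1 flags.
    # ranks[b][i] counts the earlier keys whose current digit equals b.
    ranks = [exclusive_scan([1 if d == b else 0 for d in digits])
             for b in range(RADIX)]

    # Scanning the per-digit totals gives each bucket's base offset.
    totals = [digits.count(b) for b in range(RADIX)]
    bases = exclusive_scan(totals)

    out_keys = [0] * len(keys)
    out_vals = [None] * len(keys)
    for i, d in enumerate(digits):
        dest = bases[d] + ranks[d][i]   # scatter address from the scans
        out_keys[dest] = keys[i]
        out_vals[dest] = values[i]
    return out_keys, out_vals


def radix_sort_pairs(keys, values, key_bits=32):
    """Least-significant-digit radix sort of non-negative fixed-width keys."""
    for shift in range(0, key_bits, RADIX_BITS):
        keys, values = radix_pass(keys, values, shift)
    return keys, values
```

On a GPU the per-digit scans would run concurrently and the digit extraction and scatter steps would be fused into the scan kernel; here they are separate loops purely for readability.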
INDEX TERMS
kernel fusion, GPU, sorting, radix sorting, prefix scan
CITATION
Duane G. Merrill, Andrew S. Grimshaw, "Revisiting sorting for GPGPU stream architectures", 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 545-546, 2010.