2018 IEEE International Conference on Cluster Computing (CLUSTER) (2018)
Belfast, United Kingdom
Sep 10, 2018 to Sep 13, 2018
ISSN: 2168-9253
ISBN: 978-1-5386-8319-4
pp: 482-493
The two-dimensional Summed Area Table (SAT) is a fundamental primitive used in image processing and machine learning applications. We present a collection of optimization methods for computing SAT on CUDA-enabled GPUs. Conventional approaches rely on computing the prefix sum in one dimension in parallel, transposing the matrix, then computing the prefix sum for the other dimension in parallel. Additionally, conventional methods use the scratchpad memory as a cache. We propose a collection of algorithms that are scalable with respect to problem size. We use the register cache technique instead of the scratchpad memory and also employ a naive serial scan on the thread level for computing the prefix sum for one of the dimensions. Using a novel transpose-in-registers method, we increase the inter-thread parallelism and outperform conventional SAT implementations. In addition, we significantly reduce both the communication between threads and the number of arithmetic instructions. On Nvidia Pascal P100 and Volta V100 GPUs, our evaluations show that our implementations outperform state-of-the-art libraries, yielding speedups of up to 2.3x over OpenCV and 3.2x over Nvidia NPP.
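As a rough illustration of the conventional scan-transpose-scan approach mentioned in the abstract, the following is a minimal serial sketch in plain Python. It is not the authors' GPU implementation and uses no CUDA features; the function names (prefix_sum_rows, transpose, summed_area_table, rect_sum) and the 3x3 sample matrix are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def prefix_sum_rows(matrix):
    """Inclusive prefix sum along each row (one serial scan per row)."""
    return [list(accumulate(row)) for row in matrix]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


def summed_area_table(matrix):
    """Scan the rows, transpose, scan the rows again, transpose back."""
    scanned = prefix_sum_rows(matrix)
    scanned = prefix_sum_rows(transpose(scanned))
    return transpose(scanned)


def rect_sum(sat, top, left, bottom, right):
    """Sum of the inclusive rectangle [top..bottom] x [left..right],
    read from the table with at most four lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


# Hypothetical sample input, for illustration only.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = summed_area_table(image)
```

Once the table is built, the sum over any axis-aligned rectangle costs a constant number of lookups, which is why SAT is useful as a primitive. On a GPU, each of the two scans would be carried out in parallel across rows rather than in a Python loop.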
cache storage, coprocessors, data structures, graphics processing units, multi-threading, optimisation, parallel architectures, storage management

P. Chen, M. Wahib, S. Takizawa, R. Takano and S. Matsuoka, "Efficient Algorithms for the Summed Area Tables Primitive on GPUs," 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, United Kingdom, 2018, pp. 482-493.