Issue No.12 - Dec. (2013 vol.24)
pp: 2324-2333
Sang-Won Ha, Yonsei University, Seoul
Tack-Don Han, Yonsei University, Seoul
The parallel scan is a basic tool used to parallelize algorithms that appear to have serial dependencies. The performance of these algorithms relies heavily on the efficiency of the parallel scan being used. To maintain work efficiency, current parallelization methods either sacrifice the overall depth or limit the scalability. In this study, we present a parallel scan method that is derived from the Han-Carlson parallel prefix graph and is both work-efficient and depth-optimal. In this method, the depth is increased by a small constant value above the lower bound; therefore, the amount of computation and/or memory access is effectively reduced. We also employ a novel cascaded thread-block execution method to exploit the single-program-multiple-data (SPMD) nature of the compute unified device architecture (CUDA) environment developed by NVIDIA. The proposed method takes advantage of the low-latency, interthread-accessible shared memory and the single-instruction-multiple-thread (SIMT) characteristics of the graphics hardware to reduce high-latency global memory access and costly barrier synchronization. Our experimental results demonstrate average speedups of approximately 40 percent and 10 percent over, respectively, the CUDA data parallel primitives (CUDPP) library derivation of the Kogge-Stone prefix tree and an implementation of Merrill and Grimshaw's method with a coarser combination of the Kogge-Stone graph and the Brent-Kung prefix graph.
Graphics processing units, Instruction sets, Complexity theory, Algorithm design and analysis, GPGPU, Parallel scan, Prefix sum, Han-Carlson adder, High-performance computing
Sang-Won Ha, Tack-Don Han, "A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 12, pp. 2324-2333, Dec. 2013, doi:10.1109/TPDS.2012.336
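The abstract names the Han-Carlson prefix graph as the basis of the scan. As a rough illustration of that graph's structure only, the following is a minimal sequential sketch in Python. It is not the authors' CUDA implementation and does not model thread blocks, shared memory, or the cascaded execution scheme. The function name `han_carlson_scan` and its return convention are hypothetical, chosen for this example. Each loop corresponds to one level of the graph, and all updates within a level are independent of one another.

```python
from operator import add


def han_carlson_scan(values, op=add):
    """Inclusive scan laid out like a Han-Carlson prefix graph (illustrative).

    `op` must be associative; operand order is preserved, so it need not be
    commutative. Returns (result_list, depth), where depth is the number of
    graph levels this sketch executed.
    """
    x = list(values)
    n = len(x)
    depth = 0

    # First level: every odd position absorbs its left (even) neighbour.
    if n > 1:
        for i in range(1, n, 2):
            x[i] = op(x[i - 1], x[i])
        depth += 1

    # Middle levels: Kogge-Stone style doubling, restricted to odd positions.
    d = 2
    while d + 1 < n:
        prev = x[:]  # snapshot so every update reads pre-level values
        for i in range(d + 1, n, 2):
            x[i] = op(prev[i - d], prev[i])
        depth += 1
        d *= 2

    # Last level: every even position (except 0) takes the finished
    # prefix stored at the odd position on its left.
    if n > 2:
        for i in range(2, n, 2):
            x[i] = op(x[i - 1], x[i])
        depth += 1

    return x, depth


if __name__ == "__main__":
    result, depth = han_carlson_scan([1, 2, 3, 4, 5, 6, 7, 8])
    print(result, depth)
```

For a power-of-two input length n, this sketch runs log2(n) + 1 levels, one more than a pure Kogge-Stone layout would need, while only the odd positions take part in the doubling levels.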