Issue No. 03, March 2014 (vol. 25)
pp. 550-559
Jing Wu, University of Maryland, College Park
Joseph JaJa, University of Maryland, College Park
Elias Balaras, George Washington University, Washington, DC
We present a highly multithreaded FFT-based direct Poisson solver that makes effective use of the capabilities of current NVIDIA graphics processing units (GPUs). Our algorithms carefully manage the multiple layers of the GPU memory hierarchy so that almost all global memory accesses are coalesced into 128-byte device memory transactions, and all computations are carried out directly in registers. A new strategy that interleaves the FFT computation along each dimension with other computations minimizes the total number of accesses to the 3D grid. We illustrate the performance of our algorithms on the NVIDIA Tesla and Fermi architectures for a wide range of grid sizes, up to the largest size that fits in device memory ($512\times 512\times 512$ on the Tesla C1060/C2050 and $512\times 256\times 256$ on the GeForce GTX 280/480). We achieve up to 140 GFLOPS and a bandwidth of 70 GB/s on the Tesla C1060, and up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. The performance of our algorithms is superior to what can be achieved using the CUDA FFT library in combination with well-known parallel algorithms for solving tridiagonal linear systems of equations.
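To make the phrase "FFT-based direct Poisson solver" concrete, the following is a minimal serial sketch of one common form of the method: transform along a periodic direction, solve one independent tridiagonal system per wavenumber along the remaining direction, then transform back. It is not the authors' GPU implementation. The 2D grid, the periodic boundary in x, the homogeneous Dirichlet boundary in z, the second-order central differences, and every function name (`fft`, `thomas`, `solve_poisson`, `discrete_laplacian`) are assumptions chosen for illustration.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def thomas(a, b, c, d):
    """Thomas algorithm for a tridiagonal system with constant
    sub-diagonal a, diagonal b, super-diagonal c and right-hand side d."""
    n = len(d)
    cp = [0j] * n
    dp = [0j] * n
    cp[0] = c / b
    dp[0] = d[0] / b
    for k in range(1, n):
        denom = b - a * cp[k - 1]
        cp[k] = c / denom
        dp[k] = (d[k] - a * dp[k - 1]) / denom
    x = [0j] * n
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k] * x[k + 1]
    return x


def solve_poisson(f, hx, hz):
    """Solve the discrete Poisson equation lap(u) = f, where f[i][k] is
    periodic in i (x direction) and u = 0 just outside the k range (z)."""
    nx, nz = len(f), len(f[0])
    # Step 1: forward FFT along x for every z index.
    f_hat = [[0j] * nz for _ in range(nx)]
    for k in range(nz):
        col = fft([f[i][k] for i in range(nx)])
        for m in range(nx):
            f_hat[m][k] = col[m]
    # Step 2: one independent tridiagonal solve along z per wavenumber m.
    u_hat = []
    for m in range(nx):
        # Eigenvalue of the periodic second-difference operator in x.
        lam = (2.0 * math.cos(2.0 * math.pi * m / nx) - 2.0) / hx ** 2
        off = 1.0 / hz ** 2
        u_hat.append(thomas(off, lam - 2.0 * off, off, f_hat[m]))
    # Step 3: inverse FFT along x, including the 1/nx normalization.
    u = [[0.0] * nz for _ in range(nx)]
    for k in range(nz):
        col = fft([u_hat[m][k] for m in range(nx)], inverse=True)
        for i in range(nx):
            u[i][k] = col[i].real / nx
    return u


def discrete_laplacian(u, hx, hz):
    """Apply the same 5-point stencil; used only to check the solver."""
    nx, nz = len(u), len(u[0])
    out = [[0.0] * nz for _ in range(nx)]
    for i in range(nx):
        for k in range(nz):
            below = u[i][k - 1] if k > 0 else 0.0
            above = u[i][k + 1] if k < nz - 1 else 0.0
            d2x = (u[(i + 1) % nx][k] - 2.0 * u[i][k] + u[i - 1][k]) / hx ** 2
            d2z = (above - 2.0 * u[i][k] + below) / hz ** 2
            out[i][k] = d2x + d2z
    return out
```

One way to check the sketch is to pick a grid function `u_true`, form `f = discrete_laplacian(u_true, hx, hz)`, and confirm that `solve_poisson(f, hx, hz)` recovers `u_true` to rounding error. In this formulation the per-wavenumber tridiagonal solves are independent of one another, which is the kind of parallelism a GPU version would exploit.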
Graphics processing units, Instruction sets, Kernel, Computer architecture, Vectors, Linear systems, Equations, elliptic equations, Fast-Fourier transforms, parallel and vector implementations
Jing Wu, Joseph JaJa, Elias Balaras, "An Optimized FFT-Based Direct Poisson Solver on CUDA GPUs", IEEE Transactions on Parallel & Distributed Systems, vol. 25, no. 3, pp. 550-559, March 2014, doi:10.1109/TPDS.2013.53