Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid
January 2011 (vol. 22 no. 1)
pp. 22-32
Dominik Göddeke, TU Dortmund, Dortmund
Robert Strzodka, Max Planck Institut Informatik, Saarbrücken
We have previously proposed mixed-precision iterative solvers tailored to the sparse linear systems that typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated on a number of hardware platforms, in particular on single-precision GPUs used as accelerators to the general-purpose CPU. This paper reevaluates the situation with new mixed-precision solvers that run entirely on the GPU: we demonstrate that mixed-precision schemes deliver a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother, we extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from discretization on anisotropic meshes, which previously had to be solved on the CPU. The resulting mixed-precision schemes are always faster than double precision alone and consistently outperform tuned CPU solvers by almost an order of magnitude.
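The two ideas named in the abstract, cyclic reduction and mixed-precision iterative refinement, can be illustrated with a small serial sketch. This is not the authors' GPU implementation: it is plain Python, the function names (`cyclic_reduction`, `mixed_precision_solve`, `residual`, `to_f32`) are hypothetical, and single precision is only emulated by rounding the inner solver's inputs and output to 32-bit floats. The inner solver here is a direct cyclic reduction solve of one tridiagonal system, whereas the paper uses cyclic reduction as a smoother inside multigrid. Each row reads `a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]`, with `a[0] = c[-1] = 0` assumed.

```python
import struct


def to_f32(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def residual(a, b, c, d, x):
    """Return d - A x for the tridiagonal matrix A, computed in double precision."""
    n = len(b)
    r = []
    for i in range(n):
        ax = b[i] * x[i]
        if i > 0:
            ax += a[i] * x[i - 1]
        if i + 1 < n:
            ax += c[i] * x[i + 1]
        r.append(d[i] - ax)
    return r


def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive cyclic reduction (no pivoting)."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    ra, rb, rc, rd = [], [], [], []
    # Forward reduction: fold rows i-1 and i+1 into every odd row i, which
    # eliminates the even-indexed unknowns. The rows are independent, which
    # is what makes this step attractive for parallel hardware.
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            a_next, c_next, d_next = a[i + 1], c[i + 1], d[i + 1]
        else:
            gamma = a_next = c_next = d_next = 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - gamma * a_next)
        rc.append(-gamma * c_next)
        rd.append(d[i] - alpha * d[i - 1] - gamma * d_next)
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)  # half-size system
    # Back substitution: recover the even-indexed unknowns from their rows.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


def mixed_precision_solve(a, b, c, d, tol=1e-12, max_iter=20):
    """Iterative refinement: double-precision residuals, low-precision corrections.

    Assumption: low precision is emulated by rounding the matrix, the
    right-hand side, and the correction to float32; the arithmetic inside
    cyclic_reduction itself still runs in double.
    """
    lo = [[to_f32(v) for v in diag] for diag in (a, b, c)]
    scale = max(abs(v) for v in d)
    x = [0.0] * len(b)
    for sweeps in range(max_iter + 1):
        r = residual(a, b, c, d, x)
        if max(abs(v) for v in r) <= tol * scale or sweeps == max_iter:
            break
        corr = cyclic_reduction(*lo, [to_f32(v) for v in r])
        x = [xi + to_f32(ci) for xi, ci in zip(x, corr)]
    return x, sweeps


if __name__ == "__main__":
    n = 31
    a = [0.0] + [-1.0] * (n - 1)
    b = [2.1] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [1.0] * n
    x, sweeps = mixed_precision_solve(a, b, c, d)
    print(sweeps, max(abs(v) for v in residual(a, b, c, d, x)))
```

The outer loop only needs a double-precision residual and a vector update per sweep, so most of the work stays in the low-precision solver; this is the general reason mixed-precision refinement can recover double-precision accuracy at closer to single-precision cost.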

Index Terms:
GPU Computing, mixed-precision iterative refinement, multigrid, tridiagonal solvers, cyclic reduction, finite elements, NVIDIA CUDA.
Dominik Göddeke, Robert Strzodka, "Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 22-32, Jan. 2011, doi:10.1109/TPDS.2010.61