Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization
September 2008 (vol. 19 no. 9)
pp. 1175-1186
The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time, it presents new challenges for the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization across the short-vector SIMD cores. The first challenge is addressed by applying the well-known technique of iterative refinement to the solution of a dense symmetric positive definite system of linear equations, resulting in a mixed-precision algorithm that delivers double precision accuracy while performing the bulk of the work in single precision. The main contribution of this paper lies in addressing the second challenge through successful thread-level parallelization, exploiting fine-grained task granularity and lightweight decentralized synchronization. The implementation of the computationally intensive sections reaches within 90 percent of peak floating point performance, while the implementation of the memory intensive sections reaches within 90 percent of peak memory bandwidth. On a single CELL processor, the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double precision accuracy.
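
The mixed-precision idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's CELL implementation: it is a small, unblocked, serial Python illustration in which single precision is emulated by rounding every intermediate result through a 32-bit float. The function names (`f32`, `cholesky_f32`, `solve_f32`, `refine_solve`) and the stopping rule with its `tol` and `max_iter` parameters are assumptions made for this example. The factorization and triangular solves run in (emulated) single precision, while the residual and the solution update are kept in double precision.

```python
import math
import struct


def f32(x):
    """Round a double to the nearest IEEE single precision value (emulation)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cholesky_f32(a):
    """Lower-triangular factor L with A ~= L L^T, every operation rounded to single."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = f32(a[j][j])
        for k in range(j):
            s = f32(s - f32(l[j][k] * l[j][k]))
        l[j][j] = f32(math.sqrt(s))
        for i in range(j + 1, n):
            s = f32(a[i][j])
            for k in range(j):
                s = f32(s - f32(l[i][k] * l[j][k]))
            l[i][j] = f32(s / l[j][j])
    return l


def solve_f32(l, b):
    """Solve L L^T x = b by forward and back substitution in single precision."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # forward substitution: L y = b
        s = f32(b[i])
        for k in range(i):
            s = f32(s - f32(l[i][k] * y[k]))
        y[i] = f32(s / l[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution: L^T x = y
        s = y[i]
        for k in range(i + 1, n):
            s = f32(s - f32(l[k][i] * x[k]))
        x[i] = f32(s / l[i][i])
    return x


def residual(a, x, b):
    """Residual r = b - A x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine_solve(a, b, max_iter=10, tol=1e-14):
    """Mixed-precision solve of A x = b for symmetric positive definite A.

    The stopping rule (tol, max_iter) is an assumption of this sketch.
    """
    l = cholesky_f32(a)                     # O(n^3) work in single precision
    x = solve_f32(l, b)                     # initial single precision solution
    for _ in range(max_iter):
        r = residual(a, x, b)               # O(n^2) work in double precision
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = solve_f32(l, r)                 # correction reuses the single precision factor
        x = [xi + di for xi, di in zip(x, d)]   # update accumulated in double
    return x
```

For a well-conditioned matrix, calling `refine_solve(a, b)` with `a` given as a list of rows and `b` as a list typically recovers the solution to near double precision accuracy after a few refinement steps, even though the cubic-cost factorization was done once in single precision.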

Index Terms:
Parallel algorithms, Numerical Linear Algebra, Linear systems
Citation:
Jakub Kurzak, Alfredo Buttari, Jack Dongarra, "Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization," IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 9, pp. 1175-1186, Sept. 2008, doi:10.1109/TPDS.2007.70813