Issue No. 04 - April (2010 vol. 21)

pp. 417-423

Hatem Ltaief, University of Tennessee, Knoxville

Jakub Kurzak, University of Tennessee, Knoxville

Jack Dongarra, University of Tennessee, Knoxville

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2009.79

ABSTRACT

The objective of this paper is to extend, in the context of multicore architectures, the concepts of tile algorithms [Buttari et al., 2007] for Cholesky, LU, and QR factorizations to the family of two-sided factorizations. In particular, the bidiagonal reduction of a general, dense matrix is very often used as a preprocessing step for calculating the Singular Value Decomposition. Furthermore, in the Top500 list of June 2008, 98 percent of the fastest parallel systems in the world were based on multicores. This confronts the scientific software community with both a daunting challenge and a unique opportunity. The challenge arises from the disturbing mismatch between the design of systems based on this new chip architecture (hundreds of thousands of nodes, a million or more cores, reduced bandwidth and memory available to cores) and the components of the traditional software stack, such as numerical libraries, on which scientific applications have relied for their accuracy and performance. The many-core trend has exacerbated the problem even further, and it becomes critical to efficiently integrate existing or new numerical linear algebra algorithms suitable for such hardware. By exploiting the concept of tile algorithms in the multicore environment (i.e., high level of parallelism with fine granularity and high-performance data representation combined with a dynamic data-driven execution), the band bidiagonal reduction presented here achieves 94 Gflop/s on a 12,000 × 12,000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for the bidiagonal reduction is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrix to the required form.
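To make the idea of a band bidiagonal reduction concrete, the following is a minimal NumPy sketch, not the tile implementation described in the paper. It alternates a QR factorization of a column panel (applied from the left) with an LQ factorization of a row panel (applied from the right), which leaves an upper band matrix with `nb` superdiagonals and the same singular values as the input. The function name `band_bidiagonal_reduce` and the parameter `nb` are hypothetical, the input is assumed to be square, and whole panels are factored with dense `np.linalg.qr` calls instead of tile kernels scheduled dynamically.

```python
import numpy as np


def band_bidiagonal_reduce(A, nb):
    """Reduce a square matrix to upper band form with nb superdiagonals.

    Illustrative only: orthogonal transformations are applied from both
    sides, so the result has the same singular values as A.
    """
    B = np.array(A, dtype=float)  # work on a copy
    n = B.shape[0]
    for k in range(0, n, nb):
        # Left step: QR of the column panel zeroes everything below the
        # diagonal inside columns k .. k+nb-1.
        Q, _ = np.linalg.qr(B[k:, k:k + nb], mode="complete")
        B[k:, k:] = Q.T @ B[k:, k:]

        # Right step: LQ of the row panel (computed as QR of its transpose)
        # zeroes everything to the right of the band in rows k .. k+nb-1.
        if k + nb < n:
            Q, _ = np.linalg.qr(B[k:k + nb, k + nb:].T, mode="complete")
            B[k:, k + nb:] = B[k:, k + nb:] @ Q
    return B


def outside_band_mask(n, nb):
    """Boolean mask of entries that should be (numerically) zero."""
    i, j = np.indices((n, n))
    return (j < i) | (j > i + nb)
```

Calling `band_bidiagonal_reduce(A, nb)` on a random matrix and comparing `np.linalg.svd(..., compute_uv=False)` before and after is a quick way to check that the singular values are preserved. As the abstract notes, a second stage is still needed to bring the band matrix to true bidiagonal form; this sketch stops at the band.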

INDEX TERMS

Bidiagonal reduction, singular value decomposition, tile algorithms, multicores.

CITATION

Hatem Ltaief, Jakub Kurzak, Jack Dongarra, "Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures," *IEEE Transactions on Parallel & Distributed Systems*, vol. 21, no. 4, pp. 417-423, April 2010, doi:10.1109/TPDS.2009.79

REFERENCES

- [1] http://www.top500.org, 2009.
- [3] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J.D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, third ed. Soc. Industrial and Applied Math., 1999.
- [4] J.L. Barlow, N. Bosner, and Z. Drmač, "A New Stable Bidiagonal Reduction Algorithm," Linear Algebra and Its Applications, vol. 397, no. 1, pp. 35-84, Mar. 2005.
- [5] M.W. Berry, J.J. Dongarra, and Y. Kim, "LAPACK Working Note 68: A Highly Parallel Algorithm for the Reduction of a Nonsymmetric Matrix to Block Upper-Hessenberg Form," Technical Report UT-CS-94-221, Dept. of Computer Science, Univ. of Tennessee, Feb. 1994.
- [6] N. Bosner and J.L. Barlow, "Block and Parallel Versions of One-Sided Bidiagonalization," SIAM J. Matrix Analysis and Applications, vol. 29, no. 3, pp. 927-953, 2007.
- [7] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, "Parallel Tiled QR Factorization for Multicore Architectures," Concurrency and Computation, vol. 20, no. 13, pp. 1573-1590, 2008.
- [8] T.F. Chan, "An Improved Algorithm for Computing the Singular Value Decomposition," ACM Trans. Math. Software, vol. 8, no. 1, pp. 72-83, Mar. 1982.
- [9] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley, "ScaLAPACK, a Portable Linear Algebra Library for Distributed Memory Computers-Design Issues and Performance," Computer Physics Comm., vol. 97, nos. 1/2, pp. 1-15, 1996.
- [10] D.M. Christopher, K. Eugenia, and M. Takemasa, "Estimating and Correcting Global Weather Model Error," Monthly Weather Rev., vol. 135, no. 2, pp. 281-299, 2007.
- [11] E. Elmroth and F.G. Gustavson, "New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems," Proc. Fourth Int'l Workshop Applied Parallel Computing, Large Scale Scientific and Industrial Problems (PARA '98), pp. 120-128, June 1998.
- [12] E. Elmroth and F.G. Gustavson, "Applying Recursion to Serial and Parallel QR Factorization Leads to Better Performance," IBM J. Research and Development, vol. 44, no. 4, pp. 605-624, 2000.
- [13] E. Elmroth and F.G. Gustavson, "High-Performance Library Software for QR Factorization," Proc. Fifth Int'l Workshop, Applied Parallel Computing, New Paradigms for HPC in Industry and Academia (PARA '00), pp. 53-63, June 2000, http://dx.doi.org/10.1007/3-540-70734-4_9.
- [14] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins Studies in the Math. Sciences, third ed. Johns Hopkins Univ. Press, 1996.
- [15] G.H. Golub and W. Kahan, "Calculating the Singular Values and the Pseudo Inverse of a Matrix," SIAM J. Numerical Analysis, vol. 2, pp. 205-224, 1965.
- [16] B. Grosser and B. Lang, "Efficient Parallel Reduction to Bidiagonal Form," Parallel Computing, vol. 25, no. 8, pp. 969-986, 1999.
- [17] B.C. Gunter and R.A. van de Geijn, "Parallel Out-of-Core Computation and Updating of the QR Factorization," ACM Trans. Math. Software, vol. 31, no. 1, pp. 60-78, Mar. 2005.
- [18] J. Kurzak, A. Buttari, and J.J. Dongarra, "Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization," IEEE Trans. Parallel and Distributed Systems, vol. 19, no. 9, pp. 1175-1186, Sept. 2008.
- [19] J. Kurzak, A. Buttari, and J.J. Dongarra, "Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization," IEEE Trans. Parallel and Distributed Systems, vol. 19, no. 9, pp. 1-11, Sept. 2008.
- [20] J. Kurzak and J.J. Dongarra, "QR Factorization for the CELL Processor," J. Scientific Programming, special issue on high performance computing on CELL B.E. processors, pp. 1-12, 2008.
- [21] B. Lang, "Parallel Reduction of Banded Matrices to Bidiagonal Form," Parallel Computing, vol. 22, no. 1, pp. 1-18, 1996.
- [22] H. Ltaief, J. Kurzak, and J. Dongarra, "LAPACK Working Note 208: Parallel Block Hessenberg Reduction Using Algorithms-by-Tiles for Multicore Architectures Revisited," Technical Report UT-CS-08-624, Dept. of Computer Science, Univ. of Tennessee, Aug. 2008.
- [23] E.S. Quintana-Ortí and R.A. van de Geijn, "Updating an LU Factorization with Pivoting," ACM Trans. Math. Software, vol. 35, no. 2, July 2008.
- [24] G. Quintana-Ortí, E.S. Quintana-Ortí, E. Chan, R.A. van de Geijn, and F.G. Van Zee, "Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures," Proc. Int'l Conf. Parallel, Distributed and Network-Based Processing (PDP), pp. 301-310, 2008.
- [25] R. Ralha, "One-Sided Reduction to Bidiagonal Form," Linear Algebra and Its Applications, vol. 358, pp. 219-238, Jan. 2003.
- [26] R. Schreiber and C. Van Loan, "A Storage Efficient WY Representation for Products of Householder Transformations," SIAM J. Scientific and Statistical Computing, vol. 10, pp. 53-57, 1989.
- [27] G.W. Stewart, Matrix Algorithms Volume I: Matrix Decompositions. SIAM, 1998.
- [28] L.N. Trefethen and D. Bau, Numerical Linear Algebra. SIAM, 1997.
- [29] E.L. Yip, "Fortran Subroutines for Out-of-Core Solutions of Large Complex Linear Systems," Technical Report CR-159142, NASA, Nov. 1979.
- [30] K. Yotov, T. Roeder, K. Pingali, J. Gunnels, and F. Gustavson, "An Experimental Comparison of Cache-Oblivious and Cache-Conscious Programs," Proc. 19th Ann. ACM Symp. Parallel Algorithms and Architectures (SPAA '07), pp. 93-104, 2007.