FPGA-Based High-Performance and Scalable Block LU Decomposition Architecture
January 2012 (vol. 61 no. 1)
pp. 60-72
Manish Kumar Jaiswal, Indian Institute of Technology, Madras
Nitin Chandrachoodan, Indian Institute of Technology, Madras
Decomposition of a matrix into lower and upper triangular factors (LU decomposition) is a vital part of many scientific and engineering applications, and the block LU decomposition algorithm is well suited to parallel hardware implementation. This paper presents an approach to speeding up the block LU decomposition algorithm using FPGA hardware. Unlike most previous approaches reported in the literature, it does not assume that the matrix can be stored entirely on chip. Memory accesses are studied for various FPGA configurations, and a schedule of operations that scales well is presented. The design has been synthesized for FPGA targets and can be easily retargeted. It outperforms previous hardware implementations as well as tuned software implementations, including the ATLAS and MKL libraries running on workstations.
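
For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal NumPy sketch of a textbook right-looking block LU factorization without pivoting. The block size, function name, and test matrix are illustrative assumptions only and are not taken from the paper or its FPGA architecture.

import numpy as np

def block_lu(A, B=64):
    """In-place block LU factorization without pivoting: A = L*U, with a
    unit lower-triangular L stored below the diagonal and U on/above it.
    B is an illustrative block size."""
    A = np.array(A, dtype=float)  # work on a copy
    n = A.shape[0]
    for k in range(0, n, B):
        kb = min(B, n - k)
        # 1. Unblocked LU of the kb x kb diagonal block.
        for j in range(k, k + kb):
            A[j+1:k+kb, j] /= A[j, j]
            A[j+1:k+kb, j+1:k+kb] -= np.outer(A[j+1:k+kb, j], A[j, j+1:k+kb])
        if k + kb < n:
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            U11 = np.triu(A[k:k+kb, k:k+kb])
            # 2. Triangular solves: U12 = inv(L11)*A12 and L21 = A21*inv(U11).
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            A[k+kb:, k:k+kb] = np.linalg.solve(U11.T, A[k+kb:, k:k+kb].T).T
            # 3. Rank-kb update of the trailing submatrix; this matrix-multiply
            #    step dominates the work and is what parallel hardware exploits.
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A

# Quick check on a diagonally dominant matrix (so pivoting is not required).
rng = np.random.default_rng(0)
M = rng.standard_normal((256, 256)) + 256 * np.eye(256)
LU = block_lu(M, B=32)
L = np.tril(LU, -1) + np.eye(256)
U = np.triu(LU)
print(np.allclose(L @ U, M))

The paper's contribution concerns how such block-level operations are scheduled and how blocks are streamed between off-chip memory and the FPGA, rather than the factorization steps themselves.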

[1] A. Edelman, "Large Dense Numerical Linear Algebra in 1993: The Parallel Computing Influence," Int'l J. Supercomputer Applications, vol. 7, pp. 113-128, 1993.
[2] J.J. Dongarra and D.W. Walker, "Software Libraries for Linear Algebra Computations on High Performance Computers," SIAM Rev., vol. 37, pp. 151-180, 1995.
[3] B.A. Hendrickson and D.E. Womble, "The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers," SIAM J. Scientific Computing, vol. 15, no. 5, pp. 1201-1226, 1994.
[4] R. Harrington, "Origin and Development of the Method of Moments for Field Computation," IEEE Antennas and Propagation Magazine, vol. 32, no. 3, pp. 31-35, June 1990.
[5] J.L. Hess, "Panel Methods in Computational Fluid Dynamics," Ann. Rev. of Fluid Mechanics, vol. 22, pp. 225-274, Jan. 1990.
[6] L. Zhuo and V.K. Prasanna, "High-Performance and Parameterized Matrix Factorization on FPGAs," Proc. Int'l Conf. Field Programmable Logic and Applications (FPL '06), pp. 1-6, Aug. 2006.
[7] J.W. Demmel, N.J. Higham, and R.S. Schreiber, "Stability of Block LU Factorization," Numerical Linear Algebra with Applications, vol. 2, no. 2, pp. 173-190, 1995.
[8] J.W. Demmel and N.J. Higham, "Stability of Block Algorithms with Fast Level-3 BLAS," ACM Trans. Math. Software, vol. 18, no. 3, pp. 274-291, Sept. 1992.
[9] M.K. Jaiswal and N. Chandrachoodan, "A High Performance Implementation of LU Decomposition on FPGA," Proc. 13th VLSI Design and Test Symp. (VDAT '09), pp. 124-134, July 2009.
[10] "Automatically Tuned Linear Algebra Software (ATLAS)," http://www.netlib.orgatlas/, 2011.
[11] H.T. Kung and J. Subhlok, "A New Approach for Automatic Parallelization of Blocked Linear Algebra Computations," Supercomputing '91: Proc. ACM/IEEE Conf. Supercomputing, pp. 122-129, 1991.
[12] G. von Laszewski, M. Parashar, A.G. Mohamed, and G.C. Fox, "On the Parallelization of Blocked LU Factorization Algorithms on Distributed Memory Architectures," Supercomputing '92: Proc. ACM/IEEE Conf. Supercomputing, pp. 170-179, 1992.
[13] Y. Zhang, T. Tang, G. Li, and X. Yang, "Implementation and Optimization of Dense LU Decomposition on the Stream Processor," Parallel Processing and Applied Mathematics, R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, eds., pp. 78-88, Springer, 2008.
[14] A. Sudarsanam, S. Young, A. Dasu, and T. Hauser, "Multi-FPGA Based High Performance LU Decomposition," Proc. 10th High Performance Embedded Computing (HPEC) Workshop, Sept. 2006.
[15] S. Choi and V.K. Prasanna, "Time and Energy Efficient Matrix Factorization Using FPGA," Proc. Int'l Conf. Field-Programmable Logic and Applications (FPL '03), vol. 2278, pp. 507-519, Sept. 2003.
[16] G. Govindu, S. Choi, and V.K. Prasanna, "Efficient Floating-Point Based Block LU Decomposition on FPGAs," Proc. 11th Reconfigurable Architectures Workshop, Apr. 2004.
[17] G. Govindu, S. Choi, V. Prasanna, V. Daga, S. Gangadharpalli, and V. Sridhar, "A High-Performance and Energy-Efficient Architecture for Floating-Point Based LU Decomposition on FPGAs," Proc. 18th Int'l Parallel and Distributed Processing Symp., p. 149, Apr. 2004.
[18] L. Zhuo and V.K. Prasanna, "High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware," IEEE Trans. Computers, vol. 57, no. 8, pp. 1057-1071, Aug. 2008.
[19] W. Zhang, V. Betz, and J. Rose, "Portable and Scalable FPGA-Based Acceleration of a Direct Linear System Solver," Proc. Int'l Conf. Field-Programmable Technology (FPT '08), pp. 17-24, Dec. 2008.
[20] "SRC Supercomputers," http:/www.srccomp.com/, 2008.
[21] "SGI Supercomputers," http:/www.sgi.com/, 2011.
[22] "Cray XD1 Supercomputers," http:/www.cray.com/, 2008.
[23] N. Galoppo, N. Govindaraju, M. Henson, and D. Manocha, "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware," Proc. ACM/IEEE Conf. Supercomputing (SC), p. 3, Nov. 2005.
[24] V. Volkov and J.W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra," SC '08: Proc. ACM/IEEE Conf. Supercomputing, pp. 1-11, 2008.
[25] F. Ino, M. Matsui, K. Goda, and K. Hagihara, "Performance Study of LU Decomposition on the Programmable GPU," Proc. Int'l Conf. High Performance Computing (HiPC), vol. 3769, pp. 83-94, 2005.
[26] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, "Dense Linear Algebra Solvers for Multicore with GPU Accelerators," Proc. Int'l Workshop High-Level Parallel Programming Models and Supportive Environments (HIPS '10), Jan. 2010.
[27] M.K. Jaiswal and N. Chandrachoodan, "Efficient Implementation of Floating-Point Reciprocator on FPGA," Proc. 22nd Int'l Conf. VLSI Design (VLSID '09), pp. 267-271, 2009.
[28] M.K. Jaiswal and N. Chandrachoodan, "Efficient Implementation of IEEE Double Precision Floating-Point Multiplier on FPGA," Proc. IEEE Region 10 and the Third Int'l Conf. Industrial and Information Systems (ICIIS '08), pp. 1-4, Dec. 2008.
[29] L. Gopalakrishnan, "QDR II SRAM Interface for Virtex-5 Devices," Xilinx Application Note (XAPP853), http://www.xilinx.com/support/documentation/application_notes/xapp853.pdf, Oct. 2008.
[30] J. Sun, G. Peterson, and O. Storaasli, "High-Performance Mixed-Precision Linear Solver for FPGAs," IEEE Trans. Computers, vol. 57, no. 12, pp. 1614-1623, Dec. 2008.
[31] "AMD Core Math Library (ACML)," http://developer.amd.com/cpu/Libraries/acml/ Pagesdefault.aspx, 2011.
[32] Intel Corporation "Intel Math Kernel Library (Intel MKL) 10.2 In-Depth," http://software.intel.com/sites/products/ collateral/hpc/mklmkl_indepth.pdf, 2009.
[33] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects," J. Physics: Conference Series, vol. 180, 2009.
[34] J. Dongarra, "LINPACK Benchmarking and Beyond," http://www.netlib.org/utk/people/JackDongarra/SLIDES/dod-0610.pdf, June 2010.
[35] J. Humphrey, "CULA 2.2 Sneak Preview," http://www.culatools.com/blog/2010/09/10/cula-2-2-sneak-preview/, 2010.

Index Terms:
LU decomposition, block LU, FPGA, hardware acceleration, floating-point arithmetic, single/double precision, scaling, ATLAS, Intel MKL, GPU.
Citation:
Manish Kumar Jaiswal, Nitin Chandrachoodan, "FPGA-Based High-Performance and Scalable Block LU Decomposition Architecture," IEEE Transactions on Computers, vol. 61, no. 1, pp. 60-72, Jan. 2012, doi:10.1109/TC.2011.24