2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (2010)
Charlotte, North Carolina, USA
May 2, 2010 to May 4, 2010
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/FCCM.2010.25
To perform LU decomposition of large matrices efficiently on FPGAs with limited local memory, the original algorithm needs to be blocked. In this paper, we propose a block LU decomposition algorithm for FPGAs that is applicable to matrices of arbitrary size. We introduce a high-performance hardware design, consisting mainly of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. A total of 36 PEs can be integrated into a Xilinx Virtex-5 xc5vlx330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz, which outperforms previous work.
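To make the idea of blocking concrete, the following is a minimal software sketch of a textbook right-looking block LU factorization. It is not the hardware algorithm of the paper: the abstract does not specify the blocking scheme or whether pivoting is used, so this sketch assumes no pivoting (suitable for, e.g., diagonally dominant matrices), and the names `block_lu`, `lu_unblocked`, and the block size `b` are hypothetical. The point it illustrates is that each step touches only a `b`-wide panel plus one tile of the trailing matrix, which is what allows the working set to fit in limited local memory, and that the matrix size need not be a multiple of `b`.

```python
def lu_unblocked(A, lo, hi):
    """In-place LU (Doolittle, no pivoting) of the diagonal block A[lo:hi][lo:hi]."""
    for k in range(lo, hi):
        for i in range(k + 1, hi):
            A[i][k] /= A[k][k]                    # multiplier, stored as L[i][k]
            for j in range(k + 1, hi):
                A[i][j] -= A[i][k] * A[k][j]


def block_lu(A, b):
    """In-place block LU of a square matrix A (list of lists of floats).

    Assumption: no pivoting. On return, the strict lower triangle of A
    holds L (unit diagonal implied) and the upper triangle holds U.
    """
    n = len(A)
    for lo in range(0, n, b):
        hi = min(lo + b, n)                       # last block may be smaller than b

        # Step 1: factor the diagonal block, A11 = L11 * U11.
        lu_unblocked(A, lo, hi)

        # Step 2: row panel, U12 = L11^{-1} * A12 (forward substitution).
        for j in range(hi, n):
            for i in range(lo, hi):
                for k in range(lo, i):
                    A[i][j] -= A[i][k] * A[k][j]

        # Step 3: column panel, L21 = A21 * U11^{-1}.
        for i in range(hi, n):
            for j in range(lo, hi):
                for k in range(lo, j):
                    A[i][j] -= A[i][k] * A[k][j]
                A[i][j] /= A[j][j]

        # Step 4: trailing update, A22 = A22 - L21 * U12.
        # This matrix-multiply step dominates the operation count.
        for i in range(hi, n):
            for j in range(hi, n):
                s = 0.0
                for k in range(lo, hi):
                    s += A[i][k] * A[k][j]
                A[i][j] -= s
    return A
```

In this sketch, Step 4 is a block matrix multiply-accumulate, the kind of regular computation that maps naturally onto an array of multiply-add units; how the paper actually schedules it on the linear PE array is not stated in the abstract.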
Y. Dou, G. D. Peterson and G. Wu, "Blocking LU Decomposition for FPGAs," 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Charlotte, North Carolina, USA, 2010, pp. 109-112.