Field-Programmable Custom Computing Machines, Annual IEEE Symposium on (2010)
Charlotte, North Carolina, USA
May 2, 2010 to May 4, 2010
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/FCCM.2010.25
To perform LU decomposition of large matrices efficiently on FPGAs with limited local memory, the original algorithm needs to be blocked. In this paper, we propose a block LU decomposition algorithm for FPGAs that is applicable to matrices of arbitrary size. We introduce a high-performance hardware design, consisting mainly of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. A total of 36 PEs can be integrated into a Xilinx Virtex-5 xc5vlx330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz, which outperforms previous work.
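As an illustration of what "blocking" means here, the following is a minimal software sketch of a generic right-looking block LU decomposition without pivoting. It is a textbook formulation, not the paper's FPGA algorithm or PE mapping, which the abstract does not specify. The names `block_lu`, `reconstruct`, and the block-size parameter `bs` are hypothetical and chosen only for this example. The point it shows is that each step touches only one diagonal block, one row panel, and one column panel before updating the trailing submatrix, which is what lets the working set fit in a small local memory.

```python
def block_lu(a, bs):
    """Right-looking block LU without pivoting (illustrative sketch only).

    Returns a new matrix holding the unit lower-triangular factor L below
    the diagonal and the upper-triangular factor U on and above it.
    Assumes no zero pivots arise (e.g. a diagonally dominant input).
    """
    n = len(a)
    a = [list(map(float, row)) for row in a]  # work on a copy
    for k in range(0, n, bs):
        e = min(k + bs, n)  # last block may be smaller: arbitrary n is fine
        # 1. Factor the diagonal block A11 = L11 * U11 (unblocked LU).
        for j in range(k, e):
            for i in range(j + 1, e):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # 2. Row panel: solve L11 * U12 = A12 by forward substitution.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # 3. Column panel: solve L21 * U11 = A21.
        for r in range(e, n):
            for j in range(k, e):
                for p in range(k, j):
                    a[r][j] -= a[r][p] * a[p][j]
                a[r][j] /= a[j][j]
        # 4. Trailing update: A22 -= L21 * U12 (the matrix-multiply bulk).
        for r in range(e, n):
            for c in range(e, n):
                s = 0.0
                for p in range(k, e):
                    s += a[r][p] * a[p][c]
                a[r][c] -= s
    return a


def reconstruct(lu):
    """Multiply the packed factors back together to check L * U == A."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]  # unit diagonal of L
                s += l_ip * lu[p][j]
            out[i][j] = s
    return out
```

For example, `reconstruct(block_lu(A, 2))` should reproduce a diagonally dominant `A` up to rounding error, and the result should not depend on the chosen block size. In this formulation most of the arithmetic falls in step 4, a matrix-multiply-like update.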
Yong Dou, Gregory D. Peterson, Guiming Wu, "Blocking LU Decomposition for FPGAs", Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, pp. 109-112, 2010, doi:10.1109/FCCM.2010.25