Issue No. 08 - August (2008 vol. 57)

pp: 1057-1071

Ling Zhuo, University of Southern California, Los Angeles

Viktor K. Prasanna, University of Southern California, Los Angeles

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2008.55

ABSTRACT

Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (Field Programmable Gate Arrays) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication and matrix factorization. By identifying the parameters for each operation, we analyze the trade-offs and propose a high-performance design. In the implementations of the designs, the values of the parameters are determined according to the hardware constraints, such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources. Also, the performance of our designs compares favorably with that of general-purpose processor based designs. We also show that with faster floating-point units and larger devices, the performance of our designs increases accordingly.

INDEX TERMS

Reconfigurable hardware, Computations on matrices, Parallel algorithms

CITATION

Ling Zhuo, Viktor K. Prasanna, "High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware," *IEEE Transactions on Computers*, vol. 57, no. 8, pp. 1057-1071, August 2008, doi:10.1109/TC.2008.55