Issue No.12 - December (2008 vol.57)

pp: 1661-1675

Ling Zhuo, University of Southern California, Los Angeles

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2008.84

ABSTRACT

Recently, high-end reconfigurable computing systems have been built that employ Field Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors. These systems not only provide new opportunities for high-performance computing, but also pose new challenges to application developers. In this paper, we build a design model for hybrid designs that utilize both the processors and the FPGAs. The model characterizes a reconfigurable computing system using various parameters. Based on the model, we propose a design methodology for hardware/software co-design. The methodology partitions workload between the processors and the FPGAs, maintains load balance in the system, and realizes scalability over multiple nodes. Designs are proposed for several computationally intensive applications: matrix multiplication, matrix factorization and the Floyd-Warshall algorithm for the all-pairs shortest-paths problem. To illustrate our ideas, the proposed hybrid designs are implemented on a Cray XD1. Experimental results show that our designs utilize both the processors and the FPGAs efficiently, and overlap most of the data transfer overheads and network communication costs with the computations. Our designs achieve up to 90% of the total performance of the nodes, and 90% of the performance predicted by the design model. In addition, our designs scale over a large number of nodes.
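The load-balancing idea in the abstract can be illustrated with a small sketch. The abstract does not list the model's parameters, so this example assumes, purely for illustration, that each device is summarized by one sustained throughput figure and that the workload is a set of equal-cost rows. The function name `split_rows` and its arguments are hypothetical and are not taken from the paper.

```python
def split_rows(total_rows, cpu_rate, fpga_rate):
    """Split equal-cost rows between a processor and an FPGA.

    Rows are assigned in proportion to each device's throughput, so that
    both devices would finish at roughly the same time.

    cpu_rate and fpga_rate are hypothetical sustained rates (for example
    GFLOPS); they are illustrative inputs, not parameters from the paper.
    Returns a (cpu_rows, fpga_rows) tuple that sums to total_rows.
    """
    if total_rows < 0 or cpu_rate < 0 or fpga_rate < 0:
        raise ValueError("inputs must be non-negative")
    combined = cpu_rate + fpga_rate
    if combined == 0:
        raise ValueError("at least one device must have a positive rate")

    # The FPGA share is its fraction of the combined throughput.
    fpga_rows = round(total_rows * fpga_rate / combined)
    # The processor takes the remainder so that no row is lost to rounding.
    cpu_rows = total_rows - fpga_rows
    return cpu_rows, fpga_rows
```

For example, with an FPGA assumed to be three times faster than the processor, `split_rows(1000, 1.0, 3.0)` returns `(250, 750)`. A real hybrid design would also have to account for data transfer and network communication, which the abstract says are overlapped with computation; this sketch ignores them.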

INDEX TERMS

Algorithms implemented in hardware, Gate arrays, Heterogeneous (hybrid) systems, Computations on matrices

CITATION

Ling Zhuo, "Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems",

*IEEE Transactions on Computers*, vol. 57, no. 12, pp. 1661-1675, December 2008, doi:10.1109/TC.2008.84