Issue No.02 - February (2012 vol.23)

pp: 202-210

Yi-Gang Tai, University of Texas at San Antonio, San Antonio

Chia-Tien Dan Lo, Southern Polytechnic State University, Marietta

Kleanthis Psarris, University of Texas at San Antonio, San Antonio

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2011.141

ABSTRACT

Many scientific or engineering applications involve matrix operations, in which reduction of vectors is a common operation. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data elements cause data hazards. To tackle this problem, we propose a new reduction method with low latency and high pipeline utilization. The performance of the proposed design is evaluated for both single data set and multiple data set scenarios. Further, QR decomposition is used to demonstrate how the proposed method can accelerate its execution. We implement the design on an FPGA and compare its results to other methods.

INDEX TERMS

Reconfigurable hardware, pipeline processors, parallel algorithms, parallel and vector implementations, algorithm design and analysis.

CITATION

Yi-Gang Tai, Chia-Tien Dan Lo, Kleanthis Psarris, "Accelerating Matrix Operations with Improved Deeply Pipelined Vector Reduction",

*IEEE Transactions on Parallel & Distributed Systems*, vol. 23, no. 2, pp. 202-210, February 2012, doi:10.1109/TPDS.2011.141