Issue No.10 - October (2007 vol.18)
Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations, such as matrix-vector multiplication and dot product, involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely affect performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design tradeoffs among the number of adders, buffer size, and latency, and propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
G.1.0.g Parallel algorithms, C.3.e Reconfigurable hardware
Gerald R. Morris, Viktor K. Prasanna, "High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs", IEEE Transactions on Parallel & Distributed Systems, vol. 18, no. 10, pp. 1377-1392, October 2007, doi:10.1109/TPDS.2007.1068
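The data hazard described in the abstract can be illustrated in software. The sketch below is not the authors' circuit; it is a minimal Python model, loosely in the spirit of interleaved (strided) accumulation, under the assumption that a pipelined adder accepts one new addition per cycle and returns each result `latency` cycles later. The names `naive_cycles` and `strided_accumulate` and the parameter `latency` are hypothetical and chosen for illustration only.

```python
def naive_cycles(n, latency):
    """Rough cycle count for summing n values with one pipelined adder
    when every addition must wait for the previous result (assumed model)."""
    # n values need n - 1 dependent additions, each occupying `latency` cycles.
    return max(n - 1, 0) * latency


def strided_accumulate(values, latency):
    """Sum `values` using `latency` interleaved partial sums.

    Value i is added to partial sum i % latency. Under the assumed model,
    the previous addition into that slot was issued `latency` cycles
    earlier, so its result is already available and no stall is needed.
    """
    partial = [0.0] * latency
    for i, v in enumerate(values):
        partial[i % latency] += v

    # Combine the partial sums. Real hardware needs extra logic for this
    # final step; here it is a plain sequential loop.
    total = 0.0
    for p in partial:
        total += p
    return total


if __name__ == "__main__":
    data = [float(k) for k in range(1, 11)]
    print(strided_accumulate(data, latency=4))
    print(naive_cycles(len(data), latency=4))
```

Because floating-point addition is not associative, regrouping the additions this way can change the rounding of the result compared with a strictly sequential sum; the small integer inputs above are summed exactly either way.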