Issue No. 10 - October (2007 vol. 18)
Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations, such as matrix-vector multiplication and dot product, involve the reduction of a sequentially produced stream of values. Unfortunately, because FPGA-based floating-point units are pipelined, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can degrade performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design tradeoffs among the number of adders, buffer size, and latency, and propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implement our designs and present performance and area results.
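The data hazard described in the abstract arises because a pipelined adder cannot consume its own result on the next cycle. The following is a minimal software sketch of the general idea behind striding, not the circuit proposed in the paper: it assumes an adder with `latency` pipeline stages and interleaves the input stream across that many independent partial sums, so each addition depends only on a result produced `latency` steps earlier. The function name `strided_accumulate` and its parameters are hypothetical, chosen for illustration.

```python
def strided_accumulate(values, latency):
    """Software model of strided accumulation (illustrative assumption,
    not the paper's hardware design).

    `latency` stands in for the number of pipeline stages of the adder.
    """
    if latency < 1:
        raise ValueError("latency must be at least 1")

    # One partial sum per pipeline stage: input i is added to slot
    # i % latency, so consecutive inputs never touch the same slot.
    partial = [0.0] * latency
    for i, v in enumerate(values):
        partial[i % latency] += v

    # The partial sums still have to be combined into a single result.
    # In hardware this final step is where buffering and extra latency
    # come from; here it is a plain sequential sum.
    total = 0.0
    for p in partial:
        total += p
    return total


if __name__ == "__main__":
    print(strided_accumulate([1, 2, 3, 4, 5, 6, 7], latency=3))
```

The model captures only the dependency structure. It says nothing about cycle counts, buffer sizes, or how multiple input sets are handled back to back, which are the tradeoffs the paper analyzes.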
G.1.0.g Parallel algorithms, C.3.e Reconfigurable hardware
Viktor K. Prasanna, Ling Zhuo, Gerald R. Morris, "High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs", IEEE Transactions on Parallel & Distributed Systems, vol. 18, no. 10, pp. 1377-1392, October 2007, doi:10.1109/TPDS.2007.1068