Issue No. 02 - February (2009 vol. 58)

pp: 208-219

Nachiket Kapre, CALTECH, Pasadena

Stephanie Chan, Numerica Corp., Loveland

André DeHon, University of Pennsylvania, Philadelphia

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2008.110

ABSTRACT

Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
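
The reformulation described in the abstract can be illustrated with a small software sketch. It is not the authors' circuit, and every name in it (`lift`, `compose`, `apply_fn`, `prefix_accumulate`) is hypothetical. It assumes a signed 16-bit range and treats each saturating add as a function of the running sum of the form clamp(y + shift, lo, hi). Composing two such functions yields another function of the same form, and function composition is associative, so the running results can be computed with a parallel-prefix scan instead of a serial loop.

```python
# Assumed signed 16-bit saturation bounds (illustrative choice).
LO, HI = -32768, 32767

def clamp(v, lo, hi):
    """Limit v to the closed interval [lo, hi]."""
    return max(lo, min(v, hi))

def lift(x):
    """Represent 'saturating add of x' as a tuple (shift, lo, hi)."""
    return (x, LO, HI)

def compose(f, g):
    """Tuple for 'apply f first, then g'; the result has the same form."""
    s1, l1, h1 = f
    s2, l2, h2 = g
    return (s1 + s2, clamp(l1 + s2, l2, h2), clamp(h1 + s2, l2, h2))

def apply_fn(f, y):
    """Evaluate the clamp-of-shift function f at the starting value y."""
    s, lo, hi = f
    return clamp(y + s, lo, hi)

def serial_accumulate(xs, y0=0):
    """Reference: saturated accumulation with the loop-carried dependency."""
    out, y = [], y0
    for x in xs:
        y = clamp(y + x, LO, HI)
        out.append(y)
    return out

def prefix_accumulate(xs, y0=0):
    """Same results via an inclusive scan over composed functions.

    Each pass combines elements at distance d; the compose calls inside
    one pass are independent of each other, which is what a hardware
    implementation could evaluate in parallel (simulated serially here).
    """
    fs = [lift(x) for x in xs]
    d = 1
    while d < len(fs):
        fs = [fs[i] if i < d else compose(fs[i - d], fs[i])
              for i in range(len(fs))]
        d *= 2
    return [apply_fn(f, y0) for f in fs]

if __name__ == "__main__":
    samples = [30000, 10000, -5000, -30000, -30000, -30000, 20000]
    print(serial_accumulate(samples))
    print(prefix_accumulate(samples))
```

In this sketch the scan needs about log2(n) passes for n inputs rather than n dependent steps. Python integers are unbounded, so the intermediate `shift` values never overflow here; a fixed-width implementation would have to size that field accordingly.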

INDEX TERMS

High-speed arithmetic, pipeline and parallel arithmetic and logic structures, saturated arithmetic, accumulation, parallel prefix.

CITATION

Nachiket Kapre, Stephanie Chan, André DeHon, "Pipelining Saturated Accumulation", *IEEE Transactions on Computers*, vol. 58, no. 2, pp. 208-219, February 2009, doi:10.1109/TC.2008.110