Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques
March 2000 (vol. 49 no. 3)
pp. 208-218

Abstract: The speed of arithmetic calculations in configurable hardware is limited by carry propagation, even with the dedicated hardware found in recent FPGAs. This paper proposes and evaluates an approach called delayed addition that reduces the carry-propagation bottleneck and improves the performance of arithmetic calculations. Our approach borrows the idea used in Wallace trees: results are stored in an intermediate form, and the addition is delayed until the end of a repeated calculation such as an accumulation or dot product. This effectively removes carry-propagation overhead from the calculation's critical path. We present both integer and floating-point designs that use our technique. Our pipelined integer multiply-accumulate (MAC) design is based on a fairly traditional multiplier design, extended with delayed addition. This design achieves a 72 MHz clock rate on an XC4036xla-9 FPGA and a 170 MHz clock rate on an XCV300epq240-8 FPGA. Next, we present a 32-bit floating-point accumulator based on delayed addition. Here, delayed addition requires a novel alignment technique that decouples the incoming operands from the accumulated result. A conservative version of this design achieves a 40 MHz clock rate on an XC4036xla-9 FPGA and a 97 MHz clock rate on an XCV100epq240-8 FPGA. We also present a 32-bit floating-point accumulator design with compiler-managed overflow avoidance that achieves an 80 MHz clock rate on an XC4036xla-9 FPGA and a 150 MHz clock rate on an XCV100epq240-8 FPGA.
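
As a rough illustration of the delayed-addition idea for the integer case, the following Python sketch models an accumulator that keeps its running total in carry-save form (a sum word plus a carry word) and performs one ordinary carry-propagating addition only at the end. It is a software model under stated assumptions, not the authors' FPGA design: the function names `carry_save_add` and `delayed_accumulate` and the 32-bit default width are hypothetical choices made for this example.

```python
def carry_save_add(a, b, c, width=32):
    """3:2 compressor: reduce three words to a (sum, carry) pair.

    Each bit position acts as an independent full adder, so no carry
    ripples across the word. Names and width are illustrative assumptions.
    """
    mask = (1 << width) - 1
    partial_sum = (a ^ b ^ c) & mask
    # Majority function gives the carry out of each bit; shift it one place left.
    carries = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return partial_sum, carries


def delayed_accumulate(values, width=32):
    """Accumulate values modulo 2**width, deferring carry propagation."""
    mask = (1 << width) - 1
    partial_sum, carries = 0, 0
    for value in values:
        # Per-step work is carry-free: only XOR/AND/OR and a shift.
        partial_sum, carries = carry_save_add(
            partial_sum, carries, value & mask, width
        )
    # The single carry-propagating addition happens once, after the loop.
    return (partial_sum + carries) & mask


if __name__ == "__main__":
    data = [3, 5, 7, 11]
    print(delayed_accumulate(data), sum(data))
```

The loop maintains the invariant that `partial_sum + carries` equals the true running total modulo `2**width`, which is why the final addition recovers the same result as a conventional accumulator. In hardware, the benefit is that the per-cycle step has a delay independent of the word width.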

Index Terms:
Delayed addition, accumulation, multiply-accumulate, MAC, FPGA.
Citation:
Zhen Luo, Margaret Martonosi, "Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques," IEEE Transactions on Computers, vol. 49, no. 3, pp. 208-218, March 2000, doi:10.1109/12.841125