Issue No.07 - July (2011 vol.60)

pp: 913-922

Sameh Galal, Stanford University, Stanford

Mark Horowitz, Stanford University, Stanford

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2010.121

ABSTRACT

Energy-efficient computation is critical if we are going to continue to scale performance in power-limited systems. For floating-point applications that have large amounts of data parallelism, one should optimize the throughput/mm² given a power density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. Looking at FP multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 W/mm², one can achieve a performance of 27 GFlops/mm² single precision, and 7.5 GFlops/mm² double precision. Adding register file overheads reduces the throughput by less than 50 percent if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, to maintain constant power density with scaling requires moving the overall FP architecture to a lower energy/performance point. A 1 W/mm² design at 90 nm is a “high-energy” design, so scaling it to a lower energy design in 45 nm still yields a 7× performance gain, while a more balanced 0.1 W/mm² design only speeds up by 3.5× when scaled to 45 nm. Performance scaling below 45 nm rapidly decreases, with a projected improvement of only ~3× for both power densities when scaling to a 22 nm technology.
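The figures quoted in the abstract imply an energy cost per floating-point operation, since power density divided by throughput density gives energy per operation. The following is a minimal sketch of that arithmetic in Python. The function and constant names are hypothetical and chosen for illustration; only the input numbers (1 W/mm², 27 and 7.5 GFlops/mm², and the "less than 50 percent" register file overhead) come from the abstract, and the results apply only under its stated conditions (90 nm CMOS, FP multiply-add units, register and memory overheads ignored unless noted).

```python
# Figures quoted in the abstract: 90 nm CMOS, 1 W/mm^2, FP multiply-add units only.
POWER_DENSITY_W_PER_MM2 = 1.0
THROUGHPUT_GFLOPS_PER_MM2 = {"single": 27.0, "double": 7.5}


def energy_per_flop_pj(power_w_per_mm2: float, gflops_per_mm2: float) -> float:
    """Energy per operation in picojoules.

    (W/mm^2) / (GFlops/mm^2) = nJ per flop, so multiply by 1000 to get pJ.
    """
    return power_w_per_mm2 / gflops_per_mm2 * 1000.0


def min_throughput_with_register_file(gflops_per_mm2: float) -> float:
    """Lower bound implied by 'reduces the throughput by less than 50 percent'.

    Hypothetical helper: it only halves the quoted figure, which is the
    worst case the abstract allows when compute intensity is high.
    """
    return gflops_per_mm2 * 0.5


if __name__ == "__main__":
    for precision, gflops in THROUGHPUT_GFLOPS_PER_MM2.items():
        pj = energy_per_flop_pj(POWER_DENSITY_W_PER_MM2, gflops)
        floor = min_throughput_with_register_file(gflops)
        print(f"{precision}: {pj:.1f} pJ/flop, "
              f"more than {floor:.2f} GFlops/mm^2 with register file overhead")
```

Under these assumptions, the single-precision point works out to roughly 37 pJ per operation and the double-precision point to roughly 133 pJ per operation. These are derived values, not numbers stated by the authors.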

INDEX TERMS

Arithmetic and logic structures, high-speed arithmetic, floating point, fused multiply-add, throughput/mm² optimization.

CITATION

Sameh Galal, Mark Horowitz, "Energy-Efficient Floating-Point Unit Design",

*IEEE Transactions on Computers*, vol. 60, no. 7, pp. 913-922, July 2011, doi:10.1109/TC.2010.121