Issue No. 07 - July (2011 vol. 60)
ISSN: 0018-9340
pp: 913-922
Sameh Galal, Stanford University, Stanford
Mark Horowitz, Stanford University, Stanford
Energy-efficient computation is critical if we are to continue scaling performance in power-limited systems. For floating-point applications with large amounts of data parallelism, one should optimize throughput/mm² under a power density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance for a given set of area and power constraints. Looking at FP multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 W/mm², one can achieve 27 GFlops/mm² in single precision and 7.5 GFlops/mm² in double precision. Adding register file overheads reduces throughput by less than 50 percent if the compute intensity is high. Since the energy of basic gates is no longer scaling rapidly, maintaining constant power density with scaling requires moving the overall FP architecture to a lower energy/performance point. A 1 W/mm² design at 90 nm is a “high-energy” design, so scaling it to a lower-energy design in 45 nm still yields a 7× performance gain, while a more balanced 0.1 W/mm² design speeds up by only 3.5× when scaled to 45 nm. Performance gains from scaling below 45 nm shrink rapidly, with a projected improvement of only ~3× for both power densities when scaling to a 22 nm technology.
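The core trade-off in the abstract, maximizing throughput/mm² under a power density cap, can be illustrated with a small model. The sketch below is not the authors' method: it assumes each FP unit design point is summarized by two hypothetical numbers (energy per flop and peak throughput density), that energy per flop stays fixed when throughput is reduced (for example by leaving part of the area idle), and that leakage is negligible. The class `DesignPoint`, the helper functions, and the three design points are all made up for illustration and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """A hypothetical FP unit design point (one circuit/pipelining choice)."""
    name: str
    energy_pj_per_flop: float   # energy per floating-point op, in pJ
    peak_gflops_per_mm2: float  # throughput density with the whole area active


def power_density_w_per_mm2(p: DesignPoint) -> float:
    """Power density when the design runs at its peak throughput density."""
    # pJ/flop * GFlop/(s*mm^2) = 1e-12 J * 1e9 /(s*mm^2) = 1e-3 W/mm^2
    return p.energy_pj_per_flop * p.peak_gflops_per_mm2 * 1e-3


def achievable_gflops_per_mm2(p: DesignPoint, budget_w_per_mm2: float) -> float:
    """Throughput density under a power density cap.

    Assumption: energy/op is unchanged when throughput is reduced, and
    leakage is ignored.
    """
    if power_density_w_per_mm2(p) <= budget_w_per_mm2:
        return p.peak_gflops_per_mm2  # area-limited
    return budget_w_per_mm2 / (p.energy_pj_per_flop * 1e-3)  # power-limited


def best_design(points, budget_w_per_mm2: float) -> DesignPoint:
    """Pick the design point with the highest throughput/mm^2 under the budget."""
    return max(points, key=lambda p: achievable_gflops_per_mm2(p, budget_w_per_mm2))


def implied_energy_pj_per_flop(gflops_per_mm2: float, budget_w_per_mm2: float) -> float:
    """Energy per flop implied if the power budget is fully used.

    (W/mm^2) / (GFlop/s/mm^2) = nJ/flop; multiply by 1e3 to get pJ/flop.
    """
    return budget_w_per_mm2 / gflops_per_mm2 * 1e3


# Made-up design points for illustration only (NOT results from the paper).
points = [
    DesignPoint("fast", 60.0, 40.0),        # high energy, high peak density
    DesignPoint("balanced", 30.0, 30.0),
    DesignPoint("low-energy", 15.0, 12.0),  # low energy, low peak density
]

if __name__ == "__main__":
    for budget in (1.0, 0.1):
        best = best_design(points, budget)
        print(budget, best.name, round(achievable_gflops_per_mm2(best, budget), 2))
    # Energy/flop implied by the abstract's single-precision figure,
    # assuming the full 1 W/mm^2 budget is used (FMA units only).
    print(round(implied_energy_pj_per_flop(27.0, 1.0), 1))
```

With these invented numbers, the 1 W/mm² budget selects the "balanced" point, while the tighter 0.1 W/mm² budget favors the "low-energy" point, mirroring the abstract's argument that a lower power density pushes the design toward a lower energy/performance point. Read the other way, and assuming the budget is fully consumed, 27 GFlops/mm² at 1 W/mm² corresponds to roughly 37 pJ per single-precision flop, and 7.5 GFlops/mm² to roughly 133 pJ per double-precision flop, excluding register and memory overheads.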
Index Terms: Arithmetic and logic structures, high-speed arithmetic, floating point, fused multiply-add, throughput/mm² optimization.
Sameh Galal, Mark Horowitz, "Energy-Efficient Floating-Point Unit Design", IEEE Transactions on Computers, vol. 60, no. 7, pp. 913-922, July 2011, doi:10.1109/TC.2010.121