A Floating-Point Unit for 4D Vector Inner Product with Reduced Latency
July 2009 (vol. 58 no. 7)
pp. 890-901
Donghyun Kim, Qualcomm Inc., San Diego
Lee-Sup Kim, Korea Advanced Institute of Science and Technology, Daejeon
This paper presents the algorithm and implementation of a new high-performance functional unit for the floating-point four-dimensional vector inner product (4D dot product; DP4), the operation performed most frequently in 3D graphics applications. The proposed IEEE-compliant DP4 unit computes Z = AB + CD + EF + GH in one path and preserves the intermediate rounding, using IEEE-754 round-to-nearest-even. To reduce latency, the proposed architecture merges the intermediate rounding with shift alignment and omits intermediate carry-propagated addition and normalization. The proposed DP4 unit is implemented in 0.18-μm CMOS technology and has a 12.8-ns critical path delay, which is 45.5 percent shorter than that of a previous DP4 implementation using discrete multipliers and adders. The proposed DP4 unit also reduces the cycle time of 3D graphics applications by 12.4 percent on average compared to the usual 3D graphics FPU based on four-way multiply-add-fused units.
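To make concrete what "preserving the intermediate rounding" means, the following is a minimal software sketch of a DP4 built from discrete single-precision multipliers and adders, with every intermediate result rounded to nearest even. It is not the paper's hardware algorithm. The names f32 and dp4_reference are hypothetical, and the pairwise order of the additions, (AB + CD) + (EF + GH), is an assumption, since the abstract does not state the summation order.

```python
import struct


def f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (ties to even).

    Raises OverflowError if x is too large for binary32.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dp4_reference(a, b, c, d, e, f, g, h):
    """Z = AB + CD + EF + GH with rounding after every operation.

    Assumption: additions are paired as (AB + CD) + (EF + GH).
    Each operation is evaluated in binary64 and then rounded to binary32,
    which matches a correctly rounded binary32 operation for + and *.
    """
    a, b, c, d, e, f, g, h = (f32(v) for v in (a, b, c, d, e, f, g, h))
    p0 = f32(a * b)  # four rounded products
    p1 = f32(c * d)
    p2 = f32(e * f)
    p3 = f32(g * h)
    s0 = f32(p0 + p1)  # two rounded partial sums
    s1 = f32(p2 + p3)
    return f32(s0 + s1)  # final rounded sum


# Small integers are exact in binary32, so no rounding occurs here.
print(dp4_reference(1, 2, 3, 4, 5, 6, 7, 8))

# Intermediate rounding is visible here: x*x = 1 + 2**-11 + 2**-24 exactly,
# but the rounded product is 1 + 2**-11, so the sum cancels to zero.
x = 1 + 2**-12
print(dp4_reference(x, x, -(1 + 2**-11), 1.0, 0.0, 0.0, 0.0, 0.0))
```

The first call prints 100.0. The second prints 0.0, whereas the exact mathematical result is 2**-24; a unit that preserves intermediate rounding is expected to return the same value as this kind of discrete model, not the exactly accumulated one.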


Index Terms:
Floating-point arithmetic, vector inner product, DP4, 3D graphics.
Donghyun Kim, Lee-Sup Kim, "A Floating-Point Unit for 4D Vector Inner Product with Reduced Latency," IEEE Transactions on Computers, vol. 58, no. 7, pp. 890-901, July 2009, doi:10.1109/TC.2008.210