Issue No.07 - July (2009 vol.58)

pp: 890-901

Donghyun Kim, Qualcomm Inc., San Diego

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2008.210

ABSTRACT

This paper presents the algorithm and implementation of a new high-performance functional unit for the floating-point four-dimensional vector inner product (4D dot product; DP4), which is the most frequently performed operation in 3D graphics applications. The proposed IEEE-compliant DP4 unit computes Z = AB + CD + EF + GH in one path and preserves intermediate rounding using IEEE-754 round-to-nearest-even. In the proposed architecture, the intermediate rounding is merged with shift alignment, and intermediate carry-propagated addition and normalization are omitted to reduce latency. The proposed DP4 unit is implemented in 0.18-µm CMOS technology and has a critical path delay of 12.8 ns, which is 45.5 percent shorter than that of a previous DP4 implementation using discrete multipliers and adders. The proposed DP4 unit also reduces the cycle time of 3D graphics applications by 12.4 percent on average compared to the usual 3D graphics FPU based on four-way multiply-add-fused units.
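As a rough illustration of the numerical behavior the abstract refers to, the sketch below models a DP4 built from discrete multipliers and adders, where every intermediate result is rounded to nearest even. It is not the paper's hardware algorithm. The single-precision (binary32) operand format, the summation order (AB + CD) + (EF + GH), and the names `f32` and `dp4_discrete` are assumptions made for this example.

```python
import struct


def f32(x: float) -> float:
    """Round a binary64 Python float to the nearest binary32 value.

    struct uses the platform's double-to-float conversion, which on
    IEEE-754 hardware in the default mode is round-to-nearest-even.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dp4_discrete(a, b, c, d, e, f, g, h):
    """Z = AB + CD + EF + GH with binary32 rounding after every operation."""
    a, b, c, d, e, f, g, h = (f32(v) for v in (a, b, c, d, e, f, g, h))
    # The product of two binary32 values is exact in binary64, so a single
    # f32() call yields the correctly rounded binary32 product.
    ab, cd, ef, gh = f32(a * b), f32(c * d), f32(e * f), f32(g * h)
    # Assumed summation order: a two-level tree, (AB + CD) + (EF + GH).
    return f32(f32(ab + cd) + f32(ef + gh))
```

Under these assumptions, `dp4_discrete(2.0**24, 1.0, 1.0, 1.0, -(2.0**24), 1.0, 0.0, 0.0)` returns 0.0 although the exact sum is 1.0: the intermediate value 2^24 + 1 lies halfway between two binary32 numbers and rounds to the even neighbor 2^24. A unit that keeps the intermediate rounding would reproduce results of this kind rather than the exactly accumulated sum.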

INDEX TERMS

Floating-point arithmetic, vector inner product, DP4, 3D graphics.

CITATION

Donghyun Kim, "A Floating-Point Unit for 4D Vector Inner Product with Reduced Latency," *IEEE Transactions on Computers*, vol. 58, no. 7, pp. 890-901, July 2009, doi:10.1109/TC.2008.210