IEEE Transactions on Computers, vol. 61, no. 5, May 2012, pp. 745-751
Nong Xiao, School of Computer, National University of Defense Technology, Changsha, China
Zhiying Wang, School of Computer, National University of Defense Technology, Changsha, China
Li Shen, School of Computer, National University of Defense Technology, Changsha, China
Sheng Ma, School of Computer, National University of Defense Technology, Changsha, China
Libo Huang, School of Computer, National University of Defense Technology, Changsha, China
ABSTRACT
Binary64 arithmetic is rapidly becoming inadequate for today's large-scale computations because rounding errors accumulate over many operations; binary128 arithmetic is therefore needed to increase the accuracy and reliability of these computations. At the same time, a clear trend in modern processors is to extend instruction sets with single instruction, multiple data (SIMD) execution, which can significantly accelerate data-parallel applications. To address these combined demands, this paper presents the architecture of a low-cost binary128 floating-point fused multiply-add (FMA) unit with SIMD support. The proposed design can execute one binary128 FMA every other cycle with a latency of four cycles, or two binary64 FMAs fully pipelined with a latency of three cycles, or four binary32 FMAs fully pipelined with a latency of three cycles. It combines two binary64 FMA units to perform a binary128 FMA, which requires much less hardware than a fully pipelined binary128 FMA unit. The design applies both segmentation and iteration hardware vectorization methods to trade off performance (throughput and latency) against area and power. Compared with a standard binary128 FMA implementation, the proposed design occupies 30 percent less area and dissipates 29 percent less dynamic power.
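The core idea of the paper, obtaining a wider result from binary64 FMA hardware, has a well-known software analogue. The sketch below (plain C99, not the authors' hardware design) uses the standard fma() function from <math.h> in the classic TwoProdFMA error-free transformation: because fma computes a*b+c with a single rounding, it recovers the exact rounding error of a binary64 product, which is the kind of primitive that extended-precision schemes build on. The helper name two_prod_fma is illustrative only.

    /* Software analogue only: the TwoProdFMA error-free transformation,
     * shown to illustrate why a single-rounding fused multiply-add is the
     * key primitive for building higher precision from binary64 units. */
    #include <math.h>
    #include <stdio.h>

    /* Split the exact product a*b into a rounded head and an exact tail.
     * fma(a, b, -hi) is evaluated with one rounding, so the result is
     * exactly the rounding error of the multiplication. */
    static void two_prod_fma(double a, double b, double *hi, double *lo) {
        *hi = a * b;            /* product rounded to binary64 */
        *lo = fma(a, b, -*hi);  /* exact residual: a*b - hi */
    }

    int main(void) {
        double hi, lo;
        double x = 1.0 + 0x1p-30;       /* C99 hex-float literal */
        two_prod_fma(x, x, &hi, &lo);
        /* hi rounds to 1 + 2^-29; lo recovers the lost 2^-60 term. */
        printf("hi = %.17g, lo = %.17g\n", hi, lo);
        return 0;
    }

Compile with a C99 compiler and link the math library (e.g., cc -std=c99 two_prod.c -lm). The hardware design in the paper takes the complementary approach: rather than composing binary64 results in software, it reuses two binary64 FMA datapaths iteratively to produce a true binary128 result.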
INDEX TERMS
pipeline arithmetic, parallel processing, dynamic power dissipation, low-cost binary128 floating-point FMA unit design, SIMD support, binary64 arithmetic, single instruction multiple data execution, data-parallel applications, binary32 FMAs, segmentation hardware, iteration hardware, vectorization methods, computer architecture, adders, hardware, multiplexing, program processors, pipelines, computer arithmetic, floating point, binary128, fused multiply-add, SIMD, implementation
CITATION
Nong Xiao, Zhiying Wang, Li Shen, Sheng Ma, Libo Huang, "Low-Cost Binary128 Floating-Point FMA Unit Design with SIMD Support," IEEE Transactions on Computers, vol. 61, no. 5, pp. 745-751, May 2012, doi: 10.1109/TC.2011.77