Architecture and Implementation of a Vector/SIMD Multiply-Accumulate Unit
March 2005 (vol. 54 no. 3)
pp. 284-293
This paper presents a 64-bit fixed-point vector multiply-accumulator (MAC) architecture capable of supporting multiple precisions. The vector MAC can perform one 64×64, two 32×32, four 16×16, or eight 8×8-bit signed/unsigned multiply-accumulates using essentially the same hardware as a scalar 64-bit MAC and with only a small increase in delay. The scalar MAC architecture is “vectorized” by inserting mode-dependent multiplexing into the partial product generation and by inserting mode-dependent kills in the carry chain of the reduction tree and the final carry-propagate adder. This is an example of “shared segmentation,” in which the existing scalar structure is segmented and then shared between vector modes. The vector MAC is area efficient and can be fully pipelined, which makes it suitable for high-performance processors and, possibly, dynamically reconfigurable processors. The “shared segmentation” method is compared to an alternative method, referred to as the “shared subtree” method, by implementing vector MAC designs using two different technologies and three different vector widths.
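The idea of killing carries at lane boundaries can be illustrated with a small behavioral model. The Python sketch below is not the paper's hardware design: the function names (`lanes`, `to_signed`, `segmented_add`, `vector_mac`) are hypothetical, the adder is a simple bit-serial ripple model rather than a reduction tree or fast carry-propagate adder, and the accumulators are modeled as unbounded Python integers because the abstract does not state an accumulator width.

```python
def lanes(word, width):
    """Split a 64-bit word into unsigned lanes of `width` bits, lowest lane first."""
    mask = (1 << width) - 1
    return [(word >> shift) & mask for shift in range(0, 64, width)]


def to_signed(value, width):
    """Reinterpret an unsigned `width`-bit value as two's complement."""
    return value - (1 << width) if value >> (width - 1) else value


def segmented_add(a, b, width):
    """Add two 64-bit words, killing the carry at every lane boundary.

    With width=64 this is an ordinary 64-bit add (modulo 2**64); with a
    smaller width the same bit loop yields independent lane-wise sums.
    """
    result, carry = 0, 0
    for bit in range(64):
        if bit % width == 0:
            carry = 0  # mode-dependent carry kill (assumed placement)
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        result |= (x ^ y ^ carry) << bit
        carry = (x & y) | (carry & (x ^ y))
    return result


def vector_mac(acc, a, b, width, signed=False):
    """Reference model: acc[i] + a[i] * b[i] for each lane of `width` bits.

    `acc` is a list of Python ints, one per lane; no accumulator
    overflow is modeled (an assumption of this sketch).
    """
    out = []
    for acc_i, a_i, b_i in zip(acc, lanes(a, width), lanes(b, width)):
        if signed:
            a_i, b_i = to_signed(a_i, width), to_signed(b_i, width)
        out.append(acc_i + a_i * b_i)
    return out
```

For instance, `segmented_add(0x00FF, 0x0001, 8)` returns `0x0000` because the carry out of the low byte is killed, whereas the same call with `width=16` returns `0x0100`. The `vector_mac` function serves only as a lane-wise reference against which a segmented datapath model could be checked.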

Index Terms:
Parallel, high-speed arithmetic, multimedia, data-path design, VLSI, MAC, multiply-accumulate, multiplier, vector, SIMD, Booth, Wallace, signed, unsigned, integer, fixed-point.
Albert Danysh, Dimitri Tan, "Architecture and Implementation of a Vector/SIMD Multiply-Accumulate Unit," IEEE Transactions on Computers, vol. 54, no. 3, pp. 284-293, March 2005, doi:10.1109/TC.2005.41