Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements
August 2003 (vol. 52 no. 8)
pp. 1015-1031

Abstract: Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that the attributes of current general-purpose processors enhanced with SIMD extensions do not match the access patterns and loop structures of media programs very well. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions needed to feed the SIMD execution units rather than true, useful computations. As a result, the SIMD execution units are underutilized: only 1 to 12 percent of their peak throughput is achieved. Rather than exploiting more data-level parallelism (DLP), in this paper we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction-level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze provides better performance than a 16-way processor with current SIMD extensions. On application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration is achieved at the cost of a 10 percent increase over the area required by the MMX and SSE extensions (a 0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.

Index Terms:
Media processing, subword parallelism, bottlenecks in SIMD extensions, workload characterization, performance evaluation, hardware address generation, low-overhead looping, data reorganization, superscalar general-purpose processors.
Deepu Talla, Lizy Kurian John, Doug Burger, "Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements," IEEE Transactions on Computers, vol. 52, no. 8, pp. 1015-1031, Aug. 2003, doi:10.1109/TC.2003.1223637