Issue No.05 - May (2012 vol.61)
pp: 650-665
Carlos Álvarez Martínez, UPC, Barcelona
Jesús Corbal San Adrián, Intel Corporation, Barcelona
Mateo Valero Cortés, UPC, Barcelona
Multimedia computation has become a primary and demanding application segment for new architectures targeted at portable devices. The main challenge for such architectures is to keep pace with the computational requirements of ever-evolving media standards and applications while meeting the power and energy constraints imposed by smaller form factors and longer battery lifetimes. One technique aimed at reducing both the energy consumption and the execution time of an application is reuse. This technique memorizes the outcome of an instruction or set of instructions so that the result can be reused the next time the same operation is performed with the same inputs. In this paper, we analyze a region reuse scheme specially focused on multimedia applications. While the technique appears, in theory, to be a promising vehicle to improve both timing and energy for low-end media applications, we show that the extra hardware cost required becomes a severe shortcoming: we find the undesirable situation in which more energy must be consumed in order to reduce the execution time (making it a poor power-oriented solution). To mitigate the overhead of the reuse hardware, we advocate exploiting a third variable in the power-time trade-off, and we evaluate tolerant region reuse, a technique that relies on the tolerance in the output precision of media algorithms to improve reuse. With this technique, we can afford to use less energy-consuming hardware structures that yield benefits in both energy and timing. As a trade-off, tolerant region reuse introduces non-noticeable errors in the output data. The main drawbacks of tolerant region reuse are its strong reliance on application profiling, the need for careful tuning by the application developer, and the inability of the technique to adapt to the variability of the media contents being used as inputs. To address that inflexibility, we introduce dynamic tolerant region reuse.
This novel technique overcomes the drawbacks of tolerant region reuse by allowing the hardware to study the precision of the region reuse output. Our mechanism allows the programmer to guarantee a minimum threshold on signal-to-noise ratio (SNR) while letting the technique adapt to the characteristics of the specific application and workload to minimize time and energy consumption. This leads to greater energy-delay savings while keeping output error below noticeable levels and, at the same time, avoiding the need for profiling. We studied our mechanism applied to a set of three different processors, from low end to high end. As we will show, our technique leads to consistent performance improvements in all of our benchmark programs while reducing energy consumption. We report savings of up to 30 percent in the energy*delay factor for all three processors.
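The ideas summarized in the abstract can be illustrated with a small software model. The sketch below is not the hardware mechanism evaluated in the paper. It assumes that tolerance is obtained by ignoring low-order bits of integer inputs when indexing a reuse table, and that the tolerance is tuned offline against an SNR threshold. All names (`TolerantReuseTable`, `snr_db`, `pick_tolerance`) are hypothetical.

```python
import math


class TolerantReuseTable:
    """Toy model of a reuse table for a code region (assumed design).

    The lookup key drops `tolerance_bits` low-order bits from each integer
    input, so nearly identical inputs share one stored result.
    """

    def __init__(self, region, tolerance_bits=0):
        self.region = region
        self.tolerance_bits = tolerance_bits
        self.table = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, *inputs):
        key = tuple(x >> self.tolerance_bits for x in inputs)
        if key in self.table:
            self.hits += 1          # reuse: skip executing the region
            return self.table[key]
        self.misses += 1
        result = self.region(*inputs)
        self.table[key] = result
        return result


def snr_db(exact, approx):
    """Signal-to-noise ratio in dB between exact and approximate outputs."""
    signal = sum(e * e for e in exact)
    noise = sum((e - a) ** 2 for e, a in zip(exact, approx))
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)


def pick_tolerance(region, samples, min_snr_db, max_bits=8):
    """Return the largest tolerance (in bits) that keeps SNR >= min_snr_db.

    Stops at the first tolerance level that violates the threshold.
    """
    exact = [region(*s) for s in samples]
    best = 0
    for bits in range(max_bits + 1):
        table = TolerantReuseTable(region, bits)
        approx = [table(*s) for s in samples]
        if snr_db(exact, approx) < min_snr_db:
            break
        best = bits
    return best
```

With `tolerance_bits=0` the table behaves as exact region reuse. Raising the tolerance increases the hit rate at the cost of output error, which is the trade-off the SNR threshold is meant to bound.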
Special-purpose hardware, energy-aware systems, multimedia applications and multimedia signal processing, mobile applications.
Carlos Álvarez Martínez, Jesús Corbal San Adrián, Mateo Valero Cortés, "Dynamic Tolerance Region Computing for Multimedia", IEEE Transactions on Computers, vol.61, no. 5, pp. 650-665, May 2012, doi:10.1109/TC.2011.79