Evaluating Integrated Hardware-Software Optimizations Using a Unified Energy Estimation Framework
January 2003 (vol. 52 no. 1)
pp. 59-76

Abstract: With the emergence of a plethora of embedded and portable applications, energy dissipation has joined throughput, VLSI layout area, and accuracy/precision as a major design constraint. Designers must therefore both estimate and optimize the energy consumption of circuits, architectures, and software. Most research in energy optimization and estimation has focused on single components of the system and has not looked across the interacting spectrum of hardware and software. The novelty of our energy estimation framework, SimplePower, is that it evaluates energy for the system as a whole rather than as a sum of parts, and that it supports compiler and architectural experimentation concurrently. We present the design and use of the SimplePower framework, which includes a transition-sensitive, cycle-accurate datapath energy model that interfaces with analytical and transition-sensitive energy models for the memory, clock, and bus subsystems, respectively. Such an architectural-level energy estimation framework is invaluable for making good energy-conscious decisions early in the design cycle. We analyzed the energy consumption of 10 codes from the multidimensional array domain, which is important for embedded video and signal processing systems. Our study shows that the pipeline registers and the register file are the datapath energy hotspots, consuming 58-70 percent of overall datapath energy, and that clocking the on-chip memory structures is the major source of the on-chip clock network's energy consumption. Further, we find that the off-chip main memory is the overall energy bottleneck of the entire system. However, applying high-level compiler optimizations reduces the main memory energy significantly, making the contributions of the data cache, on-chip clock network, instruction cache, and datapath more important.
We find that the improved locality of the optimized codes not only reduces accesses to main memory but also exploits the more energy-efficient cache architectures much better than unoptimized codes do. With the most-recently-used way-prediction cache scheme, optimized codes saved 21 percent more energy than unoptimized codes from the multidimensional array domain. We also observe that emerging technologies such as embedded DRAM, coupled with a combination of energy-efficient circuit, architectural, and compiler optimizations, can potentially shift the energy hotspot. Thus, early estimates from the SimplePower energy estimation framework can help identify the system energy hotspots and enable architects and compiler designers to focus their efforts on these areas.
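
As a rough illustration of what "transition-sensitive" means here, the sketch below charges energy only for bus lines that change value between consecutive cycles, using the standard switching-energy term 0.5 * C * Vdd^2 per toggle. This is not SimplePower's actual model: the function names, the per-line capacitance, and the supply voltage are hypothetical placeholders chosen for the example.

```python
def count_toggles(prev: int, curr: int, width: int = 32) -> int:
    """Number of bus lines whose value differs between two cycles."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def estimate_bus_energy(values, cap_per_line_f=1e-12, vdd=3.3, width=32):
    """Estimate switching energy (joules) for a sequence of bus values.

    cap_per_line_f and vdd are assumed placeholder values, not figures
    from the paper. The bus is assumed to start at all zeros.
    """
    energy_per_toggle = 0.5 * cap_per_line_f * vdd ** 2
    total_toggles = 0
    prev = 0
    for value in values:
        total_toggles += count_toggles(prev, value, width)
        prev = value
    return total_toggles * energy_per_toggle


if __name__ == "__main__":
    # Alternating values toggle many lines; repeated values toggle none.
    print(estimate_bus_energy([0xF, 0x0, 0xF]))
    print(estimate_bus_energy([0xF, 0xF, 0xF]))
```

Under this kind of model, two traces with the same number of accesses can have very different energy, which is why the order and values of operands matter and not only the operation count.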

Index Terms:
Energy estimation, simulation, energy models, optimizations.
N. Vijaykrishnan, Mahmut Kandemir, Mary Jane Irwin, Hyun Suk Kim, Wu Ye, David Duarte, "Evaluating Integrated Hardware-Software Optimizations Using a Unified Energy Estimation Framework," IEEE Transactions on Computers, vol. 52, no. 1, pp. 59-76, Jan. 2003, doi:10.1109/TC.2003.1159754