Frequent Loop Detection Using Efficient Nonintrusive On-Chip Hardware
October 2005 (vol. 54 no. 10)
pp. 1203-1215
Dynamic software optimization methods are becoming increasingly popular for improving software performance and power. The first step in dynamic optimization consists of detecting frequently executed code, or "critical regions." Most previous critical region detectors have been targeted to desktop processors. We introduce a critical region detector targeted to embedded processors, with the unique features of being highly size- and power-efficient and being completely nonintrusive to the software's execution, features needed in timing-sensitive embedded systems. Our detector not only finds the critical regions, but also determines their relative frequencies, a potentially important feature for selecting among alternative dynamic optimization methods. Our detector uses a tiny cache-like structure coupled with a small amount of logic. We provide results of extensive explorations across 19 embedded system benchmarks. We show that highly accurate results can be achieved with only a 0.02 percent power overhead, acceptable size overhead, and zero runtime overhead. Our detector is currently being used as part of a dynamic hardware/software partitioning approach, but is applicable to a wide variety of situations.
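To make the idea of a "tiny cache-like structure coupled with a small amount of logic" more concrete, the following Python sketch models one way such a detector might behave in software. It is an illustration under stated assumptions, not the design described in the article: the table size, counter width, direct-mapped indexing, 4-byte instruction alignment, the use of taken backward branches as loop markers, and the halve-all-counters-on-saturation policy are all hypothetical choices, as are the names `LoopDetector`, `observe_branch`, and `hottest`.

```python
class LoopDetector:
    """Software model of a small direct-mapped table of saturating counters.

    All parameters and policies here are assumptions made for illustration.
    """

    def __init__(self, num_entries=16, counter_bits=8):
        self.num_entries = num_entries            # hypothetical table size
        self.max_count = (1 << counter_bits) - 1  # hypothetical counter width
        self.tags = [None] * num_entries          # branch address held in each entry
        self.counts = [0] * num_entries           # execution count for each entry

    def observe_branch(self, pc, target):
        """Record one taken branch at address pc that jumps to target."""
        if target >= pc:
            return  # forward branch: not treated as a loop in this sketch
        index = (pc >> 2) % self.num_entries      # assumes 4-byte instructions
        if self.tags[index] != pc:
            # Conflict or first use: the new branch replaces the old entry.
            self.tags[index] = pc
            self.counts[index] = 0
        self.counts[index] += 1
        if self.counts[index] >= self.max_count:
            # On saturation, halve every counter so that the ratios
            # between entries are roughly preserved.
            self.counts = [count >> 1 for count in self.counts]

    def hottest(self):
        """Return (branch address, count) pairs, most frequent first."""
        entries = [(tag, count)
                   for tag, count in zip(self.tags, self.counts)
                   if tag is not None]
        return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    detector = LoopDetector()
    for _ in range(30):
        detector.observe_branch(pc=0x100, target=0x80)   # frequently taken loop
    for _ in range(10):
        detector.observe_branch(pc=0x204, target=0x200)  # less frequent loop
    detector.observe_branch(pc=0x300, target=0x400)      # forward branch, ignored
    for address, count in detector.hottest():
        print(hex(address), count)
```

Because the model only watches branch addresses and never alters the program being observed, it mirrors the nonintrusive property the abstract emphasizes; halving all counters together is one simple way to retain relative frequencies with narrow counters.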

Index Terms: Frequent value profiling, runtime profiling, on-chip profiling, hardware profiling, frequent loop detection, hot spot detection, dynamic optimization.
Ann Gordon-Ross, Frank Vahid, "Frequent Loop Detection Using Efficient Nonintrusive On-Chip Hardware," IEEE Transactions on Computers, vol. 54, no. 10, pp. 1203-1215, Oct. 2005, doi:10.1109/TC.2005.165