An Architectural Framework for Runtime Optimization
June 2001 (vol. 50 no. 6)
pp. 567-589

Abstract—Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Runtime optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying runtime-optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components add no delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for runtime optimization because the hot execution regions are extracted, optimized, and written to main memory for execution, and because these regions persist across context switches. The current design of the framework supports a suite of optimizations, including partial function inlining (even into shared libraries), code straightening, loop unrolling, and peephole optimizations.
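The two ideas in the abstract, a retirement-stage filter that identifies hot regions, and a lookup table that redirects fetch into optimized traces without touching the original code, can be illustrated with a loose software model. This is a sketch of the concept only, not the hardware algorithm from the paper; all class names, thresholds, and table sizes are invented for illustration.

```python
# Illustrative model: per-branch counters mark frequently executed
# ("candidate") branches, a saturating hot-spot counter detects when
# retirement stays inside a hot region, and a redirection table maps
# original entry PCs to optimized-trace addresses at fetch time.
# All constants are made up for this sketch.
from collections import Counter

CANDIDATE_THRESHOLD = 16   # executions before a branch counts as a candidate
HOTSPOT_MAX = 64           # saturating counter ceiling; 0 means "in a hot spot"

class HotSpotDetector:
    def __init__(self):
        self.branch_exec = Counter()        # per-branch execution counts
        self.hotspot_counter = HOTSPOT_MAX  # counts down toward the hot state

    def retire_branch(self, pc):
        """Observe one retired branch; return True once a hot spot is detected."""
        self.branch_exec[pc] += 1
        if self.branch_exec[pc] >= CANDIDATE_THRESHOLD:
            # retirement is inside a frequently executed region: count down
            self.hotspot_counter = max(0, self.hotspot_counter - 2)
        else:
            # cold branch retired: drift back toward the "not hot" state
            self.hotspot_counter = min(HOTSPOT_MAX, self.hotspot_counter + 1)
        return self.hotspot_counter == 0

class RedirectionTable:
    """Maps original entry PCs to optimized-trace addresses, consulted at fetch."""
    def __init__(self):
        self.entries = {}

    def install(self, original_pc, trace_pc):
        self.entries[original_pc] = trace_pc

    def lookup(self, fetch_pc):
        # Redirect fetch into optimized code; the original text is unmodified.
        return self.entries.get(fetch_pc, fetch_pc)

# A tight loop repeatedly retiring the same branch eventually trips detection,
# after which an optimized trace can be deployed for that entry point.
detector = HotSpotDetector()
hot = False
for _ in range(60):
    hot = detector.retire_branch(0x400100) or hot
table = RedirectionTable()
if hot:
    table.install(0x400100, 0x7F0000)
```

The key property the sketch tries to capture is that detection and deployment sit off the critical path: profiling observes retired instructions, and deployment only adds a table consulted alongside the branch predictor, so unoptimized code runs unchanged.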

[1] M.C. Merten, A.R. Trick, C.N. George, J.C. Gyllenhaal, and W.W. Hwu, “A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization,” Proc. 26th Int'l Symp. Computer Architecture, pp. 136-147, May 1999.
[2] M.C. Merten, A.R. Trick, E.M. Nystrom, R.D. Barnes, and W.W. Hwu, “A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots,” Proc. 27th Int'l Symp. Computer Architecture, pp. 59-70, June 2000.
[3] W.W. Hwu, S.A. Mahlke, W.Y. Chen, P.P. Chang, N.J. Warter, R.A. Bringmann, R.G. Ouellette, R.E. Hank, T. Kiyohara, G.E. Haab, J.G. Holm, and D.M. Lavery, “The Superblock: An Effective Technique for VLIW and Superscalar Compilation,” J. Supercomputing, vol. 7, pp. 9-50, 1993.
[4] T. Ball and J.R. Larus, “Branch Prediction for Free,” Proc. ACM SIGPLAN 1993 Conf. Programming Language Design and Implementation, pp. 300-313, June 1993.
[5] B.L. Deitrich, B.C. Cheng, and W.W. Hwu, “Improving Static Branch Prediction in a Compiler,” Proc. 18th Ann. Int'l Conf. Parallel Architectures and Compilation Techniques, pp. 214-221, Oct. 1998.
[6] T. Ball and J.R. Larus, “Optimally Profiling and Tracing Programs,” ACM Trans. Programming Languages and Systems, vol. 16, no. 4, pp. 1319-1360, July 1994.
[7] J. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. on Operating System Principles, ACM Press, New York, 1997, pp. 1-14.
[8] X. Zhang, Z. Wang, N. Gloy, J.B. Chen, and M.D. Smith, “System Support for Automatic Profiling and Optimization,” Proc. 16th ACM Symp. Operating Systems Principles, pp. 15-26, Oct. 1997.
[9] G. Ammons, T. Ball, and J. Larus, “Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling,” Proc. ACM SIGPLAN 97 Conf. Programming Language Design and Implementation, June 1997.
[10] T.M. Conte, K.N. Menezes, and M.A. Hirsch, “Accurate and Practical Profile-Driven Compilation Using the Profile Buffer,” Proc. 29th Ann. Int'l Symp. Microarchitecture, pp. 36-45, Dec. 1996.
[11] J. Dean et al., “ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors,” Proc. 30th Symp. Microarchitecture (Micro-30), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 292-302.
[12] K. Ebcioğlu and E.R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility,” Proc. ISCA 24, ACM Press, New York, 1997, pp. 26-37.
[13] R.J. Hookway and M.A. Herdeg, “Digital FX!32: Combining Emulation and Binary Translation,” Digital Technical J., vol. 9, no. 1, 1997, pp. 3-12.
[14] Transmeta, “The Technology behind Crusoe Processors,” technical report, Transmeta, 2000.
[15] M. Gschwind, E. Altman, S. Sathaye, P. Ledak, and D. Appenzeller, “Dynamic and Transparent Binary Translation,” Computer, pp. 54-59, Mar. 2000.
[16] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: A Transparent Dynamic Optimization System,” Proc. ACM SIGPLAN '00 Conf. Programming Language Design and Implementation, pp. 1-12, June 2000.
[17] D. Deaver, R. Gorton, and N. Rubin, “Wiggins/Redstone: An On-Line Program Specializer,” Proc. Hot Chips 11, Aug. 1999.
[18] W.-K. Chen, S. Lerner, R. Chaiken, and D.M. Gilles, “Mojo: A Dynamic Optimization System,” Proc. Third ACM Workshop Feedback-Directed and Dynamic Optimization, Dec. 2000.
[19] A.-R. Adl-Tabatabai, M. Cierniak, G.-Y. Lueh, V.M. Parikh, and J.M. Stichnoth, “Fast, Effective Code Generation in a Just-in-Time Java Compiler,” Proc. ACM SIGPLAN '98 Conf. Programming Language Design and Implementation, pp. 280-290, June 1998.
[20] E. Duesterwald and V. Bala, “Software Profiling for Hot Path Prediction: Less Is More,” Proc. Ninth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 202-211, Dec. 2000.
[21] B. Grant, M. Mock, M. Philipose, C. Chambers, and S. Eggers, “Annotation-Directed Run-Time Specialization in C,” Proc. ACM SIGPLAN Symp. Partial Evaluation and Semantics-Based Program Manipulation (PEPM), pp. 163-178, June 1997.
[22] M. Poletto, D. Engler, and M. Kaashoek, “tcc: A System for Fast, Flexible, and High-Level Dynamic Code Generation,” Proc. ACM SIGPLAN '97 Conf. Programming Language Design and Implementation, pp. 109-121, June 1997.
[23] M. Mock, C. Chambers, and S.J. Eggers, “Calpa: A Tool for Automating Selective Dynamic Compilation,” Proc. 33rd Int'l Symp. Microarchitecture, pp. 291-302, Dec. 2000.
[24] D.A. Connors and W.W. Hwu, “Compiler-Directed Computation Reuse: Rationale and Initial Results,” Proc. 32nd Ann. Int'l Symp. Microarchitecture, pp. 158-169, Nov. 1999.
[25] W.W. Hwu and Y. Patt, “Checkpoint Repair for High Performance Out-of-Order Execution Machines,” IEEE Trans. Computers, vol. 36, no. 12, pp. 1496-1514, Dec. 1987.
[26] E. Rotenberg, S. Bennett, and J. Smith, “Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching,” Proc. 29th Ann. ACM/IEEE Int'l Symp. Microarchitecture, IEEE CS Press, Los Alamitos, Calif., 1996, pp. 24-34.
[27] D.H. Friendly, S.J. Patel, and Y.N. Patt, “Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors,” Proc. 30th Ann. ACM/IEEE Int'l Symp. Microarchitecture, 1997.
[28] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J.E. Smith, “Trace Processors,” Proc. 30th Int'l Symp. Microarchitecture, pp. 138-148, 1997.
[29] Q. Jacobson and J.E. Smith, “Instruction Pre-Processing in Trace Processors,” Proc. Fifth Int'l Symp. High-Performance Computer Architecture, pp. 125-129, Jan. 1999.
[30] S.J. Patel and S.S. Lumetta, “rePLay: A Hardware Framework for Dynamic Program Optimization,” Technical Report CRHC-99-16, Center for Reliable and High-Performance Computing, Univ. of Illinois, Urbana, Dec. 1999.
[31] S.J. Patel, T. Tung, S. Bose, and M. Crum, “Increasing the Size of Atomic Instruction Blocks by Using Control Flow Assertions,” Proc. 33rd Ann. IEEE/ACM Int'l Symp. Microarchitecture, pp. 303-316, Dec. 2000.
[32] D.I. August, D.A. Connors, S.A. Mahlke, J.W. Sias, K.M. Crozier, B. Cheng, P.R. Eaton, Q.B. Olaniran, and W.W. Hwu, “Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture,” Proc. 25th Int'l Symp. Computer Architecture, pp. 227-237, June 1998.
[33] S. McFarling, “Combining Branch Predictors,” Technical Report TN-36, Digital, WRL, June 1993.
[34] S.J. Patel, D.H. Friendly, and Y.N. Patt, “Evaluation of Design Options for the Trace Cache Fetch Mechanism,” IEEE Trans. Computers, special issue on cache memory and related problems, vol. 48, no. 2, pp. 193-204, Feb. 1999.

Index Terms:
Postlink optimization, runtime optimization, dynamic optimization, hardware profiling, low-overhead profiling, code layout, program hot spot, partial function inlining, trace formation and optimization.
Matthew C. Merten, Andrew R. Trick, Ronald D. Barnes, Erik M. Nystrom, Christopher N. George, John C. Gyllenhaal, Wen-mei W. Hwu, "An Architectural Framework for Runtime Optimization," IEEE Transactions on Computers, vol. 50, no. 6, pp. 567-589, June 2001, doi:10.1109/12.931894