rePLay: A Hardware Framework for Dynamic Optimization
IEEE Transactions on Computers, June 2001 (vol. 50, no. 6), pp. 590-608

Abstract—In this paper, we propose a new processor framework that supports dynamic optimization. The rePLay Framework embeds an optimization engine atop a high-performance execution engine. The heart of the rePLay Framework is the concept of a frame. Frames are large, single-entry, single-exit optimization regions spanning many basic blocks in the program's dynamic instruction stream, yet containing only a single flow of control. This atomic property of frames increases the flexibility in applying optimizations. To support frames, rePLay includes a hardware-based recovery mechanism that rolls back the architectural state to the beginning of a frame if, for example, an early exit condition is detected. This mechanism permits the optimizer to make speculative, aggressive optimizations upon frames. In this paper, we investigate some of the underlying phenomena that support rePLay. Primarily, we evaluate rePLay's region formation strategy. A rePLay configuration with a 256-entry frame cache, using a 74KB frame constructor and frame sequencer, achieves an average frame size of 88 Alpha AXP instructions with 68 percent coverage of the dynamic instruction stream, an average frame completion rate of 97.81 percent, and a frame predictor accuracy of 81.26 percent. These results soundly demonstrate that the frames upon which the optimizations are performed are large and stable. Using the most frequently initiated frames from rePLay executions as samples, we also highlight possible strategies for the rePLay optimization engine. Coupled with the high coverage of frames achieved through dynamic frame construction, the success of these optimizations demonstrates the significance of the rePLay Framework. We believe that the concept of frames, along with the mechanisms and strategies outlined in this paper, will play an important role in future processor architectures.
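To make the frame mechanism concrete, the sketch below (not part of the paper itself; the names Frame, Machine, EarlyExit, and assert_taken are illustrative) models atomic frame execution in software: architectural state is checkpointed at frame entry, a failed control-flow assertion inside the frame triggers rollback to the checkpoint, and a frame that runs to its single exit commits atomically.

    # Illustrative sketch only: a software model of rePLay-style atomic
    # frame execution with checkpoint/rollback. All names are hypothetical.

    import copy

    class EarlyExit(Exception):
        """Raised when a frame's control-flow assertion fails."""

    class Frame:
        def __init__(self, start_pc, ops):
            self.start_pc = start_pc   # single entry point of the frame
            self.ops = ops             # optimized straight-line operations

    def assert_taken(cond_reg):
        """A promoted branch becomes an assertion inside the frame."""
        def op(m):
            if not m.regs.get(cond_reg):
                raise EarlyExit()      # asserted direction was wrong: abort frame
        return op

    class Machine:
        def __init__(self):
            self.regs = {}             # architectural register state
            self.pc = 0

        def execute_frame(self, frame):
            # Checkpoint architectural state at frame entry.
            checkpoint = (copy.deepcopy(self.regs), self.pc)
            try:
                for op in frame.ops:
                    op(self)           # ops may raise EarlyExit on a failed assertion
                return True            # frame completed: state commits atomically
            except EarlyExit:
                self.regs, self.pc = checkpoint  # roll back to frame entry
                return False           # resume normal (unoptimized) execution

In this model, a frame sequencer would consult the frame cache at each fetch: on a hit it dispatches execute_frame, and on an early exit it resumes conventional fetch at the checkpointed PC. With completion rates like those reported above (97.81 percent), rollbacks are rare, which is what makes aggressive optimization within frames worthwhile.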

Index Terms:
High-performance microarchitecture, dynamic optimization, trace caches.
Citation:
Sanjay J. Patel, Steven S. Lumetta, "rePLay: A Hardware Framework for Dynamic Optimization," IEEE Transactions on Computers, vol. 50, no. 6, pp. 590-608, June 2001, doi:10.1109/12.931895