A Complexity-Effective Out-of-Order Retirement Microarchitecture
December 2009 (vol. 58 no. 12)
pp. 1626-1639
Salvador Petit Martí, Universidad Politécnica de Valencia, Spain
Julio Sahuquillo Borrás, Universidad Politécnica de Valencia, Spain
Pedro López Rodríguez, Universidad Politécnica de Valencia, Spain
Rafael Ubal Tena, Universidad Politécnica de Valencia, Spain
José Duato Marín, Universidad Politécnica de Valencia, Spain
Current superscalar processors commit instructions in program order by using a reorder buffer (ROB). The ROB provides support for speculation, precise exceptions, and register reclamation. However, committing instructions in program order may lead to significant performance degradation if a long-latency operation blocks the ROB head. Several proposals have been published to deal with this problem, and most of them retire instructions speculatively. Because speculation may fail, these proposals need checkpoints to roll the processor back to a precise state. Checkpointing requires both extra hardware to manage the checkpoints and the enlargement of other major processor structures, which, in turn, might impact the processor cycle time. This paper focuses on out-of-order commit in a nonspeculative way, thus avoiding checkpointing. To this end, we replace the ROB with a validation buffer (VB) structure. This structure keeps dispatched instructions only until they are known to be nonspeculative or misspeculated, which allows early retirement. By doing so, the performance bottleneck is largely alleviated. An aggressive register reclamation mechanism targeted at this microarchitecture is also devised. As experimental results show, the VB structure is much more efficient than a typical ROB: with only 32 entries, it achieves performance close to that of an in-order commit microprocessor using a 256-entry ROB.
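
The contrast between the two retirement policies can be illustrated with a toy model. The sketch below is not the authors' simulator or design. It assumes a FIFO buffer, one dispatch per cycle, and two hypothetical per-instruction delays: `latency` (cycles until execution completes) and `safe_after` (cycles until the instruction is known to be nonspeculative). It does not model squashing of misspeculated instructions or register reclamation. It only counts the cycles in which dispatch stalls because the buffer is full.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    latency: int     # cycles from dispatch until execution completes
    safe_after: int  # cycles from dispatch until known nonspeculative (assumed)


def dispatch_stalls(program, capacity, release_delay):
    """Count cycles in which dispatch stalls because the FIFO buffer is full.

    release_delay(instr) gives the number of cycles after dispatch at which
    an entry may leave the buffer once it has reached the head.
    """
    buffer = deque()          # release cycle of each in-flight entry, oldest first
    pending = deque(program)  # instructions not yet dispatched
    cycle = 0
    stalls = 0
    while pending or buffer:
        # Entries leave only from the head, and only once releasable.
        while buffer and buffer[0] <= cycle:
            buffer.popleft()
        if pending:
            if len(buffer) < capacity:
                instr = pending.popleft()
                buffer.append(cycle + release_delay(instr))
            else:
                stalls += 1
        cycle += 1
    return stalls


# Made-up workload: one long-latency load followed by six short operations.
program = [Instr("load", latency=20, safe_after=2)]
program += [Instr(f"alu{i}", latency=1, safe_after=1) for i in range(1, 7)]

# ROB-like policy: the head leaves only after it has finished executing.
rob_stalls = dispatch_stalls(program, capacity=2, release_delay=lambda i: i.latency)
# VB-like policy: the head leaves as soon as it is known to be nonspeculative.
vb_stalls = dispatch_stalls(program, capacity=2, release_delay=lambda i: i.safe_after)

print("ROB-like stalls:", rob_stalls)
print("VB-like stalls:", vb_stalls)
```

With these made-up numbers and a two-entry buffer, the in-order policy stalls dispatch for 18 cycles behind the load, while the validation-based policy never stalls. The figures say nothing about real workloads; they only show why releasing an entry before its instruction completes can let a small buffer behave like a much larger one.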

Index Terms:
Instruction-level parallelism, out-of-order commit, long latency operations, control dependencies, exception handling.
Salvador Petit Martí, Julio Sahuquillo Borrás, Pedro López Rodríguez, Rafael Ubal Tena, José Duato Marín, "A Complexity-Effective Out-of-Order Retirement Microarchitecture," IEEE Transactions on Computers, vol. 58, no. 12, pp. 1626-1639, Dec. 2009, doi:10.1109/TC.2009.95