DIA: A Complexity-Effective Decoding Architecture
April 2009 (vol. 58 no. 4)
pp. 448-462
Oliverio J. Santana, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria
Ayose Falcón, HP Labs, Spain
Alex Ramirez, Universitat Politècnica de Catalunya, Barcelona
Mateo Valero, Universitat Politècnica de Catalunya, Barcelona
Fast instruction decoding is a major challenge in the design of CISC microprocessors that implement variable-length instructions. A well-known solution is to cache decoded instructions in a hardware buffer: fetching already decoded instructions avoids decoding them again, which improves processor performance. However, adding such special-purpose storage to the processor design significantly increases the complexity of the fetch architecture. In this paper, we propose a novel decoding architecture that reduces the implementation cost of the fetch engine. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from memory instead of the original nondecoded instructions. Our results show that, using our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements while requiring less chip area and consuming less energy in the fetch architecture than a hardware code caching mechanism.
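
As a rough illustration of the idea in the abstract, the following Python sketch models a front end in which each branch predictor entry can carry the address of a decoded copy held in ordinary memory. It is a toy model under stated assumptions, not the authors' design: the names `PredictorEntry` and `ToyFrontEnd`, the `hot_threshold` policy for deciding what counts as "frequently decoded", the `0x8000` base address, and the string-splitting stand-in for decoding are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PredictorEntry:
    """Hypothetical predictor entry: next address plus a decoded-copy pointer."""
    next_pc: int                        # predicted next fetch address
    decoded_addr: Optional[int] = None  # where a decoded copy lives, if any
    hits: int = 0                       # how many times this entry was used


class ToyFrontEnd:
    """Toy fetch engine guided by the predictor (all policies are assumptions)."""

    def __init__(self, code: Dict[int, Tuple[str, int]], hot_threshold: int = 2):
        self.code = code                            # pc -> (raw instruction, next pc)
        self.decoded_mem: Dict[int, List[str]] = {} # decoded region of "memory"
        self.predictor: Dict[int, PredictorEntry] = {}
        self.hot_threshold = hot_threshold
        self.next_free = 0x8000                     # assumed base of decoded region
        self.decode_count = 0

    def decode(self, raw: str) -> List[str]:
        """Stand-in for costly variable-length decoding."""
        self.decode_count += 1
        return raw.split()

    def fetch(self, pc: int) -> Tuple[List[str], int]:
        raw, next_pc = self.code[pc]
        entry = self.predictor.setdefault(pc, PredictorEntry(next_pc))
        entry.hits += 1
        if entry.decoded_addr is not None:
            # The predictor points at a decoded copy: skip the decoder.
            return self.decoded_mem[entry.decoded_addr], entry.next_pc
        uops = self.decode(raw)
        if entry.hits >= self.hot_threshold:
            # Store the decoded form in memory and record its address
            # in the predictor entry so later fetches can find it.
            entry.decoded_addr = self.next_free
            self.decoded_mem[self.next_free] = uops
            self.next_free += 1
        return uops, entry.next_pc


def run_loop(front_end: ToyFrontEnd, start_pc: int, steps: int) -> None:
    """Follow predicted next addresses for a fixed number of fetches."""
    pc = start_pc
    for _ in range(steps):
        _, pc = front_end.fetch(pc)
```

In this sketch, running a small loop repeatedly invokes `decode` only until each instruction crosses `hot_threshold`; after that, fetches are served from `decoded_mem` through the address stored in the predictor entry, with no dedicated decoded-instruction buffer in the model.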

Index Terms:
Superscalar processor design, CISC instruction decoding, variable-length ISA, branch predictor, code caching.
Citation:
Oliverio J. Santana, Ayose Falcón, Alex Ramirez, Mateo Valero, "DIA: A Complexity-Effective Decoding Architecture," IEEE Transactions on Computers, vol. 58, no. 4, pp. 448-462, April 2009, doi:10.1109/TC.2008.170