Optimizations Enabled by a Decoupled Front-End Architecture
April 2001 (vol. 50 no. 4)
pp. 338-355

Abstract—In the pursuit of instruction-level parallelism, significant demands are placed on a processor's instruction delivery mechanism. Delivering the performance necessary to meet future processor execution targets requires that the performance of the instruction delivery mechanism scale with the execution core. Attaining these targets is a challenging task due to I-cache misses, branch mispredictions, and taken branches in the instruction stream. To counter these challenges, we present a fetch architecture that decouples the branch predictor from the instruction fetch unit. A Fetch Target Queue (FTQ) is inserted between the branch predictor and instruction cache. This allows the branch predictor to run far in advance of the address currently being fetched by the cache. The decoupling enables a number of architecture optimizations, including multilevel branch predictor design, fetch-directed instruction prefetching, and easier pipelining of the instruction cache. For the multilevel predictor, we show that it performs better than a single-level predictor, even when ignoring the effects of cycle-timing issues. We also examine the performance of fetch-directed instruction prefetching using a multilevel branch predictor and show that an average 19 percent speedup is achieved. In addition, we examine pipelining the instruction cache to achieve a faster cycle time for the processor pipeline and show that pipelining provides an average 27 percent speedup over not pipelining the instruction cache for the programs examined.
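The decoupling described in the abstract can be illustrated with a toy cycle-level model. This is only a sketch under simplifying assumptions, not the authors' simulator: the names `FetchTargetQueue` and `simulate` are hypothetical, the predictor is assumed to be always correct and to produce one fetch block per cycle, at most one prefetch is issued per cycle, and any number of misses may be outstanding. It shows how a queue of predicted fetch addresses lets a prefetcher start I-cache fills before the fetch unit reaches those blocks.

```python
from collections import deque


class FetchTargetQueue:
    """Bounded FIFO of predicted fetch-block addresses (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def full(self):
        return len(self.entries) >= self.capacity

    def push(self, addr):
        # The predictor stalls (push fails) when the queue is full.
        if self.full():
            return False
        self.entries.append(addr)
        return True

    def pop(self):
        return self.entries.popleft() if self.entries else None


def simulate(blocks, cached, ftq_capacity, miss_latency, prefetch):
    """Return the number of cycles needed to fetch every block in `blocks`.

    Assumptions (illustrative only): perfect prediction, one predicted
    block per cycle, one prefetch issued per cycle, unlimited outstanding
    misses, and a cache that never evicts.
    """
    ftq = FetchTargetQueue(ftq_capacity)
    cache = set(cached)
    inflight = {}            # block -> cycle at which its fill completes
    pending = deque(blocks)  # blocks the predictor has not yet produced
    current = None           # block the fetch unit is working on
    fetched = 0
    cycle = 0
    while fetched < len(blocks):
        # 1. Complete any fills that are due this cycle.
        for blk in [b for b, ready in inflight.items() if ready <= cycle]:
            cache.add(blk)
            del inflight[blk]
        # 2. The predictor runs ahead, filling the FTQ while there is room.
        if pending and not ftq.full():
            ftq.push(pending.popleft())
        # 3. Fetch-directed prefetch: scan the FTQ for a block that would miss.
        if prefetch:
            for blk in ftq.entries:
                if blk not in cache and blk not in inflight:
                    inflight[blk] = cycle + miss_latency
                    break
        # 4. The fetch unit consumes the head of the FTQ.
        if current is None:
            current = ftq.pop()
        if current is not None:
            if current in cache:
                fetched += 1
                current = None
            elif current not in inflight:
                # Demand miss: start the fill and wait for it.
                inflight[current] = cycle + miss_latency
        cycle += 1
    return cycle


if __name__ == "__main__":
    trace = [0, 1, 2, 3]
    print("no prefetch:", simulate(trace, set(), 4, 10, prefetch=False))
    print("prefetch:   ", simulate(trace, set(), 4, 10, prefetch=True))
```

In this model the misses are serialized when prefetching is off, whereas with prefetching the fills for queued blocks overlap with the first miss. The absolute cycle counts carry no meaning beyond the toy; the speedups reported in the abstract come from the authors' own experiments.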

Index Terms:
Decoupled architectures, branch prediction, instruction prefetching, fetch architectures.
Citation:
Glenn Reinman, Brad Calder, Todd Austin, "Optimizations Enabled by a Decoupled Front-End Architecture," IEEE Transactions on Computers, vol. 50, no. 4, pp. 338-355, April 2001, doi:10.1109/12.919279