Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques
November 1999 (vol. 48 no. 11)
pp. 1260-1281

Abstract—Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Trade-offs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or overstate the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed trade-offs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.

Index Terms:
Microarchitecture, trade-offs, branch prediction, cache, sampling, simulation, out-of-order execution, instruction window size, register-update unit.
Citation:
Kevin Skadron, Pritpal S. Ahuja, Margaret Martonosi, Douglas W. Clark, "Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques," IEEE Transactions on Computers, vol. 48, no. 11, pp. 1260-1281, Nov. 1999, doi:10.1109/12.811115