Branch Target Buffer Design and Optimization
April 1993 (vol. 42 no. 4)
pp. 396-412

A branch target buffer (BTB) can reduce the performance penalty of branches in pipelined processors by predicting the path of the branch and caching information used by the branch. Two major issues in designing BTBs to achieve maximum performance with a limited number of bits allocated to the BTB implementation are discussed. The first is BTB management. A method for discarding branches from the BTB is examined. This method discards the branch with the smallest expected value for improving performance; it outperforms the least recently used (LRU) strategy by a small margin, at the cost of additional complexity. The second issue is what information to store in the BTB. A BTB entry can consist of one or more of the following: branch tag, prediction information, the branch target address, and instructions at the branch target. Various BTB designs, with one or more of these fields, are evaluated and compared.
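
As a rough illustration of the entry fields and the LRU baseline named in the abstract, the following Python sketch models a small BTB. It is not the design evaluated in the paper: the fully associative organization, the 2-bit saturating counter used as prediction information, the allocate-on-taken rule, and all names (`BTBEntry`, `BTB`, `predict`, `update`) are assumptions made here for illustration. The expected-value discard policy is not reproduced, because the abstract does not define how the expected value is computed.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class BTBEntry:
    tag: int                   # branch tag (here simply the full branch address)
    target: int                # cached branch target address
    counter: int = 2           # assumed 2-bit counter: 0-1 not taken, 2-3 taken
    target_instrs: tuple = ()  # optional instructions at the branch target


class BTB:
    """Fully associative BTB with LRU replacement (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first

    def predict(self, pc):
        """Return the predicted target, or None on a miss or a not-taken prediction."""
        entry = self.entries.get(pc)
        if entry is None:
            return None
        self.entries.move_to_end(pc)  # mark as most recently used
        return entry.target if entry.counter >= 2 else None

    def update(self, pc, taken, target):
        """Record the resolved outcome of the branch at address pc."""
        entry = self.entries.get(pc)
        if entry is None:
            if not taken:
                return  # assumption: only taken branches are allocated
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # discard the LRU entry
            self.entries[pc] = BTBEntry(tag=pc, target=target)
            return
        self.entries.move_to_end(pc)
        if taken:
            entry.counter = min(3, entry.counter + 1)
            entry.target = target
        else:
            entry.counter = max(0, entry.counter - 1)
```

A different discard policy, such as one based on an expected-value estimate, could be tried in this sketch by replacing the `popitem(last=False)` call with a search for the lowest-scoring entry.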

Index Terms:
branch target buffer design; optimization; performance penalty; branches; pipelined processors; caching; least recently used; complexity; branch tag; prediction information; branch target address; instructions; buffer storage; instruction sets; pipeline processing.
Citation:
C.H. Perleberg, A.J. Smith, "Branch Target Buffer Design and Optimization," IEEE Transactions on Computers, vol. 42, no. 4, pp. 396-412, April 1993, doi:10.1109/12.214687