A Trace Cache Microarchitecture and Evaluation
February 1999 (vol. 48 no. 2)
pp. 111-120
Abstract: As the instruction issue width of superscalar processors increases, instruction fetch bandwidth requirements will also increase. It will eventually become necessary to fetch multiple basic blocks per clock cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. Trace caches overcome this limitation by caching traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. In this paper, we present and evaluate a microarchitecture incorporating a trace cache. The microarchitecture provides high instruction fetch bandwidth with low latency by explicitly sequencing through the program at the higher level of traces, both in terms of 1) control flow prediction and 2) instruction supply. For the SPEC95 integer benchmarks, trace-level sequencing improves performance by 15 percent to 35 percent over an otherwise equally sophisticated, but contiguous, multiple-block fetch mechanism. Most of this performance improvement is due to the trace cache. However, for one benchmark whose performance is limited by branch mispredictions, the performance gain is almost entirely due to improved prediction accuracy.
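The idea of caching traces of the dynamic instruction stream can be illustrated with a minimal software sketch. This is not the microarchitecture evaluated in the paper: the limits `MAX_INSTRS` and `MAX_BRANCHES`, the `Instr` record, and the `TraceCache` class are hypothetical names chosen for illustration, and the cache here is an unbounded dictionary with no capacity or replacement policy. The sketch assumes a trace is identified by its starting address plus the outcomes of the branches inside it.

```python
from dataclasses import dataclass

MAX_INSTRS = 16    # assumed limit on instructions per trace (illustrative)
MAX_BRANCHES = 3   # assumed limit on branches per trace (illustrative)

@dataclass(frozen=True)
class Instr:
    pc: int                  # instruction address
    is_branch: bool = False  # True for conditional branches
    taken: bool = False      # outcome observed in the dynamic stream

def build_traces(stream):
    """Split a dynamic instruction stream into traces.

    A trace ends when it reaches MAX_INSTRS instructions or
    contains MAX_BRANCHES branches, whichever comes first.
    """
    traces, current, branches = [], [], 0
    for ins in stream:
        current.append(ins)
        branches += ins.is_branch
        if len(current) == MAX_INSTRS or branches == MAX_BRANCHES:
            traces.append(current)
            current, branches = [], 0
    if current:
        traces.append(current)
    return traces

def trace_id(trace):
    """Identify a trace by its start address and embedded branch outcomes."""
    outcomes = tuple(i.taken for i in trace if i.is_branch)
    return (trace[0].pc, outcomes)

class TraceCache:
    """Toy trace cache: maps a trace identifier to the trace's addresses."""

    def __init__(self):
        self.lines = {}

    def fill(self, trace):
        self.lines[trace_id(trace)] = [i.pc for i in trace]

    def fetch(self, start_pc, predicted_outcomes):
        # A hit supplies the whole noncontiguous sequence in one lookup;
        # a miss returns None.
        return self.lines.get((start_pc, tuple(predicted_outcomes)))

# Dynamic stream that jumps from address 2 to 10 (taken branch),
# falls through the branch at 11, and ends on a taken branch at 13.
stream = [
    Instr(0), Instr(1), Instr(2, is_branch=True, taken=True),
    Instr(10), Instr(11, is_branch=True, taken=False),
    Instr(12), Instr(13, is_branch=True, taken=True),
]

cache = TraceCache()
for t in build_traces(stream):
    cache.fill(t)

hit = cache.fetch(0, [True, False, True])
miss = cache.fetch(0, [True, True, True])
```

In this sketch, `hit` holds addresses 0, 1, 2, 10, 11, 12, 13 as one contiguous list even though the instructions span two separate regions of the program, while `miss` is `None` because no trace with that predicted path has been filled.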
Index Terms:
Instruction cache, instruction fetching, multiple branch prediction, superscalar processors, trace cache
Citation:
Eric Rotenberg, Steve Bennett, James E. Smith, "A Trace Cache Microarchitecture and Evaluation," IEEE Transactions on Computers, vol. 48, no. 2, pp. 111-120, Feb. 1999, doi:10.1109/12.752652