Multilevel Optimization of Pipelined Caches
October 1997 (vol. 46 no. 10)
pp. 1093-1102

Abstract: This paper formulates and shows how to solve the problem of selecting the cache size and depth of cache pipelining that maximize the performance of a given instruction-set architecture. The solution combines trace-driven architectural simulations with timing analysis of the physical implementation of the cache. Increasing cache size tends to improve performance, but this improvement is limited because cache access time increases with cache size. This trade-off results in an optimization problem we refer to as multilevel optimization, because it requires the simultaneous consideration of two levels of machine abstraction: the architectural level and the physical implementation level. The introduction of pipelining permits the use of larger caches without increasing their apparent access time; however, the bubbles caused by load and branch delays limit this technique. In this paper we also show how multilevel optimization can be applied to pipelined systems if software- and hardware-based strategies are considered for hiding the branch and load delays.

The multilevel optimization technique is illustrated with the design of a pipelined cache for a high-clock-rate MIPS-based architecture. The results of this design exercise show that, because processors with pipelined caches can have shorter CPU cycle times and larger caches, a significant performance advantage is gained by using two or three pipeline stages to fetch data from the cache. Of course, the results are optimal only for the implementation technologies chosen for the design exercise; other choices could result in quite different optimal designs. The exercise is intended primarily to illustrate the steps in the design of pipelined caches using multilevel optimization; however, it also exemplifies the importance of pipelined caches if high-clock-rate processors are to achieve high performance.
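
The trade-off described in the abstract can be sketched as a small search over cache size and pipeline depth. The sketch below is illustrative only and is not the authors' model: every number in it is a hypothetical placeholder, and the names (MISS_RATE, ACCESS_NS, STALL_PER_EXTRA_STAGE, best_design, and so on) are assumptions introduced here. In the paper's method the miss rates would come from trace-driven simulation and the access times from timing analysis of the physical cache; here both are stand-in tables. The cost function assumes time per instruction = CPI × cycle time + miss rate × miss penalty, with each extra cache pipeline stage adding a fixed number of unhidden load/branch bubble cycles per instruction.

```python
from itertools import product

# All values are hypothetical placeholders, NOT data from the paper.
# Architectural level: misses per instruction by cache size in KB
# (would come from trace-driven simulation).
MISS_RATE = {4: 0.060, 8: 0.045, 16: 0.034, 32: 0.026, 64: 0.021}
# Physical level: cache access time in ns by cache size in KB
# (would come from timing analysis of the implementation).
ACCESS_NS = {4: 3.0, 8: 3.5, 16: 4.2, 32: 5.0, 64: 6.0}

MISS_PENALTY_NS = 40.0        # assumed fixed main-memory latency
CPU_FLOOR_NS = 2.0            # assumed cycle-time floor set by non-cache logic
LATCH_NS = 0.3                # assumed per-stage latch overhead
STALL_PER_EXTRA_STAGE = 0.08  # assumed unhidden bubble cycles per instruction


def cycle_time_ns(size_kb, depth):
    """Cycle time when the cache access is split across `depth` stages."""
    return max(CPU_FLOOR_NS, ACCESS_NS[size_kb] / depth + LATCH_NS)


def time_per_instruction_ns(size_kb, depth):
    """Average time per instruction for one (size, depth) design point."""
    cpi = 1.0 + STALL_PER_EXTRA_STAGE * (depth - 1)
    return cpi * cycle_time_ns(size_kb, depth) + MISS_RATE[size_kb] * MISS_PENALTY_NS


def best_design(sizes, depths):
    """Exhaustively pick the (size, depth) pair with the lowest time per instruction."""
    return min(product(sizes, depths), key=lambda sd: time_per_instruction_ns(*sd))


if __name__ == "__main__":
    size_kb, depth = best_design(sorted(MISS_RATE), (1, 2, 3))
    tpi = time_per_instruction_ns(size_kb, depth)
    print(f"best: {size_kb} KB, {depth} stage(s), {tpi:.2f} ns/instruction")
```

With these placeholder tables the search selects a 32 KB cache with three pipeline stages, while restricting the depth to one stage selects an 8 KB cache. That only mirrors the qualitative point of the abstract (deeper cache pipelines make larger caches affordable); the specific values carry no meaning beyond the assumptions listed above.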


Index Terms:
Optimizing cache design, trace-driven simulation, multichip modules, pipelining, caches, cache access times, macromodels of delay.
Kunle Olukotun, Trevor N. Mudge, Richard B. Brown, "Multilevel Optimization of Pipelined Caches," IEEE Transactions on Computers, vol. 46, no. 10, pp. 1093-1102, Oct. 1997, doi:10.1109/12.628394