High Bandwidth On-Chip Cache Design
April 2001 (vol. 50 no. 4)
pp. 292-307

Abstract: In this paper, we evaluate the performance of high bandwidth cache organizations employing multiple cache ports, multiple cycle hit times, and cache port efficiency enhancements, such as load all and line buffer, to find the organization that provides the best processor performance. Processor performance is measured as the execution time of a dynamic superscalar processor running realistic benchmarks that include operating system references. Compared with a single cache port without enhancements, we find that two cache ports increase processor performance by 25 percent. With the addition of line buffer and load all to a single-ported cache, the processor achieves 91 percent of the performance of the same processor containing a cache with two ports. When the processor is not limited to a single cache port, the results show that a large dual-ported multicycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and using a line buffer provide the bandwidth needed by a dynamic superscalar processor. The line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.
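
The two headline figures in the abstract can be put on a common scale with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that both percentages refer to the same processor and benchmark average, and that performance is the reciprocal of execution time. The function names and the normalization of the unenhanced single-ported cache to 1.0 are hypothetical choices made for this example.

```python
# Illustrative arithmetic only. Assumptions (not stated in the abstract):
#  - the 25 percent and 91 percent figures share one baseline and workload mix
#  - performance is the reciprocal of execution time

def relative_performance(two_port_gain=0.25, enhanced_fraction=0.91):
    """Return performance normalized to an unenhanced single-ported cache."""
    baseline = 1.0                                # single port, no enhancements
    two_port = baseline * (1.0 + two_port_gain)   # "two ports increase performance by 25 percent"
    enhanced_single = two_port * enhanced_fraction  # "91 percent of the two-port performance"
    return {
        "baseline": baseline,
        "two_port": two_port,
        "enhanced_single_port": enhanced_single,
    }

def relative_execution_time(performance):
    """Execution time relative to the baseline, assuming time = 1 / performance."""
    return 1.0 / performance

perf = relative_performance()
for name, value in perf.items():
    print(f"{name}: performance {value:.4f}, "
          f"execution time {relative_execution_time(value):.4f}")
```

Under these assumptions, a single-ported cache with line buffer and load all lands at roughly 1.14 times the baseline performance, compared with 1.25 times for the two-ported cache, so the enhancements recover a bit more than half of the gain from adding a second port.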

Index Terms:
Dynamic superscalar, banked cache, memory bandwidth, dual-ported cache, SPEC95.
Citation:
Kenneth M. Wilson, Kunle Olukotun, "High Bandwidth On-Chip Cache Design," IEEE Transactions on Computers, vol. 50, no. 4, pp. 292-307, April 2001, doi:10.1109/12.919276