Design and Optimization of Large Size and Low Overhead Off-Chip Caches
July 2004 (vol. 53, no. 7)
pp. 843-855

Abstract—Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches face two issues as these applications demand increasingly large caches. First, an SRAM cache is limited in size by the low density and high cost of SRAM and thus cannot hold the working sets of many memory-intensive applications. Second, since the tag-checking overhead of large caches is nontrivial, the presence of an L3 cache increases the cache miss penalty and may even harm the performance of some memory-intensive applications. To address these two issues, we present a new memory hierarchy design that uses cached DRAM to construct a large size and low overhead off-chip cache. The high-density DRAM portion of the cached DRAM can hold large working sets, while the small SRAM portion exploits the spatial locality in L2 miss streams to reduce the access latency. The L3 tag array is placed off-chip with the data array, minimizing the L3 cache's area overhead on the processor, while a small tag cache is placed on-chip, effectively removing the off-chip tag access overhead. A prediction technique accurately predicts the hit/miss status of an access to the cached DRAM, further reducing the access latency. Using execution-driven simulations of a 2GHz, 4-way issue processor running 11 memory-intensive programs from the SPEC 2000 benchmark suite, we show that a system whose off-chip cache is a cached DRAM comprising 64MB of DRAM and a 128KB on-chip SRAM cache outperforms the same system with an 8MB SRAM L3 off-chip cache by up to 78 percent in total execution time. The average speedup of the system with the cached-DRAM off-chip cache is 25 percent over the system with the L3 SRAM cache.
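The on-chip tag-cache mechanism described in the abstract can be sketched as a toy model: the off-chip cache's tag array lives off-chip with the data, and a small on-chip structure caches recently used tags so that most tag checks avoid an off-chip access. This is an illustrative sketch only, not the paper's design; the direct-mapped organization, sizes, and FIFO eviction below are all assumptions chosen for brevity.

```python
# Toy model (assumed parameters, not from the paper): a direct-mapped
# off-chip cache whose tag array is stored off-chip, plus a small
# on-chip "tag cache" that avoids the off-chip tag lookup for
# recently accessed sets.

BLOCK = 64           # cache block size in bytes (assumed)
NUM_SETS = 1024      # number of off-chip cache sets (assumed)
TAG_CACHE_CAP = 64   # on-chip tag cache capacity in entries (assumed)

off_chip_tags = [None] * NUM_SETS  # tag array kept off-chip with the data
tag_cache = {}                     # small on-chip map: set index -> tag

def split(addr):
    """Decompose a byte address into (set index, tag)."""
    block = addr // BLOCK
    return block % NUM_SETS, block // NUM_SETS

def access(addr):
    """Return (hit, used_off_chip_tag_lookup) for one off-chip cache access."""
    idx, tag = split(addr)
    if idx in tag_cache:
        # On-chip tag check: hit/miss is known without touching
        # the off-chip tag array.
        hit = tag_cache[idx] == tag
        off_chip_lookup = False
    else:
        # Tag must be read from the off-chip tag array.
        hit = off_chip_tags[idx] == tag
        off_chip_lookup = True
    if not hit:
        off_chip_tags[idx] = tag  # fill the off-chip cache on a miss
    # Refresh the on-chip tag cache (simple FIFO eviction for the sketch).
    tag_cache[idx] = off_chip_tags[idx]
    if len(tag_cache) > TAG_CACHE_CAP:
        tag_cache.pop(next(iter(tag_cache)))
    return hit, off_chip_lookup
```

In this model, a repeated access to the same set resolves entirely on-chip, and even a conflict miss to a cached set is detected without an off-chip tag read, which is the latency saving the abstract attributes to the on-chip tag cache.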

[1] B. Abali, H. Franke, D.E. Poff, X. Shen, and T.B. Smith, “Performance of Hardware Compressed Main Memory,” Proc. Int'l Symp. High Performance Computer Architecture (HPCA '01), pp. 73-81, Jan. 2001.
[2] M.M. Annavaram, J.M. Patel, and E.S. Davidson, “Data Prefetching by Dependence Graph Precomputation,” Proc. 28th Ann. Int'l Symp. Computer Architecture, pp. 52-61, 2001.
[3] J.-L. Baer and W.-H. Wang, “On the Inclusion Property for Multi-Level Cache Hierarchies,” Proc. 15th Ann. Int'l Symp. Computer Architecture, pp. 73-80, 1988.
[4] R. Balasubramonian, S. Dwarkadas, and D.H. Albonesi, “Dynamically Allocating Processor Resources between Nearby and Distant ILP,” Proc. 28th Ann. Int'l Symp. Computer Architecture, pp. 26-37, 2001.
[5] D. Burger, “System-Level Implications of Processor-Memory Integration,” Technical Report CS-TR-1997-1349, Univ. of Wisconsin, Madison, June 1997.
[6] M. Cox, N. Bhandari, and M. Shantz, “Multi-Level Texture Caching for 3D Graphics Hardware,” Proc. 25th Ann. Int'l Symp. Computer Architecture, pp. 86-97, 1998.
[7] D. Culler, J.P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. San Mateo, Calif.: Morgan Kaufmann, 1999.
[8] V. Cuppu, B. Jacob, B. Davis, and T. Mudge, “A Performance Comparison of Contemporary DRAM Architectures,” Proc. 26th Ann. Int'l Symp. Computer Architecture, pp. 222-233, May 1999.
[9] Z. Cvetanovic and D.D. Donaldson, “AlphaServer 4100 Performance Characterization,” Digital Technical J., vol. 8, no. 4, pp. 3-20, 1996.
[10] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C.W. Kang, I. Kim, and G. Daglikoca, “The Architecture of the DIVA Processing-in-Memory Chip,” Proc. 16th Int'l Conf. Supercomputing, pp. 14-25, 2002.
[11] Enhanced Memory Systems Inc., 64 Mbit ESDRAM Components, Product Brief r1.8, 2000.
[12] B. Gaeke, P. Husbands, X. Li, L. Oliker, K. Yelick, and R. Biswas, “Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines,” Proc. 16th Int'l Parallel and Distributed Processing Symp., p. 30, 2002.
[13] Z.S. Hakura and A. Gupta, “The Design and Analysis of a Cache Architecture for Texture Mapping,” Proc. 24th Int'l Symp. Computer Architecture, pp. 108-120, June 1997.
[14] C.A. Hart, “CDRAM in a Unified Memory Architecture,” Proc. CompCon '94, pp. 261-266, 1994.
[15] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima, “The Cache DRAM Architecture: A DRAM with an On-Chip Cache Memory,” IEEE Micro, vol. 10, no. 2, pp. 14-25, Apr. 1990.
[16] W.-C. Hsu and J.E. Smith, “Performance of Cached DRAM Organizations in Vector Supercomputers,” Proc. 20th Ann. Int'l Symp. Computer Architecture (ISCA '93), pp. 327-336, May 1993.
[17] Y. Hu and Q. Yang, “DCD-Disk Caching Disk: A New Approach for Boosting I/O Performance,” Proc. 23rd Ann. Int'l Symp. Computer Architecture, pp. 169-178, 1996.
[18] IBM, POWER4 System Architecture, white paper, Oct. 2001.
[19] F. Jones et al., “A New Era of Fast Dynamic RAMs,” IEEE Spectrum, pp. 43-49, Oct. 1992.
[20] N.P. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Prefetch Buffers,” Proc. 17th Int'l Symp. Computer Architecture, pp. 364-373, May 1990.
[21] P. Keltcher, S. Richardson, and S. Siu, “An Equal Area Comparison of Embedded DRAM and SRAM Memory Architectures for a Chip Multiprocessor,” Technical Report HPL-2000-53, HP Laboratories, Palo Alto, Calif., Apr. 2000.
[22] G. Kirsch, “Active Memory: Micron's Yukon,” Proc. Int'l Parallel and Distributed Processing Symp., p. 89b, 2003.
[23] R.P. Koganti and G. Kedem, “WCDRAM: A Fully Associative Integrated Cached-DRAM with Wide Cache Lines,” Proc. Fourth IEEE Workshop Architecture and Implementation of High Performance Comm. Systems, 1997.
[24] C. Kozyrakis, “A Media-Enhanced Vector Architecture for Embedded Memory Systems,” Technical Report CSD-99-1059, Univ. of California, Berkeley, 1999.
[25] C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, and D. Patterson, “Vector IRAM: A Media-Enhanced Vector Processor with Embedded DRAM,” Proc. Hot Chips 12, 2000.
[26] H.-H. Lee, G. Tyson, and M. Farrens, “Eager Writeback: A Technique for Improving Bandwidth Utilization,” Proc. 33rd IEEE/ACM Int'l Symp. Microarchitecture, pp. 11-21, 2000.
[27] W.-F. Lin, S.K. Reinhardt, and D. Burger, “Reducing DRAM Latencies with a Highly Integrated Memory Hierarchy Design,” Proc. Seventh Symp. High-Performance Computer Architecture, pp. 301-312, Jan. 2001.
[28] C.-K. Luk, “Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors,” Proc. 28th Ann. Int'l Symp. Computer Architecture, pp. 40-51, 2001.
[29] S. Palacharla and R.E. Kessler, “Evaluating Stream Buffers as a Secondary Cache Replacement,” Proc. 21st Ann. Int'l Symp. Computer Architecture, pp. 24-33, Apr. 1994.
[30] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A Case for Intelligent DRAM: IRAM,” IEEE Micro, Apr. 1997.
[31] J.-K. Peir, W.W. Hsu, and A.J. Smith, Functional Implementation Techniques for CPU Cache Memories IEEE Trans. Computers, vol. 48, no. 2, pp. 100-110, Feb. 1999.
[32] J.-K. Peir, S.-C. Lai, S.-L. Lu, J. Stark, and K. Lai, “Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching,” Proc. 16th Int'l Conf. Supercomputing (ICS '02), pp. 189-198, 2002.
[33] W.A. Samaras, N. Cherukuri, and S. Venkataraman, “The IA-64 Itanium Processor Cartridge,” IEEE Micro, vol. 21, no. 1, pp. 82-89, Jan./Feb. 2001.
[34] A. Saulsbury, F. Pong, and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration,” Proc. 23rd Ann. Int'l Symp. Computer Architecture, pp. 90-103, 1996.
[35] A. Seznec, “Decoupled Sectored Caches: Conciliating Low Tag Implementation Cost and Low Miss Ratio,” Proc. 21st Ann. Int'l Symp. Computer Architecture, pp. 384-393, 1994.
[36] T. Sherwood and B. Calder, “A Decoupled Predictor-Directed Stream Prefetching Architecture,” IEEE Trans. Computers, vol. 52, no. 5, Mar. 2003.
[37] P. Shivakumar and N.P. Jouppi, “CACTI 3.0: An Integrated Cache Timing, Power, and Area Model,” Technical Report, Compaq Western Research Lab, Aug. 2001.
[38] J.E. Smith and J.R. Goodman, “A Study of Instruction Cache Organization and Replacement Policies,” Proc. 10th Ann. Int'l Symp. Computer Architecture, pp. 132-137, 1983.
[39] Standard Performance Evaluation Corp., http://www.spec.org, 2004.
[40] B. Tremaine, T.B. Smith, M. Wazlowski, D. Har, K. Mak, and S. Arramreddy, “Pinnacle: IBM MXT in a Memory Controller Chip,” IEEE Micro, vol. 22, no. 2, pp. 56-68, Mar./Apr. 2001.
[41] G. Tyson et al., “A Modified Approach to Data Cache Management,” Proc. 28th Int'l Symp. Microarchitecture, pp. 93-103, 1995.
[42] C. Weaver, SPEC2000 binaries, http://www.simplescalar.org/spec2000.html, 2004.
[43] K.M. Wilson and K. Olukotun, “Designing High Bandwidth On-Chip Caches,” Proc. 24th Ann. Int'l Symp. Computer Architecture, pp. 121-132, 1997.
[44] W. Wong and J.-L. Baer, “DRAM On-Chip Caching,” Technical Report UW CSE 97-03-04, Univ. of Washington, Feb. 1997.
[45] W.A. Wong and J.-L. Baer, “Modified LRU Policies for Improving Second-Level Cache Behavior,” Proc. Sixth Int'l Symp. High-Performance Computer Architecture, pp. 49-60, 2000.
[46] T. Yamauchi, L. Hammond, and K. Olukotun, “A Single Chip Multiprocessor Integrated with High Density DRAM,” Technical Report CSL-TR-97-731, Computer Systems Laboratory, Stanford Univ., Aug. 1997.
[47] T.-Y. Yeh and Y.N. Patt, “Alternative Implementations of Two-Level Adaptive Training Branch Prediction,” Proc. 19th Ann. Int'l Symp. Computer Architecture, pp. 124-134, 1992.
[48] A. Yoaz et al., “Speculation Techniques for Improving Load Related Instruction Scheduling,” Proc. 26th Ann. Int'l Symp. Computer Architecture (ISCA '99), pp. 42-53, 1999.
[49] Z. Zhang, Z. Zhu, and X. Zhang, “A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality,” Proc. 33rd IEEE/ACM Int'l Symp. Microarchitecture, pp. 32-41, 2000.
[50] Z. Zhang, Z. Zhu, and X. Zhang, “Cached DRAM: A Simple and Effective Technique for Memory Access Latency Reduction on ILP Processors,” IEEE Micro, vol. 21, no. 4, pp. 22-32, July/Aug. 2001.
[51] Z. Zhu, Z. Zhang, and X. Zhang, “Fine-Grain Priority Scheduling on Multi-Channel Memory Systems,” Proc. Eighth Int'l Symp. High-Performance Computer Architecture, pp. 107-116, 2002.

Index Terms:
Cached DRAM, DRAM latency, memory hierarchy, memory-intensive applications, off-chip caches.
Zhao Zhang, Zhichun Zhu, Xiaodong Zhang, "Design and Optimization of Large Size and Low Overhead Off-Chip Caches," IEEE Transactions on Computers, vol. 53, no. 7, pp. 843-855, July 2004, doi:10.1109/TC.2004.27