Designing a Modern Memory Hierarchy with Hardware Prefetching
November 2001 (vol. 50 no. 11)
pp. 1202-1218

Abstract—In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that, even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half its time stalling for L2 misses. Our experimental analysis begins with an effort to tune our baseline memory system aggressively: incorporating optimizations to reduce DRAM row buffer misses, reordering miss accesses to reduce queuing delay, and adjusting the L2 block size to match each channel organization. We show that there is a large gap between the block sizes at which performance is best and at which miss rate is minimized. Using those results, we evaluate a hardware prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 65 percent speedup across 10 of the 26 SPEC2000 benchmarks, without degrading the performance of the others. With eight Rambus channels, these 10 benchmarks improve to within 10 percent of the performance of a perfect L2 cache.
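The abstract names three scheduling heuristics for the prefetch unit: issue prefetches only when a Rambus channel is idle, prefer pending prefetches that hit the currently open DRAM row, and install prefetched blocks with low replacement priority. The following minimal Python sketch illustrates how those three policies interact; the class names, the single-open-row channel model, and the toy LRU cache are illustrative assumptions, not the authors' simulator.

```python
from collections import deque

class Channel:
    """Simplified Rambus-like channel: one transfer at a time, one open row."""
    def __init__(self):
        self.busy_until = 0   # cycle when the current transfer completes
        self.open_row = None  # most recently accessed DRAM row

    def idle(self, now):
        return now >= self.busy_until

class ToyCache:
    """Tiny LRU list; index 0 is the LRU end, the last index is the MRU end."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = []

    def insert(self, addr, low_priority=False):
        if len(self.lines) >= self.capacity:
            self.lines.pop(0)                 # evict from the LRU end
        if low_priority:
            self.lines.insert(0, addr)        # prefetch lands at the LRU end
        else:
            self.lines.append(addr)           # demand fill lands at the MRU end

class PrefetchScheduler:
    def __init__(self, channel, row_of, transfer_cycles=16):
        self.channel = channel
        self.row_of = row_of                  # maps an address to its DRAM row
        self.queue = deque()                  # pending prefetch candidates
        self.transfer_cycles = transfer_cycles

    def enqueue(self, addr):
        self.queue.append(addr)

    def tick(self, now, cache):
        # Heuristic 1: issue prefetches only on an idle channel, so demand
        # misses never queue behind prefetch traffic.
        if not self.channel.idle(now) or not self.queue:
            return None
        # Heuristic 2: among pending prefetches, prefer one that hits the
        # open row buffer (no precharge/activate cost); else take the oldest.
        pick = next((a for a in self.queue
                     if self.row_of(a) == self.channel.open_row),
                    self.queue[0])
        self.queue.remove(pick)
        self.channel.open_row = self.row_of(pick)
        self.channel.busy_until = now + self.transfer_cycles
        # Heuristic 3: give the prefetched block low replacement priority,
        # so a useless prefetch is the first line evicted.
        cache.insert(pick, low_priority=True)
        return pick
```

For example, with row 7 open and prefetches queued for rows 0 and 7, the scheduler issues the row-7 candidate first, then refuses further issue until the channel drains.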

[1] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger, “Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures,” Proc. 27th Ann. Int'l Symp. Computer Architecture, pp. 248-259, June 2000.
[2] J.-L. Baer and T.-F. Chen, “An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty,” Proc. Supercomputing '91, pp. 176-186, 1991.
[3] D. Burger and T.M. Austin, “The SimpleScalar Tool Set, Version 2.0,” Technical Report 1342, Computer Sciences Dept., Univ. of Wisconsin, Madison, June 1997.
[4] D. Burger, J.R. Goodman, and A. Kägi, “Memory Bandwidth Limitations of Future Microprocessors,” Proc. 23rd Ann. Int'l Symp. Computer Architecture, pp. 79-90, 1996.
[5] J. Corbal, R. Espasa, and M. Valero, “Command Vector Memory Systems: High Performance at Low Cost,” Proc. 1998 Int'l Conf. Parallel Architectures and Compilation Techniques, pp. 68-77, Oct. 1998.
[6] R. Crisp, “Direct Rambus Technology: The New Main Memory Standard,” IEEE Micro, vol. 17, no. 6, pp. 18-28, Nov./Dec. 1997.
[7] V. Cuppu, B. Jacob, B. Davis, and T. Mudge, “A Performance Comparison of Contemporary DRAM Architectures,” Proc. 26th Ann. Int'l Symp. Computer Architecture, pp. 222-233, May 1999.
[8] F. Dahlgren, M. Dubois, and P. Stenstrom, “Sequential Hardware Prefetching in Shared Memory Multiprocessors,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 7, pp. 733-746, July 1995.
[9] G.C. Driscoll, J.J. Losq, T.R. Puzak, G.S. Rao, H.E. Sachar, and R.D. Villani, “Cache Miss Directory—A Means of Prefetching Cache Missed Lines,” IBM Technical Disclosure Bulletin, vol. 25, p. 1286, Aug. 1982.
[10] G.C. Driscoll, T.R. Puzak, H.E. Sachar, and R.D. Villani, “Staging Length Table—A Means of Minimizing Cache Memory Misses Using Variable Length Cache Lines,” IBM Technical Disclosure Bulletin, vol. 25, p. 1285, Aug. 1982.
[11] J.D. Gindele, “Buffer Block Prefetching Method,” IBM Technical Disclosure Bulletin, vol. 20, no. 2, pp. 696-697, July 1977.
[12] L. Gwennap, “Alpha 21364 to Ease Memory Bottleneck,” Microprocessor Report, pp. 12-15, 26 Oct. 1998.
[13] S.I. Hong, S.A. McKee, M.H. Salinas, R.H. Klenke, J.H. Aylor, and W.A. Wulf, “Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory,” Proc. Fifth Int'l Symp. High-Performance Computer Architecture, pp. 80-89, Jan. 1999.
[14] T.L. Johnson and W.W. Hwu, “Run-Time Adaptive Cache Hierarchy Management via Reference Analysis,” Proc. 24th Ann. Int'l Symp. Computer Architecture, pp. 315-326, June 1997.
[15] N.P. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Prefetch Buffers,” Proc. 17th Int'l Symp. Computer Architecture, pp. 364-373, May 1990.
[16] D. Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization,” Proc. Eighth Int'l Symp. Computer Architecture, pp. 81-87, 1981.
[17] S. Kumar and C. Wilkerson, “Exploiting Spatial Locality in Data Caches Using Spatial Footprints,” Proc. 25th Ann. Int'l Symp. Computer Architecture, July 1998.
[18] W.-F. Lin, S.K. Reinhardt, and D. Burger, “Reducing DRAM Latencies with a Highly Integrated Memory Hierarchy Design,” Proc. Seventh Symp. High-Performance Computer Architecture, pp. 301-312, Jan. 2001.
[19] W.-F. Lin, S.K. Reinhardt, D. Burger, and T.R. Puzak, “Filtering Superfluous Prefetches Using Density Vectors,” Proc. Int'l Conf. Computer Design, pp. 124-132, Sept. 2001.
[20] B.K. Mathew, S.A. McKee, J.B. Carter, and A. Davis, “Design of a Parallel Vector Access Unit for SDRAM Memory Systems,” Proc. Sixth Int'l Symp. High-Performance Computer Architecture, Jan. 2000.
[21] S.A. McKee and W.A. Wulf, “Access Ordering and Memory-Conscious Cache Utilization,” Proc. First Int'l Symp. High-Performance Computer Architecture, pp. 253-262, Jan. 1995.
[22] S. Palacharla and R.E. Kessler, “Evaluating Stream Buffers as a Secondary Cache Replacement,” Proc. 21st Ann. Int'l Symp. Computer Architecture, pp. 24-33, Apr. 1994.
[23] S. Przybylski, “The Performance Impact of Block Sizes and Fetch Strategies,” Proc. 17th Ann. Int'l Symp. Computer Architecture, pp. 160-169, May 1990.
[24] S. Rixner et al., “Memory Access Scheduling,” Proc. 27th Ann. Int'l Symp. Computer Architecture, pp. 128-138, June 2000.
[25] A.J. Smith, “Line (Block) Size Choice for CPU Cache Memories,” IEEE Trans. Computers, vol. 36, no. 9, pp. 1063-1075, Sept. 1987.
[26] A.J. Smith, “Cache Memories,” ACM Computing Surveys, vol. 14, pp. 473-540, 1982.
[27] G.S. Sohi, “Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers,” IEEE Trans. Computers, vol. 39, no. 3, pp. 349-359, 1990.
[28] O. Temam and Y. Jegou, “Using Virtual Lines to Enhance Locality Exploitation,” Proc. 1994 Int'l Conf. Supercomputing, pp. 344-353, July 1994.
[29] O. Temam, “Investigating Optimal Local Memory Performance,” Proc. Eighth Symp. Architectural Support for Programming Languages and Operating Systems, pp. 218-227, Oct. 1998.
[30] P. Van Vleet, E. Anderson, L. Brown, J.-L. Baer, and A. Karlin, “Pursuing the Performance Potential of Dynamic Cache Line Sizes,” Proc. 1999 Int'l Conf. Computer Design, pp. 528-537, Oct. 1999.
[31] W.A. Wong and J.-L. Baer, “DRAM Caching,” Technical Report 97-03-04, Dept. of Computer Science and Eng., Univ. of Washington, 1997.
[32] C. Zhang and S.A. McKee, “Hardware-Only Stream Prefetching and Dynamic Access Ordering,” Proc. 14th Int'l Conf. Supercomputing, May 2000.
[33] Z. Zhang, Z. Zhu, and X. Zhang, “A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality,” Proc. 33rd Int'l Symp. Microarchitecture, pp. 32-41, Dec. 2000.
[34] Z. Zhang and J. Torrellas, “Speeding Up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching,” Proc. 22nd Ann. Int'l Symp. Computer Architecture, pp. 188-199, June 1995.
[35] J.H. Zurawski, J.E. Murray, and P.J. Lemmon, “The Design and Verification of the Alphastation 600 5-Series Workstation,” Digital Technical J., vol. 7, no. 1, Aug. 1995.

Index Terms:
Prefetching, caches, memory bandwidth, spatial locality, memory system design, Rambus DRAM.
Wei-Fen Lin, Steven K. Reinhardt, Doug Burger, "Designing a Modern Memory Hierarchy with Hardware Prefetching," IEEE Transactions on Computers, vol. 50, no. 11, pp. 1202-1218, Nov. 2001, doi:10.1109/12.966495