CPU Cache Prefetching: Timing Evaluation of Hardware Implementations
May 1998 (vol. 47 no. 5)
pp. 509-526

Abstract: Prefetching into CPU caches has long been known to be effective in reducing the cache miss ratio, but known implementations of prefetching have been unsuccessful in improving CPU performance. The reasons are that prefetches interfere with normal cache operations by keeping the cache address and data ports, the memory bus, and the memory banks busy, and that a prefetch is not necessarily complete by the time the prefetched data is actually referenced. In this paper, we present extensive quantitative results of a detailed cycle-by-cycle trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters in order to determine whether and when hardware prefetching is useful. We find that, for prefetching to actually improve performance, the address array needs to be double ported and the data array needs to be either double ported or fully buffered. It is also very helpful for the bus to be very wide (e.g., 16 bytes), for bus transactions to be split, and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can significantly improve performance. For implementations without adequate hardware, prefetching often decreases performance.
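
The gap between miss ratio and actual cost can be illustrated with a toy trace-driven simulation. The sketch below is not the simulator used in the paper: it has no timing model at all, and the direct-mapped organization, the 64-line by 16-byte geometry, the prefetch-on-miss policy, and the two synthetic traces are assumptions chosen for illustration. It counts demand misses and line fetches from memory, using the fetch count as a rough stand-in for bus and memory-bank occupancy.

```python
def simulate(trace, num_lines=64, line_size=16, prefetch=False):
    """Run a byte-address trace through a direct-mapped cache.

    Returns (demand_misses, line_fetches). The geometry and the
    prefetch-on-miss policy are illustrative assumptions, not the
    configuration evaluated in the paper.
    """
    tags = [None] * num_lines  # line address currently held by each slot
    misses = 0
    fetches = 0

    def fill(line_addr):
        """Bring a line into the cache, counting a fetch only if it is absent."""
        nonlocal fetches
        slot = line_addr % num_lines
        if tags[slot] != line_addr:
            tags[slot] = line_addr
            fetches += 1

    for addr in trace:
        line_addr = addr // line_size
        if tags[line_addr % num_lines] != line_addr:
            misses += 1
            fill(line_addr)
            if prefetch:
                fill(line_addr + 1)  # also fetch the next sequential line
    return misses, fetches


# Hypothetical traces over a 4 KiB region.
sequential = list(range(0, 4096, 4))   # touches every line in order
strided = list(range(0, 4096, 32))     # touches every other line

print(simulate(sequential, prefetch=False))  # (256, 256)
print(simulate(sequential, prefetch=True))   # (128, 256)
print(simulate(strided, prefetch=False))     # (128, 128)
print(simulate(strided, prefetch=True))      # (128, 256)
```

On the sequential trace, prefetching halves the demand misses with no extra fetches. On the strided trace it removes no misses and doubles the memory traffic. A miss-ratio count alone hides that second cost, which is why a cycle-level model of ports, bus, and memory banks is needed to judge whether prefetching helps.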

Index Terms:
Cache memory, prefetching, timing model, cache prefetching, CPU architecture, memory system design, CPU cache memory.
John Tse, Alan Jay Smith, "CPU Cache Prefetching: Timing Evaluation of Hardware Implementations," IEEE Transactions on Computers, vol. 47, no. 5, pp. 509-526, May 1998, doi:10.1109/12.677225