Effective Hardware-Based Data Prefetching for High-Performance Processors
May 1995 (vol. 44 no. 5)
pp. 609-623

Abstract—Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a Reference Prediction Table (RPT) organized as an instruction cache. The three designs differ mainly in the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead program counter that ideally stays one memory latency time ahead of the real program counter and that is used as the control mechanism to generate the prefetches. Finally, the correlated scheme uses a more sophisticated design to detect patterns across loop levels.

These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regular caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one from a cost-performance standpoint.
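The basic scheme described in the abstract can be sketched in simulation. The following is a minimal, hedged sketch of a Reference Prediction Table keyed by the address of the load/store instruction: each entry holds the previous data address, a stride, and a small state machine that must reach a "steady" state before a prefetch (one iteration ahead) is issued. The four-state transition policy and field names are simplifications for illustration; table size, associativity, and replacement are omitted.

```python
# Hedged sketch of the basic RPT prefetching scheme from the abstract.
# Simplifications: unbounded table, no replacement, exact FSM details
# may differ from the published design.

INITIAL, TRANSIENT, STEADY, NO_PRED = range(4)

class RPT:
    def __init__(self):
        self.table = {}  # keyed by load/store instruction address (PC)

    def access(self, pc, addr):
        """Record one data access; return a prefetch address or None."""
        entry = self.table.get(pc)
        if entry is None:
            # first time this instruction is seen: allocate an entry
            self.table[pc] = {"prev": addr, "stride": 0, "state": INITIAL}
            return None
        stride = addr - entry["prev"]
        correct = (stride == entry["stride"])
        # state transitions: two consecutive correct strides reach STEADY
        if entry["state"] == INITIAL:
            entry["state"] = STEADY if correct else TRANSIENT
        elif entry["state"] == TRANSIENT:
            entry["state"] = STEADY if correct else NO_PRED
        elif entry["state"] == STEADY:
            if not correct:
                entry["state"] = INITIAL
        else:  # NO_PRED
            if correct:
                entry["state"] = TRANSIENT
        if not correct:
            entry["stride"] = stride
        entry["prev"] = addr
        # basic scheme: prefetch one iteration ahead once the pattern holds
        if entry["state"] == STEADY:
            return addr + entry["stride"]
        return None
```

For a load walking an array with an 8-byte stride, the first two accesses train the entry (no prefetch) and every subsequent access triggers a prefetch of the next element. The lookahead variation would instead issue these prefetches under control of a lookahead program counter running ahead of the real one.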

[1] J.-L. Baer and T.-F. Chen, "An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty," Proc. Supercomputing '91, pp. 176-186, 1991.
[2] J. Baer and W. Wang, "Multilevel Cache Hierarchies: Organizations, Protocols, and Performance," J. Parallel and Distributed Computing, vol. 6, pp. 451-476, 1989.
[3] T. Ball and J.R. Larus, "Branch Prediction for Free," Technical Report #1137, Computer Science Dept., Univ. of Wis.-Madison, Feb. 1993.
[4] T.-F. Chen, "Data Prefetching for High-Performance Processors," PhD thesis, Dept. of Computer Science and Engineering, Univ. of Wash., 1993.
[5] T.-F. Chen and J.-L. Baer, "Reducing Memory Latency via Non-Blocking and Prefetching Caches," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), pp. 51-61, Oct. 1992.
[6] W.Y. Chen, S.A. Mahlke, P.P. Chang, and W.W. Hwu, "Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching," Proc. 24th Ann. ACM/IEEE Int'l Symp. Microarchitecture, pp. 69-73, 1991.
[7] J. Fu and J.H. Patel, "Data Prefetching in Multiprocessor Vector Cache Memories," Proc. 18th Int'l Symp. Computer Architecture, pp. 54-63, 1991.
[8] J. Fu, J.H. Patel, and B.L. Janssens, "Stride Directed Prefetching in Scalar Processors," Proc. 25th Ann. Int'l Symp. Microarchitecture, pp. 102-110, 1992.
[9] D. Gannon, W. Jalby, and K. Gallivan, "Strategies for Cache and Local Memory Management by Global Program Transformations," J. Parallel and Distributed Computing, vol. 5, no. 5, pp. 587-616, Oct. 1988.
[10] E. Gornish, E. Granston, and A. Veidenbaum, "Compiler Directed Data Prefetching in Multiprocessors with Memory Hierarchies," Proc. 1990 Int'l Conf. Supercomputing, pp. 354-368, 1990.
[11] N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Prefetch Buffers," Proc. 17th Int'l Symp. Computer Architecture, pp. 364-373, May 1990.
[12] A. Klaiber and H. Levy, "An Architecture for Software-Controlled Data Prefetching," Proc. 18th Ann. Int'l Symp. Computer Architecture, pp. 43-53, May 1991.
[13] D. Kroft, "Lockup-Free Instruction Fetch/Prefetch Cache Organization," Proc. Eighth Int'l Symp. Computer Architecture, pp. 81-87, 1981.
[14] J.K.F. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," Computer, pp. 6-22, Jan. 1984.
[15] R.L. Lee, P.-C. Yew, and D.H. Lawrie, "Data Prefetching in Shared Memory Multiprocessors," Proc. Int'l Conf. Parallel Processing, pp. 28-31, 1987.
[16] R. Lee, P.-C. Yew, and D. Lawrie, "Multiprocessor Cache Design Considerations," Proc. Int'l Symp. Computer Architecture, pp. 253-262, 1987.
[17] T. Mowry and A. Gupta, "Tolerating Latency through Software-Controlled Prefetching in Scalable Shared-Memory Multiprocessors," J. Parallel and Distributed Computing, vol. 12, pp. 87-106, June 1991.
[18] T.C. Mowry, M.S. Lam, and A. Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, Oct. 1992.
[19] S. Pan, K. So, and J. Rahmeh, "Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 76-84, Oct. 1992.
[20] C.H. Perleberg and A.J. Smith, "Branch Target Buffer Design and Optimization," Technical Report UCB/CSD 89/552, Univ. of Calif., Berkeley, Dec. 1989.
[21] A.K. Porterfield, "Software Methods for Improvement of Cache Performance on Supercomputer Applications," Technical Report COMP TR 89-93, Rice Univ., May 1989.
[22] S. Przybylski, "The Performance Impact of Block Sizes and Fetch Strategies," Proc. 17th Ann. Int'l Symp. Computer Architecture, pp. 160-169, May 1990.
[23] I. Sklenar, "Prefetch Unit for Vector Operations on Scalar Computers," Computer Architecture News, vol. 20, pp. 31-37, Sept. 1992.
[24] A.J. Smith, "Cache Memories," ACM Computing Surveys, vol. 14, pp. 473-540, 1982.
[25] J.E. Smith, "Decoupled Access/Execute Computer Architectures," Proc. Ninth Ann. Int'l Symp. Computer Architecture, pp. 112-119, 1982.
[26] T.Y. Yeh and Y.N. Patt, "Alternative Implementations of Two-Level Adaptive Training Branch Prediction," Proc. 19th Ann. Int'l Symp. Computer Architecture, pp. 124-134, 1992.

Index Terms:
Prefetching, hardware function unit, reference prediction, branch prediction, data cache, cycle-by-cycle simulations.
Jean-Loup Baer, Tien-Fu Chen, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Transactions on Computers, vol. 44, no. 5, pp. 609-623, May 1995, doi:10.1109/12.381947