A Decoupled Predictor-Directed Stream Prefetching Architecture
March 2003 (vol. 52 no. 3)
pp. 260-276

Abstract: Data prefetching is an effective method for reducing the effect of load latency in modern processors. One form of hardware-based data prefetching, the stream buffer, has been shown to be particularly effective because it can detect data streams and run ahead of them, prefetching as it goes. In the past, however, the applicability of streaming was limited to stride-intensive code. In this paper, we propose Predictor-Directed Stream Buffers (PSB), which allow the stream buffer to follow a general address prediction stream instead of a fixed stride. A general address prediction stream complicates the allocation of both stream buffer and memory resources, because its predictions are less reliable than those of prior sequential next-line and stride-based stream buffer implementations. To address this, we examine confidence-based techniques to guide the allocation and prioritization of stream buffers and their prefetch requests. On a benchmark suite heavy in pointer-based applications, PSB provides a 23 percent speedup on average over the best previous stream buffer implementation and an improvement of 75 percent over using no prefetching at all.
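
The contrast between a fixed-stride stream and a predictor-directed stream can be illustrated with a small software model. The sketch below is not the hardware design evaluated in the paper; the abstract does not say which address predictor or confidence scheme is used, so the first-order Markov table, the 2-bit saturating counter, the threshold value, and all names (StridePredictor, MarkovPredictor, run_ahead) are assumptions chosen only for illustration.

```python
from typing import Dict, List, Optional


class StridePredictor:
    """Toy stride stream: next address = current address + fixed stride."""

    def __init__(self, stride: int) -> None:
        self.stride = stride

    def train(self, prev: int, cur: int) -> None:
        pass  # fixed stride in this toy model, nothing to learn

    def predict(self, addr: int) -> Optional[int]:
        return addr + self.stride


class MarkovPredictor:
    """Hypothetical first-order table: address -> [next address, confidence]."""

    def __init__(self, threshold: int = 2) -> None:
        self.table: Dict[int, List[int]] = {}
        self.threshold = threshold  # assumed confidence needed to prefetch

    def train(self, prev: int, cur: int) -> None:
        entry = self.table.get(prev)
        if entry is None:
            self.table[prev] = [cur, 1]
        elif entry[0] == cur:
            entry[1] = min(entry[1] + 1, 3)  # saturate at 3 (2-bit counter)
        else:
            entry[1] -= 1
            if entry[1] <= 0:
                self.table[prev] = [cur, 1]  # replace a distrusted entry

    def predict(self, addr: int) -> Optional[int]:
        entry = self.table.get(addr)
        if entry is None or entry[1] < self.threshold:
            return None  # low confidence: issue no prefetch
        return entry[0]


def run_ahead(predictor, start: int, depth: int) -> List[int]:
    """Follow the predictor from `start`, collecting up to `depth` prefetches."""
    prefetches: List[int] = []
    addr = start
    for _ in range(depth):
        nxt = predictor.predict(addr)
        if nxt is None:
            break  # stream stops when the predictor is not confident
        prefetches.append(nxt)
        addr = nxt
    return prefetches


# A made-up pointer chain whose nodes are not laid out at a constant stride.
chain = [0x1000, 0x5040, 0x2A80, 0x9100]

markov = MarkovPredictor()
for _ in range(2):  # two traversals raise each entry's confidence to 2
    for prev, cur in zip(chain, chain[1:]):
        markov.train(prev, cur)

stride = StridePredictor(stride=64)

markov_stream = run_ahead(markov, chain[0], depth=3)
stride_stream = run_ahead(stride, chain[0], depth=3)

useful = set(chain[1:])
print("markov hits:", len(useful & set(markov_stream)))
print("stride hits:", len(useful & set(stride_stream)))
```

In this toy trace the stride stream runs ahead along consecutive 64-byte blocks that the pointer chain never touches, while the predictor-directed stream follows the chain itself once its table entries are trusted. The confidence threshold plays the role the abstract assigns to confidence-based techniques: a stream that cannot produce a confident next address stops consuming prefetch bandwidth.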

Index Terms:
Data prefetching, stream buffers, address prediction.
Citation:
Suleyman Sair, Timothy Sherwood, Brad Calder, "A Decoupled Predictor-Directed Stream Prefetching Architecture," IEEE Transactions on Computers, vol. 52, no. 3, pp. 260-276, March 2003, doi:10.1109/TC.2003.1183943