Correlation Prefetching with a User-Level Memory Thread
June 2003 (vol. 14 no. 6)
pp. 563-580

Abstract—This paper proposes using a User-Level Memory Thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
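As a rough illustration of what it means to keep the correlation table as a software data structure, the sketch below implements conventional pair-based correlation prefetching: each row records the most recent successors of a miss address, and a new miss on that address returns those successors as prefetch candidates. This is not the new table design or algorithm the abstract refers to, whose details are not given here. The class name `CorrelationTable`, the method `observe_miss`, and the parameters `num_rows` and `num_succ` are all hypothetical names chosen for this example.

```python
from collections import OrderedDict


class CorrelationTable:
    """Pair-based correlation table (illustrative sketch, not the paper's design)."""

    def __init__(self, num_rows=1024, num_succ=2):
        self.num_rows = num_rows    # assumed cap on table rows
        self.num_succ = num_succ    # assumed successors kept per row
        self.rows = OrderedDict()   # miss address -> successors, most recent first
        self.last_miss = None

    def observe_miss(self, addr):
        """Record a miss on addr and return the addresses to prefetch."""
        # Learning step: addr is the immediate successor of the previous miss.
        if self.last_miss is not None:
            succ = self.rows.setdefault(self.last_miss, [])
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)             # keep most recent successor first
            del succ[self.num_succ:]         # drop the oldest successors
            self.rows.move_to_end(self.last_miss)
            if len(self.rows) > self.num_rows:
                self.rows.popitem(last=False)  # evict least recently updated row
        self.last_miss = addr
        # Prefetch step: look up what followed addr in the past.
        return list(self.rows.get(addr, []))


table = CorrelationTable(num_rows=8, num_succ=2)
# An irregular but repeating miss stream, as in a pointer-chasing loop.
prefetches = [table.observe_miss(a) for a in [10, 20, 30, 10, 20, 30]]
print(prefetches)  # [[], [], [], [20], [30], [10]]
```

The first pass over the stream only trains the table; from the second pass on, each miss yields the address that followed it last time. Because the table is ordinary data in memory, changing `num_succ`, the replacement policy, or the lookup logic is a software change, which is the kind of per-application customization the abstract describes.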

Index Terms:
Prefetching, correlation prefetching, memory-side prefetching, helper threads, intelligent memory architecture, processing-in-memory, heterogeneous system.
Yan Solihin, Jaejin Lee, Josep Torrellas, "Correlation Prefetching with a User-Level Memory Thread," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 6, pp. 563-580, June 2003, doi:10.1109/TPDS.2003.1206504