Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems
September 2009 (vol. 20 no. 9)
pp. 1309-1324
Jaejin Lee, Seoul National University, Seoul
Changhee Jung, Georgia Institute of Technology, Atlanta
Daeseob Lim, Seoul National University, Seoul
Yan Solihin, North Carolina State University, Raleigh
This paper presents a helper thread prefetching scheme designed for loosely coupled processors, such as those in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have the advantage that the application and helper threads do not contend for resources such as the processor and the L1 cache, which preserves the speed of the application. However, interprocessor communication is expensive in such systems, and we present techniques to alleviate this cost. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead of the application thread the helper thread can execute, which we found to be important for ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our scheme by simulating a standard unmodified CMP system and an intelligent memory system in which a simple processor in memory executes the helper thread. On nine memory-intensive applications, with the memory processor in DRAM, our scheme achieves an average speedup of 1.25. Moreover, it works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. On a real CMP system with an L2 cache shared between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over hardware L2 prefetching alone for the subset of applications with high L2 cache misses per cycle.
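
The abstract does not give the details of the synchronization mechanism, so the following is only an illustrative sketch of the general idea of bounding a helper thread's run-ahead distance over a loop. It is not the authors' implementation. The function name `run_with_helper`, the parameter `max_ahead`, and the catch-up rule (a lagging helper skips iterations the application has already finished) are assumptions made for this example. Python threads stand in for the two processors, and appending to `touched` stands in for issuing a prefetch.

```python
import threading


def run_with_helper(data, max_ahead=4):
    """Sum `data` in an application thread while a helper thread runs ahead.

    The helper may start iteration i only while i - app_iter <= max_ahead,
    where app_iter is the number of iterations the application has finished.
    Returns (total, max_lead_observed, touched_indices).
    """
    n = len(data)
    cond = threading.Condition()
    state = {"app_iter": 0, "max_lead": 0, "total": 0}
    touched = []  # stand-in for the addresses a real helper would prefetch

    def helper():
        i = 0
        while i < n:
            with cond:
                # Stall while the helper is too far ahead of the application.
                while i - state["app_iter"] > max_ahead:
                    cond.wait()
                # Assumed catch-up rule: skip work the application already did.
                if i < state["app_iter"]:
                    i = state["app_iter"]
                    continue
                lead = i - state["app_iter"]
                state["max_lead"] = max(state["max_lead"], lead)
            touched.append(i)  # a real helper would issue a prefetch here
            i += 1

    def application():
        total = 0
        for i in range(n):
            total += data[i]
            with cond:
                # Publish progress so a stalled helper can resume.
                state["app_iter"] = i + 1
                cond.notify_all()
        state["total"] = total

    threads = [threading.Thread(target=helper),
               threading.Thread(target=application)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"], state["max_lead"], touched
```

In this sketch the application thread never waits on the helper, which mirrors the abstract's goal of preserving application speed. Only the helper stalls or skips. A small `max_ahead` keeps the helper close to the application, and a larger value lets it run further ahead at the risk of touching data too early.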

Index Terms:
Helper thread, prefetching, chip multiprocessors, processing-in-memory system.
Jaejin Lee, Changhee Jung, Daeseob Lim, Yan Solihin, "Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 9, pp. 1309-1324, Sept. 2009, doi:10.1109/TPDS.2008.224