An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing
May 1998 (vol. 47 no. 5)
pp. 527-543

Abstract: When parallel programs run on multiprocessors with private caches, the same set of data may be repeatedly used and modified by different threads. Such data sharing often causes cache thrashing, which degrades memory performance. This paper presents and evaluates a loop restructuring method that reduces or even eliminates cache thrashing caused by true data sharing in nested parallel loops. The method relies on a compiler analysis that applies linear algebra and number theory to the subscript expressions of array references. Because the method is simple, it can be implemented efficiently in any parallel compiler. Experimental results show significant performance improvements over existing static and dynamic scheduling methods.
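
The kind of thrashing the abstract describes can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It assumes a hypothetical loop nest (a serial outer loop `t` around a parallel inner loop `j` computing `A[t+1][j] = f(A[t][j+1])`) and two hypothetical iteration-to-processor mappings, and it counts how often a processor reads an element last written by a different processor. All function names are invented for this example.

```python
def count_remote_reads(n_outer, n_inner, n_procs, assign):
    """Count reads of elements whose last writer was another processor.

    Models: for t (serial): doall j: A[t+1][j] = f(A[t][j+1]).
    `assign(t, j, n_procs)` is a hypothetical mapping from an
    iteration (t, j) to the processor that executes it.
    """
    last_writer = {}  # (row, col) -> processor that wrote the element
    remote = 0
    for t in range(n_outer):
        for j in range(n_inner):
            proc = assign(t, j, n_procs)
            src = (t, j + 1)  # element read by iteration (t, j)
            if src in last_writer and last_writer[src] != proc:
                remote += 1  # the cache line must move between caches
            last_writer[(t + 1, j)] = proc  # element written by (t, j)
    return remote


def cyclic(t, j, n_procs):
    # Plain cyclic schedule: the mapping ignores the outer iteration.
    return j % n_procs


def shifted(t, j, n_procs):
    # Mapping shifted by the outer index, so the iteration that reads
    # an element runs on the processor that wrote it one step earlier.
    return (j + t) % n_procs


print(count_remote_reads(3, 8, 4, cyclic))   # 14
print(count_remote_reads(3, 8, 4, shifted))  # 0
```

With the cyclic mapping, every element produced in one outer iteration is consumed by a different processor in the next one. The shifted mapping keeps each shared element on a single processor for this particular subscript pattern. Finding such a mapping from the subscript expressions is the sort of task a compiler analysis would have to perform.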

Index Terms:
Multiprocessors, cache thrashing, true data sharing, parallel threads, loop transformations, parallelizing compilers.
Citation:
Guohua Jin, Zhiyuan Li, Fujie Chen, "An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing," IEEE Transactions on Computers, vol. 47, no. 5, pp. 527-543, May 1998, doi:10.1109/12.677228