Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework
April 2003 (vol. 14, no. 4), pp. 337-354

Abstract—The performance of applications on large shared-memory multiprocessors with coherent caches depends on the interaction between the granularity of data sharing, the size of the coherence unit, and the spatial locality exhibited by the applications, in addition to the amount of parallelism in the applications. Large coherence units are helpful in exploiting spatial locality, but worsen the effects of false sharing. A mathematical framework that allows a clean description of the relationship between spatial locality and false sharing is derived in this paper. First, a technique to identify a severe form of multiple-writer false sharing is presented. The importance of the interaction between optimization techniques aimed at enhancing locality and the techniques oriented toward reducing false sharing is then demonstrated. Given the conflicting requirements, a compiler-based approach to this problem holds promise. This paper investigates the use of data transformations in addressing spatial locality and false sharing, and derives an approach that balances the impact of the two. Experimental results demonstrate that such a balanced approach outperforms those approaches that consider only one of these two issues. On an eight-processor SGI/Cray Origin 2000 multiprocessor, our approach brings an additional 9 percent improvement over a powerful locality optimization technique that uses both loop and data transformations. Also, the presented approach obtains an additional 19 percent improvement over an optimization technique that is oriented specifically toward reducing false sharing. This study also reveals that, in addition to reducing synchronization costs and improving the memory subsystem performance, obtaining large granularity parallelism is helpful in balancing the effects of enhancing locality and reducing false sharing, rendering them compatible.

Index Terms:
Data reuse, cache locality, false sharing, loop and memory layout transformations, shared-memory multiprocessors.
Citation:
Mahmut Kandemir, Alok Choudhary, J. Ramanujam, Prith Banerjee, "Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 4, pp. 337-354, April 2003, doi:10.1109/TPDS.2003.1195407