False Sharing and Spatial Locality in Multiprocessor Caches
June 1994 (vol. 43 no. 6)
pp. 651-663

The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.
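
As an illustration of the false-sharing effect described above, the following is a minimal sketch, not the simulator or the layout optimization used in the paper. It assumes a simple write-invalidate protocol with infinitely large caches, and the names `count_misses`, `interleaved_updates`, and the 64-byte block size are hypothetical choices made for this example. It compares a packed layout, where each processor's word shares one cache block with the others, against a padded layout, where each word sits in its own block.

```python
def count_misses(trace, block_size):
    """Count misses under a simplified write-invalidate model.

    Assumptions (illustrative only): caches are infinite, so every miss is
    either a first reference or the result of an invalidation.
    `trace` is a list of (processor_id, byte_address, is_write) tuples.
    """
    cached = {}  # processor id -> set of block numbers valid in its cache
    misses = 0
    for proc, addr, is_write in trace:
        block = addr // block_size
        blocks = cached.setdefault(proc, set())
        if block not in blocks:
            misses += 1
            blocks.add(block)
        if is_write:
            # A write invalidates every other processor's copy of the block.
            for other, other_blocks in cached.items():
                if other != proc:
                    other_blocks.discard(block)
    return misses


def interleaved_updates(num_procs, rounds, stride):
    """Each processor repeatedly writes its own word, in interleaved order.

    Processor p owns the word at byte address p * stride.
    """
    return [(p, p * stride, True) for _ in range(rounds) for p in range(num_procs)]


BLOCK = 64  # hypothetical cache block size in bytes

# Packed: four 4-byte words fall into the same 64-byte block.
packed = count_misses(interleaved_updates(4, 100, 4), BLOCK)

# Padded: each word is placed in its own block.
padded = count_misses(interleaved_updates(4, 100, BLOCK), BLOCK)

print("packed layout misses:", packed)
print("padded layout misses:", padded)
```

In this toy model every write in the packed layout misses, because another processor has invalidated the block in between, even though no two processors ever touch the same word. The padded layout misses only on each processor's first reference. Padding is just one possible layout change, and it trades cache space for fewer invalidations.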

Index Terms:
shared memory systems; buffer storage; performance evaluation; program compilers; false sharing; spatial locality; multiprocessor caches; data cache; shared-memory multiprocessors; cache miss rates; cache block; coherence transactions; interleaved fashion; shared data; programmer-transparent method.
J. Torrellas, H.S. Lam, J.L. Hennessy, "False Sharing and Spatial Locality in Multiprocessor Caches," IEEE Transactions on Computers, vol. 43, no. 6, pp. 651-663, June 1994, doi:10.1109/12.286299