Characterizing the Memory Behavior of Compiler-Parallelized Applications
December 1996 (vol. 7 no. 12)
pp. 1224-1237

Abstract—Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from standard benchmark suites, we measure statistics such as speedups, synchronization and load imbalance, causes of cache misses, cache line utilization, data traffic, and memory costs.

This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512-byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers and an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.
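The interaction between cache line size and parallelism granularity can be illustrated with a toy model. The sketch below is not the paper's methodology: the function name `lines_shared`, the array size, the element size, and the two loop schedules are all assumptions chosen for illustration. It counts how many cache lines of an array hold elements written by more than one processor, which is the precondition for false sharing.

```python
def lines_shared(n_elems, elem_bytes, line_bytes, n_procs, schedule):
    """Return (lines written by >1 processor, total lines touched).

    Hypothetical model: element i of a contiguous array is written by exactly
    one processor, chosen by a "block" or "cyclic" loop schedule.
    """
    per_line = line_bytes // elem_bytes      # elements that fit in one cache line
    chunk = -(-n_elems // n_procs)           # ceiling division: block size per processor
    owners = {}                              # cache line index -> set of writing processors
    for i in range(n_elems):
        proc = i % n_procs if schedule == "cyclic" else i // chunk
        owners.setdefault(i // per_line, set()).add(proc)
    shared = sum(1 for procs in owners.values() if len(procs) > 1)
    return shared, len(owners)


if __name__ == "__main__":
    # 1000 eight-byte elements split across 4 processors (assumed values).
    for line_bytes in (32, 512):
        for schedule in ("block", "cyclic"):
            shared, total = lines_shared(1000, 8, line_bytes, 4, schedule)
            print(f"{line_bytes:>3}-byte lines, {schedule:<6}: {shared}/{total} lines shared")
```

In this model the coarse-grained block schedule shares only the lines that straddle a partition boundary, while the fine-grained cyclic schedule shares every line. Lengthening the line from 32 to 512 bytes raises the shared fraction under the block schedule from 2 of 250 lines to 3 of 16.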

Index Terms:
Parallelizing compilers, memory hierarchies, shared-memory multiprocessors, cache performance, false and true sharing, parallelism granularity.
Evan Torrie, Margaret Martonosi, Chau-Wen Tseng, Mary W. Hall, "Characterizing the Memory Behavior of Compiler-Parallelized Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 12, pp. 1224-1237, Dec. 1996, doi:10.1109/71.553272