The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications
February 2012 (vol. 23 no. 2)
pp. 367-374
Eddy Zheng Zhang, The College of William and Mary, Williamsburg
Yunlian Jiang, The College of William and Mary, Williamsburg
Xipeng Shen, The College of William and Mary, Williamsburg
Cache sharing on modern Chip Multiprocessors (CMPs) reduces communication latency among corunning threads, but it also causes interthread cache contention. Most previous studies of the influence of cache sharing have concentrated on the design or management of the shared cache; the observed influence is often constrained by a reliance on simulators, the use of out-of-date benchmarks, or limited coverage of the deciding factors. This paper describes a systematic measurement of that influence, covering most of the potentially important factors. The measurement yields some surprising results. Contrary to the commonly perceived importance of cache sharing, neither its positive nor its negative effects are significant for most program executions in the PARSEC benchmark suite, regardless of the type of parallelism, input data set, architecture, number of threads, or assignment of threads to cores. A detailed analysis reveals that the main reason is a mismatch between the software design (and compilation) of multithreaded applications and CMP architectures. By transforming the source code of the programs in a cache-sharing-aware manner, we observe up to a 53 percent performance increase when the threads are placed on cores appropriately. This confirms the software-hardware mismatch as a main reason for the observed insignificance of cache sharing, and it indicates the important role of cache-sharing-aware transformations, a topic only sporadically studied so far, in exploiting the power of shared cache.
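The reported gains come from pairing cache-sharing-aware code transformations with careful thread-to-core placement. As an illustrative sketch only (not the paper's tooling), the placement half can be expressed on Linux with CPU affinity; the core numbers below are hypothetical, since which cores actually share a cache is machine-specific and would be read from the cache topology (e.g., under /sys/devices/system/cpu/cpu*/cache):

```python
import os

def pin_to_core(pid, core):
    """Restrict `pid` (0 = the calling process) to run only on `core`.

    Illustrative sketch: core 0 always exists, but whether cores 0 and 1
    share an L2 is machine-specific, so a real placement tool would
    consult the cache topology first before choosing core numbers.
    """
    os.sched_setaffinity(pid, {core})       # Linux-only system call
    return os.sched_getaffinity(pid)        # the effective CPU set

# Threads that share data would be pinned to sibling cores of one
# shared cache; threads that contend would be spread across caches.
print(pin_to_core(0, 0))  # the process now runs only on core 0
```

In a C/pthreads program such as the PARSEC benchmarks, the same placement is done per thread with `pthread_setaffinity_np`; the principle, assigning data-sharing threads to cores behind one cache, is unchanged.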

Index Terms:
Shared cache, thread scheduling, parallel program optimizations, chip multiprocessors.
Citation:
Eddy Zheng Zhang, Yunlian Jiang, Xipeng Shen, "The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 2, pp. 367-374, Feb. 2012, doi:10.1109/TPDS.2011.130