Issue No.07 - July (2012 vol.61)
pp: 913-927
Xi E. Chen, NVIDIA, Portland
ABSTRACT
This paper proposes an analytical model for accurately predicting the impact of contention on cache miss rates. The focus is multiprogrammed workloads running on multithreaded manycore architectures. This work addresses a key challenge that earlier cache contention models face when the number of concurrent threads exceeds the associativity of shared caches. The memory access characteristics of individual applications are obtained in isolation by profiling their circular sequences, and two new measures of access locality are proposed. An evaluation of this model in the context of a Niagara processor shows that it achieves an average error of 8.7 percent in miss rate predictions, a 48.1x improvement over the best prior model. This paper also presents a novel Markov chain throughput model. When the contention model is combined with the Markov chain model, throughput is estimated with an average error of 8.3 percent compared to detailed simulation. Moreover, the combined model tracks throughput sufficiently well to find the same optimized design point for application-specific workloads 65 times faster than detailed simulation. This paper also shows that the models accurately predict cache contention and throughput trends across various workloads on real hardware.
INDEX TERMS
Analytical modeling, cache contention, manycore, fine-grained multithreading, throughput.
CITATION
Xi E. Chen, "Modeling Cache Contention and Throughput of Multiprogrammed Manycore Processors", IEEE Transactions on Computers, vol.61, no. 7, pp. 913-927, July 2012, doi:10.1109/TC.2011.141