Issue No.12 - December (2009 vol.20)
pp: 1752-1763
Xudong Shi, Google Inc., Mountain View
Feiqi Su, Nvidia Corporation, Durham
Jih-Kwon Peir, University of Florida, Gainesville
Ye Xia, University of Florida, Gainesville
Zhen Yang, Nvidia Corporation, Santa Clara
ABSTRACT
Performance trade-offs between fast data access by local data replication and cache capacity maximization by global data sharing have been extensively studied for many-core Chip Multiprocessors (CMPs). Costly simulations over a wide spectrum of the design space are generally required to gain insight for a sound design. To lower the cost, we develop an abstract model for understanding the performance impact of data replication on CMP caches. To overcome the lack of real-time interactions among multiple cores in the model, we further develop an efficient single-pass stack simulation to study the performance of CMP cache organizations with various degrees of data replication. The global stack logically incorporates a shared stack and per-core private stacks; shared/private reuse (stack) distances can be collected in a single-pass simulation. With the reuse distances, one can calculate the performance of CMP cache organizations with various degrees of data replication. We verify both the model and the stack simulation against execution-driven simulations with commercial multithreaded workloads. The results show that the abstract model provides accurate information about performance trade-offs of data replication. The stack simulation accurately predicts the performance of various cache organizations with 2-9 percent error margins using only about 8 percent of the simulation time.
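As a rough illustration of how shared and private reuse (stack) distances can be gathered in one pass over a trace, the following is a minimal sketch and not the authors' global-stack implementation. It keeps one LRU stack for a fully shared cache and one per core, records the depth at which each access hits, and then derives miss ratios for any capacity from those depths. The function names (`lru_touch`, `single_pass`, `miss_ratio`), the `(core, block)` trace format, and the fully associative LRU assumption are all hypothetical choices made for this example; replication, invalidation, and coherence effects are ignored.

```python
def lru_touch(stack, block):
    """Move block to the MRU end of the stack.

    Returns the block's previous 1-based depth (1 = most recently used),
    or None if this is the first reference to the block (a cold miss).
    """
    if block in stack:
        depth = len(stack) - stack.index(block)
        stack.remove(block)
    else:
        depth = None
    stack.append(block)
    return depth


def single_pass(trace):
    """Collect (shared_depth, private_depth) for every access in one pass.

    trace is a list of (core, block) pairs; this format is an assumption
    of the sketch, not something specified in the abstract.
    """
    shared = []    # one LRU stack modeling a fully shared cache
    private = {}   # core id -> LRU stack modeling that core's private cache
    result = []
    for core, block in trace:
        s = lru_touch(shared, block)
        p = lru_touch(private.setdefault(core, []), block)
        result.append((s, p))
    return result


def miss_ratio(depths, capacity):
    """Miss ratio of a fully associative LRU cache holding `capacity` blocks."""
    misses = sum(1 for d in depths if d is None or d > capacity)
    return misses / len(depths)


if __name__ == "__main__":
    trace = [(0, "A"), (1, "B"), (0, "A"), (1, "A"), (0, "B")]
    depths = single_pass(trace)
    shared_depths = [s for s, _ in depths]
    for size in (1, 2):
        print(size, miss_ratio(shared_depths, size))
```

Because an access hits in an LRU cache of capacity C exactly when its stack depth is at most C, the recorded depths are enough to evaluate every cache size without rerunning the trace, which is the property that makes single-pass stack simulation inexpensive.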
INDEX TERMS
Cache memories, chip multiprocessors, performance modeling, stack simulation.
CITATION
Xudong Shi, Feiqi Su, Jih-Kwon Peir, Ye Xia, Zhen Yang, "Modeling and Stack Simulation of CMP Cache Capacity and Accessibility", IEEE Transactions on Parallel & Distributed Systems, vol.20, no. 12, pp. 1752-1763, December 2009, doi:10.1109/TPDS.2009.31