Performance and Energy Implications of Many-Core Caches for Throughput Computing
November/December 2010 (vol. 30 no. 6)
pp. 25-35
Christopher Hughes, Intel, Santa Clara
Changkyu Kim, Intel, Santa Clara
Yen-Kuang Chen, Intel Corporation, Santa Clara

Processors that target throughput computing often have many cores, which stresses the cache hierarchy. Logically centralized, shared data storage is needed for many-core chips to provide high cache throughput for heavily read-write shared lines. Techniques to reduce on-die and off-die traffic have a dramatic energy benefit for many-core chips.

Index Terms:
multicore/single-chip multiprocessors, memory hierarchy, graphics processors, throughput computing
Christopher Hughes, Changkyu Kim, Yen-Kuang Chen, "Performance and Energy Implications of Many-Core Caches for Throughput Computing," IEEE Micro, vol. 30, no. 6, pp. 25-35, Nov.-Dec. 2010, doi:10.1109/MM.2010.83