Issue No.05 - May (2012 vol.23)
pp: 853-861
Enric Herrero, Universitat Politècnica de Catalunya, Barcelona
José González, Intel Corp., Intel Labs Barcelona, Barcelona
Ramon Canal, Universitat Politècnica de Catalunya, Barcelona
ABSTRACT
Current trends in chip multiprocessors (CMPs) indicate that core counts will increase in the near future. One of the main performance limiters of these forthcoming microarchitectures is the latency and high demand of on-chip network and off-chip memory communication. A key trade-off when searching for an optimal cache hierarchy is the sharing degree of the cache space and its on-die distribution. Several techniques have appeared recently that optimize these parameters to improve performance. This work provides insight into the most promising configurations for tiled microarchitectures and shows the advantages and limitations of each in terms of performance and energy efficiency. This paper extends previous work with a complete study that evaluates different network topologies, single-threaded and multithreaded benchmarks, and single-program and multiprogrammed execution. In all these studies, Distributed Cooperative Caching proves to be a promising alternative to traditional configurations for chip multiprocessors, providing a scalable and energy-efficient solution.
INDEX TERMS
Tiled microarchitectures, memory hierarchy, energy efficiency.
CITATION
Enric Herrero, José González, Ramon Canal, "Distributed Cooperative Caching: An Energy Efficient Memory Scheme for Chip Multiprocessors," IEEE Transactions on Parallel & Distributed Systems, vol. 23, no. 5, pp. 853-861, May 2012, doi:10.1109/TPDS.2011.200