Issue No.05 - May (2013 vol.62)
pp: 914-928
Christian Fensch, University of Edinburgh, Edinburgh
Nick Barrow-Williams, Nvidia Corporation, Santa Clara
Robert D. Mullins, University of Cambridge, Cambridge
Simon Moore, University of Cambridge, Cambridge
Many-core architectures provide an efficient way of harnessing the growing number of transistors available. However, the energy and latency costs of communication increasingly limit the parallel programs running on these platforms. Existing designs provide a functional communication layer, but not necessarily the most efficient one. Because of power limitations, efficiency is now a primary concern, which motivates us to look again at cache coherence. First, we analyze the communication behavior of parallel applications. The observed sharing patterns reveal considerable locality of shared data accesses between threads with consecutive IDs. This pattern corresponds to strong physical locality between adjacent cores in a chip multiprocessor (CMP). To exploit these patterns and improve the efficiency of communication, this paper explores the design of Proximity Coherence: a novel scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links. The results show that careful analysis of sharing behavior leads to the design of a more efficient coherence protocol. The protocol reduces the latency of load misses by up to 33 percent (17 percent on average), improving overall execution time by up to 13 percent. It also reduces network-on-chip traffic by 19 percent and energy consumption by up to 30 percent.
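As a rough illustration of the idea described in the abstract, the following toy model shows a load miss being offered to nearby caches before falling back to the regular coherence path. It is only a sketch under stated assumptions, not the protocol from the paper: "nearby" is assumed to mean the directly adjacent cores of a 2D mesh, caches are modeled as plain sets of line addresses, and writes, invalidations, and coherence states are ignored. The names Core, neighbours, and load are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    core_id: int
    # Toy L1 cache: just the set of cache-line addresses currently held.
    l1: set = field(default_factory=set)


def neighbours(core_id, width, height):
    """IDs of the cores directly adjacent to core_id in a width x height mesh.

    Assumption: cores are numbered row by row, so consecutive IDs are
    usually physical neighbours.
    """
    x, y = core_id % width, core_id // width
    out = []
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            out.append(ny * width + nx)
    return out


def load(cores, core_id, addr, width, height):
    """Return where a load is satisfied: 'l1', 'neighbour', or 'directory'."""
    core = cores[core_id]
    if addr in core.l1:
        return "l1"
    # L1 miss: optimistically probe the adjacent caches first.
    for n in neighbours(core_id, width, height):
        if addr in cores[n].l1:
            core.l1.add(addr)
            return "neighbour"
    # No nearby copy: fall back to the ordinary (directory) path.
    core.l1.add(addr)
    return "directory"


if __name__ == "__main__":
    cores = [Core(i) for i in range(16)]   # hypothetical 4x4 mesh
    cores[1].l1.add(0x40)                  # core 1 already holds line 0x40
    print(load(cores, 0, 0x40, 4, 4))      # miss in core 0, served by core 1
    print(load(cores, 15, 0x40, 4, 4))     # no adjacent copy
```

In this model a thread on core 0 that reads data recently touched by the thread on core 1 is served by its neighbour, which is the case the observed sharing locality between consecutive thread IDs makes common.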
Central Processing Unit, Coherence, Protocols, Transistors, Computers, Energy consumption, Educational institutions, network-on-chip, physical locality, Proximity coherence, CMP, cache design
Christian Fensch, Nick Barrow-Williams, Robert D. Mullins, Simon Moore, "Designing a Physical Locality Aware Coherence Protocol for Chip-Multiprocessors", IEEE Transactions on Computers, vol. 62, no. 5, pp. 914-928, May 2013, doi:10.1109/TC.2012.52