Issue No.08 - Aug. (2012 vol.23)
pp: 1453-1466
José L. Abellán, University of Murcia, Murcia
Juan Fernández, Intel Barcelona Research Center, Intel Labs, Universitat Politècnica de Catalunya, Barcelona
Manuel E. Acacio, Universidad de Murcia, Murcia
ABSTRACT
Traditional software-based barrier implementations for shared memory parallel machines tend to produce hotspots in terms of memory and network contention as the number of processors increases. This could limit their applicability to future many-core CMPs in which possibly several dozens of cores would need to be synchronized efficiently. In this work, we develop GBarrier, a hardware-based barrier mechanism especially aimed at providing efficient barriers in future many-core CMPs. Our proposal deploys a dedicated G-line-based network to allow for fast and efficient signaling of barrier arrival and departure. Since GBarrier does not have any influence on the memory system, we avoid all coherence activity and barrier-related network traffic that traditional approaches introduce and that restrict scalability. Through detailed simulations of a 32-core CMP, we compare GBarrier against one of the most efficient software-based barrier implementations for a set of kernels and scientific applications. Evaluation results show average reductions of 54 and 21 percent in execution time, 53 and 18 percent in network traffic, and also 76 and 31 percent in the energy-delay² product metric for the full CMP when the kernels and scientific applications, respectively, are considered.
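For illustration only, the kind of shared-variable barrier the abstract contrasts GBarrier with can be sketched as a textbook centralized sense-reversing barrier. This is not the software baseline the authors evaluate (the abstract does not name it), and every identifier below (SenseReversingBarrier, run_demo, and so on) is hypothetical. The sketch assumes a lock stands in for an atomic fetch-and-decrement. Every thread updates one shared counter and then spins on one shared flag, which is the access pattern that turns into memory and network hotspots on a cache-coherent machine as the core count grows.

```python
import threading
import time


class SenseReversingBarrier:
    """Centralized software barrier: one shared counter plus one shared flag."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = num_threads        # threads still expected in this episode
        self.sense = False              # shared flag that all waiters spin on
        self.lock = threading.Lock()    # assumption: stands in for an atomic decrement
        self.local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense on every barrier episode.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1             # every arrival writes the shared counter
            last = self.count == 0
        if last:
            # The last arriver resets the counter, then releases the others.
            self.count = self.num_threads
            self.sense = my_sense
        else:
            # Everyone else polls the shared flag until it matches.
            while self.sense != my_sense:
                time.sleep(0)           # yield so other threads can run


def run_demo(num_threads=4, phases=3):
    """Run a few barrier episodes and record any thread that left early."""
    barrier = SenseReversingBarrier(num_threads)
    arrived = [0] * phases
    violations = []
    bookkeeping = threading.Lock()

    def worker():
        for phase in range(phases):
            with bookkeeping:
                arrived[phase] += 1
            barrier.wait()
            # After the barrier, every thread must have arrived in this phase.
            if arrived[phase] != num_threads:
                with bookkeeping:
                    violations.append(phase)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrived, violations
```

Calling run_demo(4, 3) runs four threads through three barrier episodes. A hardware scheme such as the one described above would replace the shared counter and flag with dedicated signaling, so none of these accesses would reach the memory system.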
INDEX TERMS
Many-core CMPs, barrier synchronization, global lines, S-CSMA, cache coherence, scalability, energy efficiency.
CITATION
José L. Abellán, Juan Fernández, Manuel E. Acacio, "Efficient Hardware Barrier Synchronization in Many-Core CMPs", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 8, pp. 1453-1466, Aug. 2012, doi:10.1109/TPDS.2011.304