Issue No.11 - Nov. (2012 vol.23)
pp: 2010-2023
Pablo Abad, University of Cantabria, Santander
Valentin Puente, University of Cantabria, Santander
Lucia G. Menezo, University of Cantabria, Santander
Jose Angel Gregorio, University of Cantabria, Santander
ABSTRACT
Multidestination communications are a highly necessary capability for many coherence protocols in order to minimize on-chip hit latency. Although CMPs share this necessity, up to now few suitable proposals have been developed. The combination of resource scarcity and the common idea that multicast support requires a substantial amount of extra resources is responsible for this situation. In this work, we propose a new approach suitable for on-chip networks capable of managing multidestination traffic via hardware in an efficient way with negligible complexity. We introduce a novel multicast routing mechanism, able to circumvent many of the limitations of conventional multicast schemes. Adaptive-tree multicasting is able to maintain correctness for multiflit multicast messages without routing restrictions, while also coupling correctness and performance in a natural way. Replication restrictions not only guarantee the presence of enough resources to avoid deadlock, but also dynamically adapt tree shape to network conditions, routing multicast messages through noncongested paths. The performance results, using a state-of-the-art full system simulation framework, show that it improves the average full system performance of a CMP by 20 percent and network ED2P by 15 percent, when compared to a state-of-the-art router with conventional multicast support and similar implementation cost.
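The abstract reports a network ED2P figure without defining the metric. The sketch below assumes the usual definition of ED2P as the energy-delay-squared product (energy multiplied by the square of the delay, lower is better) and shows how a relative reduction between two designs could be computed. The function names and all numeric values are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from the paper.

```python
def ed2p(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay-squared product, E * D^2 (assumed standard definition)."""
    return energy_joules * delay_seconds ** 2


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical inputs: same delay, lower energy for the candidate design.
baseline = ed2p(energy_joules=2.0, delay_seconds=1.0)
candidate = ed2p(energy_joules=1.7, delay_seconds=1.0)

print(f"ED2P reduction: {relative_reduction(baseline, candidate):.0%}")
```

Because delay enters squared, a change in latency moves this metric more than the same proportional change in energy.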
INDEX TERMS
Routing, System recovery, Unicast, Proposals, Protocols, Hardware, Vectors, router microarchitecture, Chip multiprocessor (CMP), multicast and broadcast communications, network-on-chip
CITATION
Pablo Abad, Valentin Puente, Lucia G. Menezo, Jose Angel Gregorio, "Adaptive-Tree Multicast: Efficient Multidestination Support for CMP Communication Substrate", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 11, pp. 2010-2023, Nov. 2012, doi:10.1109/TPDS.2012.45