Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and Their Impact
August 2000 (vol. 11 no. 8)
pp. 794-812

Abstract—Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based parallel systems is nontrivial. In this paper, we propose alternative switch architectures with differing buffer organizations to implement multidestination worms on switch-based parallel systems. First, we discuss issues related to such implementation (deadlock-freedom, replication mechanisms, header encoding, and routing). Next, we demonstrate how an existing central-buffer-based switch architecture supporting unicast message passing can be enhanced to accommodate multidestination message passing. Similarly, implementing multidestination worms on an input-buffer-based switch architecture is discussed, and two architectural alternatives are presented that reduce the wiring complexity in a practical switch implementation. The central-buffer-based and input-buffer-based implementations are evaluated against each other, as well as against the corresponding software-based schemes. Simulation experiments under a range of traffic (multiple multicast, bimodal, varying degree of multicast, and message length) and system size are used for evaluation. The study demonstrates the superiority of the central-buffer-based switch architecture. It also indicates that under bimodal traffic the central-buffer-based hardware multicast implementation affects background unicast traffic less adversely compared to a software-based multicast implementation. These results show that multidestination message passing can be applied easily and effectively to switch-based parallel systems to deliver good multicast and collective communication performance.

Index Terms:
Parallel computer architecture, switch/router architecture, wormhole switching, cut-through switching, multicast, broadcast, collective communication, interconnection networks, performance evaluation.
Rajeev Sivaram, Craig B. Stunkel, Dhabaleswar K. Panda, "Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and Their Impact," IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 8, pp. 794-812, Aug. 2000, doi:10.1109/71.877938