Fault-Tolerant Communication with Partitioned Dimension-Order Routers
October 1999 (vol. 10 no. 10)
pp. 1026-1039

Abstract: Current fault-tolerant routing methods require extensive changes to practical routers, such as the Cray T3D's dimension-order router, to handle faults. In this paper, we propose methods to handle faults in multicomputers with dimension-order routers using only simple changes to router structure and logic. Our techniques can be applied to current implementations in which the router is partitioned into multiple modules and no centralized crossbar is used. We consider arbitrarily located faulty blocks and assume only local knowledge of faults. We apply our techniques to torus networks and show that, with as few as four virtual channels per physical channel, deadlock- and livelock-free routing can be provided even with multiple faults and multimodule router implementations. Our simulations of the proposed technique for 2D tori and meshes indicate that the performance degradation is similar to that seen with previously proposed crossbar-based designs.
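
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of plain, fault-free dimension-order routing in a k x k torus. It is an illustration only, not the paper's fault-tolerant method: it ignores faults, virtual channels, and the partitioned router structure. The function name `dimension_order_route` and its parameters are hypothetical, chosen for this example.

```python
def dimension_order_route(src, dst, k):
    """Return the nodes visited from src to dst in a k-ary torus.

    Illustrative assumption: fault-free network, every dimension has k nodes,
    and each dimension is fully corrected before the next one is touched.
    """
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(cur)):
        # Hops needed when travelling in the + direction around the ring.
        forward = (dst[dim] - cur[dim]) % k
        # Take the shorter way around the ring; ties go to the + direction.
        step = 1 if forward <= k - forward else -1
        hops = forward if step == 1 else k - forward
        for _ in range(hops):
            cur[dim] = (cur[dim] + step) % k  # wraparound link at the edges
            path.append(tuple(cur))
    return path


if __name__ == "__main__":
    # In a 4 x 4 torus, going from (0, 0) to (3, 1) uses the wraparound link
    # in dimension 0 first, then moves one hop in dimension 1.
    print(dimension_order_route((0, 0), (3, 1), 4))
```

Because the order in which dimensions are corrected is fixed, a message has no alternative path around a faulty node or link, which is the limitation the fault-handling techniques in this paper address.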

Index Terms:
Cray T3D router, dimension-order router, fault-tolerant routing, multicomputer networks, message routing, torus networks, wormhole routing.
Rajendra V. Boppana, Suresh Chalasani, "Fault-Tolerant Communication with Partitioned Dimension-Order Routers," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 10, pp. 1026-1039, Oct. 1999, doi:10.1109/71.808144