A Routing Methodology for Achieving Fault Tolerance in Direct Networks
April 2006 (vol. 55 no. 4)
pp. 400-415
Pedro López, IEEE Computer Society
Antonio Robles, IEEE Computer Society
Jose Duato, IEEE
Olav Lysne, IEEE
Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role in the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.
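
The intermediate-node idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes a 2D mesh, a static set of failed nodes, and a simplified safety rule chosen for illustration: a subpath counts as safe only if no failed node lies anywhere in the rectangle spanned by its endpoints, so that any minimal adaptive choice avoids the faults. The function names (minimal_region, subpath_is_safe, choose_route) are hypothetical.

```python
from itertools import product

def minimal_region(a, b):
    """Nodes that lie on at least one minimal path between a and b in a 2D mesh."""
    (ax, ay), (bx, by) = a, b
    xs = range(min(ax, bx), max(ax, bx) + 1)
    ys = range(min(ay, by), max(ay, by) + 1)
    return set(product(xs, ys))

def subpath_is_safe(a, b, faulty):
    """Assumed rule: safe if no failed node is on any minimal path from a to b."""
    return minimal_region(a, b).isdisjoint(faulty)

def hops(a, b):
    """Manhattan distance between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def choose_route(src, dst, faulty, width, height):
    """Return [src, dst], [src, intermediate, dst], or None if neither is safe."""
    if subpath_is_safe(src, dst, faulty):
        return [src, dst]  # no fault in the way: route directly
    best = None
    for node in product(range(width), range(height)):
        if node in faulty or node in (src, dst):
            continue
        # Both subpaths must be fault-free under the assumed rule.
        if subpath_is_safe(src, node, faulty) and subpath_is_safe(node, dst, faulty):
            cost = hops(src, node) + hops(node, dst)
            if best is None or cost < best[0]:
                best = (cost, node)  # keep the shortest total path found first
    return None if best is None else [src, best[1], dst]

# Fault at (1, 1) in a 4x4 mesh: one intermediate node is enough.
print(choose_route((0, 0), (3, 3), {(1, 1)}, 4, 4))
# Fault on the only row between source and destination: no single
# intermediate node satisfies the assumed rule.
print(choose_route((0, 0), (3, 0), {(1, 0)}, 4, 4))
```

Under this deliberately strict rule the second call finds no route. That is the kind of case for which the abstract proposes extensions such as disabling adaptive routing, misrouting, or using more than one intermediate node.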

Index Terms:
Fault tolerance, direct networks, adaptive routing, virtual channels, bubble flow control.
Citation:
María Engracia Gómez, Nils Agne Nordbotten, José Flich, Pedro López, Antonio Robles, Jose Duato, Tor Skeie, Olav Lysne, "A Routing Methodology for Achieving Fault Tolerance in Direct Networks," IEEE Transactions on Computers, vol. 55, no. 4, pp. 400-415, April 2006, doi:10.1109/TC.2006.46