Dynamic Reconfiguration in Computer Clusters with Irregular Topologies in the Presence of Multiple Node and Link Failures
May 2005 (vol. 54 no. 5)
pp. 603-615
Component failures in high-speed computer networks can result in significant topological changes. In such cases, a network reconfiguration algorithm must be executed to restore connectivity between the network nodes. Most contemporary networks either use static reconfiguration algorithms or stop user traffic in order to prevent cyclic dependencies in the routing tables. This paper presents NetRec, a dynamic network reconfiguration algorithm that tolerates multiple node and link failures in high-speed networks with arbitrary topology. The algorithm updates the routing tables asynchronously and does not require any global knowledge of the network topology. Certain phases of NetRec are executed in parallel, which reduces the reconfiguration time. The algorithm suspends application traffic in small regions of the network, and only while the routing tables are being updated. The message complexity of NetRec is analyzed, and the termination, liveness, and safety of the algorithm are proven. Additionally, results are presented from validating the algorithm in Distant, a distributed network-validation testbed based on the MPI 1.2 features for building arbitrary virtual topologies.
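
To make the phrase "cyclic dependencies in the routing tables" concrete, the following minimal sketch builds a channel dependency graph from destination-based routing tables and checks it for cycles. It is not NetRec and is not taken from the paper; the table layout `next_hop[u][d]` and the function names `channel_dependencies` and `has_cycle` are assumptions made purely for illustration.

```python
from collections import defaultdict

def channel_dependencies(next_hop):
    """Build the channel dependency graph of a routing function.

    next_hop[u][d] is the neighbor to which node u forwards packets
    destined for d (hypothetical table layout, assumed for this sketch).
    A channel is a directed link (u, v). Channel (u, v) depends on
    (v, w) if some packet can arrive over (u, v) and leave over (v, w).
    """
    deps = defaultdict(set)
    for u, table in next_hop.items():
        for d, v in table.items():
            if v == d:
                continue  # delivered at v, no further channel is requested
            w = next_hop[v][d]
            deps[(u, v)].add((v, w))
    return deps

def has_cycle(deps):
    """Depth-first search for a cycle in the dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every channel starts WHITE (unvisited)

    def visit(channel):
        color[channel] = GREY
        for nxt in deps.get(channel, ()):
            if color[nxt] == GREY:
                return True  # back edge: a cycle of waiting channels
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(deps))

# Four nodes in a ring, every packet forwarded clockwise.
ring = {u: {d: (u + 1) % 4 for d in range(4) if d != u} for u in range(4)}

# Three nodes in a line: 0 - 1 - 2.
line = {0: {1: 1, 2: 1}, 1: {0: 0, 2: 2}, 2: {0: 1, 1: 1}}

ring_cyclic = has_cycle(channel_dependencies(ring))
line_cyclic = has_cycle(channel_dependencies(line))
```

In the ring, each clockwise link waits on the next one, so the dependencies close into a loop; in the line they do not. A reconfiguration scheme that swaps routing tables while traffic is flowing has to avoid creating such a loop at any intermediate moment, which is why many schemes fall back on static reconfiguration or on stopping user traffic.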

Index Terms:
Dynamic reconfiguration, multiple node and link failures, fault tolerance, clusters of workstations, irregular topologies.
Citation:
Dimiter Avresky, Natcho Natchev, "Dynamic Reconfiguration in Computer Clusters with Irregular Topologies in the Presence of Multiple Node and Link Failures," IEEE Transactions on Computers, vol. 54, no. 5, pp. 603-615, May 2005, doi:10.1109/TC.2005.76