Immunet: Dependable Routing for Interconnection Networks with Arbitrary Topology
December 2008 (vol. 57 no. 12)
pp. 1676-1689
Valentin Puente, University of Cantabria, Santander
José Angel Gregorio, University of Cantabria, Santander
Fernando Vallejo, University of Cantabria, Santander
Ramón Beivide, University of Cantabria, Santander
This paper describes Immunet, a complete mechanism for tolerating multiple failures in parallel computer systems. Immunet can be applied to arbitrary topologies, either regular or irregular, and exhibits graceful performance degradation in both cases. Provided that the network remains connected, Immunet is able to deal with any number of failures regardless of their spatial and temporal distribution. The mechanism operates by dynamically reconfiguring the network in response to failures. The reconfiguration employs only local information recorded at the router nodes, which leads to a highly scalable system. In addition, its low cost and overhead permit a practicable hardware implementation. Finally, Immunet could allow failures to be circumvented transparently to applications running on a parallel system, because it does not require dropping in-flight traffic. Only packets stored in or traveling through a broken component should be recovered by higher system levels.
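
The precondition stated in the abstract, that the network remains connected after the failures, can be illustrated with a small sketch. This is not Immunet's reconfiguration algorithm: it is a global breadth-first check, whereas the abstract says Immunet relies only on local information at the routers. The function name `remains_connected`, its parameters, and the 4-node ring are assumptions made for illustration.

```python
from collections import deque


def remains_connected(nodes, links, failed_nodes=(), failed_links=()):
    """Return True if every surviving node can still reach every other one.

    Hypothetical helper, not taken from the paper: nodes are hashable ids,
    links are undirected (a, b) pairs.
    """
    failed_nodes = set(failed_nodes)
    dead_links = {frozenset(link) for link in failed_links}
    alive = [n for n in nodes if n not in failed_nodes]
    if not alive:
        return False

    # Build the adjacency of the surviving network only.
    adj = {n: set() for n in alive}
    for a, b in links:
        if a in adj and b in adj and frozenset((a, b)) not in dead_links:
            adj[a].add(b)
            adj[b].add(a)

    # Breadth-first search from an arbitrary surviving node.
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(alive)


# Assumed example topology: a 4-node ring 0-1-2-3-0.
ring_nodes = [0, 1, 2, 3]
ring_links = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

In this ring, a single failed link or a single failed node leaves the survivors connected, while two failed links on opposite sides split the network, which is the case the abstract excludes.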

Index Terms:
Parallel Architectures, Interconnection architectures, Support for reliability
Valentin Puente, José Angel Gregorio, Fernando Vallejo, Ramón Beivide, "Immunet: Dependable Routing for Interconnection Networks with Arbitrary Topology," IEEE Transactions on Computers, vol. 57, no. 12, pp. 1676-1689, Dec. 2008, doi:10.1109/TC.2008.95