Fault-Tolerant Adaptive and Minimal Routing in Mesh-Connected Multicomputers Using Extended Safety Levels
February 2000 (vol. 11 no. 2)
pp. 149-159

Abstract: The minimal routing problem in mesh-connected multicomputers with faulty blocks is studied. Two-dimensional meshes are used to illustrate the approach. A sufficient condition for minimal routing in 2D meshes with faulty blocks is proposed. Unlike many traditional models that assume all nodes know the global fault distribution, our approach is based on the concept of an extended safety level, which is a special form of limited fault information. The extended safety level information is captured by a vector associated with each node. When the safety level of a node reaches a certain level (or meets certain conditions), a minimal path exists from this node to any nonfaulty node in the 2D mesh. Specifically, we study the existence of minimal paths at a given source node, the limited distribution of fault information, and minimal routing itself. We propose three fault-tolerant minimal routing algorithms that are adaptive, allowing all messages to use any minimal path. We also provide some general ideas for extending our approach to other low-dimensional mesh-connected multicomputers such as 2D tori and 3D meshes. Our approach is the first attempt to address adaptive and minimal routing in 2D meshes with faulty blocks using limited fault information.
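
The abstract does not spell out the safety vector or the sufficient condition, so the following Python sketch is only an illustration under stated assumptions. It assumes each node holds a four-component vector (east, south, west, north) giving the hop distance to the nearest faulty block in each direction, and that faulty blocks are disjoint, well-separated rectangles. The function names, the vector layout, and the exact form of the condition are hypothetical and may differ from the paper's definitions. A brute-force check over monotone paths is included for comparison.

```python
from math import inf
from typing import List, Tuple

# Assumed representation: a faulty block is an inclusive rectangle.
Block = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
Node = Tuple[int, int]

def is_faulty(x: int, y: int, blocks: List[Block]) -> bool:
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in blocks)

def safety_vector(x: int, y: int, blocks: List[Block]):
    """Hypothetical vector (east, south, west, north): hops to the
    nearest faulty node in each direction, or inf if there is none."""
    east = south = west = north = inf
    for x0, y0, x1, y1 in blocks:
        if y0 <= y <= y1:              # block crosses this node's row
            if x0 > x:
                east = min(east, x0 - x)
            if x1 < x:
                west = min(west, x - x1)
        if x0 <= x <= x1:              # block crosses this node's column
            if y0 > y:
                north = min(north, y0 - y)
            if y1 < y:
                south = min(south, y - y1)
    return east, south, west, north

def condition_holds(src: Node, dst: Node, blocks: List[Block]) -> bool:
    """Assumed sufficient condition: no faulty block lies on the source's
    row or column within the span toward the destination."""
    east, south, west, north = safety_vector(src[0], src[1], blocks)
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    clear_x = abs(dx) < (east if dx >= 0 else west)
    clear_y = abs(dy) < (north if dy >= 0 else south)
    return clear_x and clear_y

def minimal_path_exists(src: Node, dst: Node, blocks: List[Block]) -> bool:
    """Ground truth using global fault knowledge: dynamic programming
    over monotone (shortest) paths inside the source-destination box."""
    sx = 1 if dst[0] >= src[0] else -1
    sy = 1 if dst[1] >= src[1] else -1
    nx, ny = abs(dst[0] - src[0]), abs(dst[1] - src[1])
    reach = [[False] * (ny + 1) for _ in range(nx + 1)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            if is_faulty(src[0] + sx * i, src[1] + sy * j, blocks):
                continue
            if i == 0 and j == 0:
                reach[i][j] = True
            else:
                reach[i][j] = (i > 0 and reach[i - 1][j]) or \
                              (j > 0 and reach[i][j - 1])
    return reach[nx][ny]

# Two well-separated faulty blocks in a 2D mesh (illustrative data only).
blocks = [(2, 2, 3, 3), (6, 1, 6, 4)]
print(safety_vector(0, 2, blocks))
print(condition_holds((0, 2), (1, 5), blocks))
print(condition_holds((0, 2), (5, 5), blocks),
      minimal_path_exists((0, 2), (5, 5), blocks))
```

In this example the check from (0, 2) to (5, 5) fails even though a minimal path exists, which matches the abstract's wording that the condition is sufficient rather than necessary. The local check reads only the source's own vector, whereas the brute-force version needs the full fault map.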

Index Terms:
Fault tolerance, mesh-connected multicomputers, minimal routing.
Citation:
Jie Wu, "Fault-Tolerant Adaptive and Minimal Routing in Mesh-Connected Multicomputers Using Extended Safety Levels," IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 2, pp. 149-159, Feb. 2000, doi:10.1109/71.841751