Routing in Modular Fault-Tolerant Multiprocessor Systems
November 1995 (vol. 6 no. 11)
pp. 1206-1220

Abstract: In this paper, we consider a class of modular multiprocessor architectures in which spares are added to each module to cover for faulty nodes within that module, thus forming a fault-tolerant basic block (FTBB). In contrast to reconfiguration techniques that preserve the physical adjacency between active nodes in the system, our goal is to preserve the logical adjacency between active nodes by means of a routing algorithm that delivers messages successfully to their destinations. We introduce two-phase routing strategies that route messages first to their destination FTBB, and then to the destination node within that FTBB. Such a strategy may be applied to a variety of architectures, including binary hypercubes and three-dimensional tori. In the presence of f faults in hypercubes and tori, we show that the worst-case length of the message route is min{σ + f, (K + 1)σ} + c, where σ is the length of the shortest path in the absence of faults, K is the number of spare nodes in an FTBB, and c is a small constant. The average routing overhead is much lower than the worst-case overhead.
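
As a rough illustration of the worst-case bound stated in the abstract, the following sketch evaluates min{σ + f, (K + 1)σ} + c for a binary hypercube, where σ can be taken as the Hamming distance between source and destination addresses. The function names are hypothetical, and because the abstract does not give a value for the constant c, it is left as a parameter that defaults to 0 purely for illustration.

```python
def hamming_distance(src: int, dst: int) -> int:
    """Shortest fault-free path length between two binary hypercube nodes."""
    return bin(src ^ dst).count("1")


def worst_case_route_length(sigma: int, faults: int, spares_per_ftbb: int, c: int = 0) -> int:
    """Evaluate min{sigma + f, (K + 1) * sigma} + c from the abstract.

    sigma           -- shortest path length in the absence of faults
    faults          -- number of faults f in the system
    spares_per_ftbb -- number of spare nodes K in each FTBB
    c               -- small constant; its value is NOT given in the abstract,
                       so the default of 0 is an assumption for illustration
    """
    return min(sigma + faults, (spares_per_ftbb + 1) * sigma) + c


# Example: nodes 0000 and 1011 differ in 3 bits, so sigma = 3.
sigma = hamming_distance(0b0000, 0b1011)
# With f = 2 faults and K = 1 spare per FTBB: min(3 + 2, 2 * 3) + 0
bound = worst_case_route_length(sigma, faults=2, spares_per_ftbb=1)
print(sigma, bound)
```

When f is small, the σ + f term dominates the minimum; once f exceeds Kσ, the bound is capped by (K + 1)σ regardless of how many more faults occur.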

Index Terms:
Sparing, modular multiprocessors, fault-tolerant routing, hypercube multicomputers, mesh-connected processors.
M. Sultan Alam, Rami G. Melhem, "Routing in Modular Fault-Tolerant Multiprocessor Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 11, pp. 1206-1220, Nov. 1995, doi:10.1109/71.476192