Systematic Design of Fault-Tolerant Multiprocessors with Shared Buses
April 1997 (vol. 46 no. 4)
pp. 439-455

Abstract—A multiprocessor system is fault-tolerant (FT) if it preserves a fault-free subsystem of a predetermined interconnection structure when faults appear. We present a new method for designing FT multiprocessors that can efficiently tolerate both processor and interconnection faults. The approach is general, in that it can be applied to any multiprocessor topology. Shared buses serve as the main interconnection mechanism to minimize the switching logic needed for reconfiguration. We employ processor-bus-link (PBL) graphs to model multiprocessors with either dedicated or shared buses. Both processors and buses are represented as nodes so that bus faults can be considered explicitly and tolerated efficiently by spare buses instead of by spare processors. A minimum number of spare processors and buses are used to reduce hardware overhead. The node covering concept and the maximum-weight spanning tree algorithm are then employed to construct FT systems that have lower interconnection cost than most previous designs. We also present a cost-effective implementation method which is suitable for both static and dynamic reconfiguration techniques. The FT systems obtained have the advantages of no critical single point of failure, low redundancy, local replacement, and simple circuitry for fast reconfiguration.
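As a rough illustration of two ideas named in the abstract, the sketch below models processors and buses alike as graph nodes, with links as weighted edges, and runs a textbook Kruskal-style maximum-weight spanning tree over them. The node names, link weights, and the function name max_weight_spanning_tree are hypothetical, chosen only for this example. The abstract does not say how the paper assigns weights or how it combines the tree with node covering, so this is not the authors' construction.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (node_u, node_v, weight)


def max_weight_spanning_tree(nodes: List[str], edges: List[Edge]) -> List[Edge]:
    """Kruskal's algorithm with edges visited in decreasing weight order."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree: List[Edge] = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so it creates no cycle
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# Hypothetical graph: processors p0..p3 and buses b0, b1 are all nodes;
# each link joins one processor to one bus. Weights are made up.
nodes = ["p0", "p1", "p2", "p3", "b0", "b1"]
links = [
    ("p0", "b0", 3), ("p1", "b0", 2), ("p2", "b0", 1),
    ("p2", "b1", 4), ("p3", "b1", 2), ("p1", "b1", 1),
]

tree = max_weight_spanning_tree(nodes, links)
total = sum(w for _, _, w in tree)
print(tree, total)
```

Because buses are ordinary nodes in this representation, a bus fault can be modeled the same way as a processor fault, by deleting a node, which is the property the abstract highlights.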

Index Terms:
Fault tolerance, graph model, interconnection method, multiple-bus architecture, point-to-point connection, VLSI design.
Hung-Kuei Ku, John P. Hayes, "Systematic Design of Fault-Tolerant Multiprocessors with Shared Buses," IEEE Transactions on Computers, vol. 46, no. 4, pp. 439-455, April 1997, doi:10.1109/12.588058