Some Practical Issues in the Design of Fault-Tolerant Multiprocessors
May 1992 (vol. 41 no. 5)
pp. 588-598

This paper examines methods for modeling and implementing several practical aspects of fault-tolerant multiprocessor systems that prior research has largely neglected. The node-covering design approach is generalized to accommodate systems whose structure and failure mechanisms are represented by arbitrary graphs. Several new types of covering graphs are defined, which lead to useful design tradeoffs. A new technique for incremental design is presented, using a class of switch implementations that reduce a system's interconnection costs. The reduction of other cost factors is also addressed, with methods presented for VLSI layout area minimization, fast and distributed reconfiguration, efficient transfer of state information for software recovery, and the efficient use of local spares.

Index Terms:
fault-tolerant multiprocessors; node-covering design; covering graphs; incremental design; VLSI layout area minimization; distributed reconfiguration; state information; software recovery; local spares; circuit layout CAD; computational complexity; fault tolerant computing; graph theory; multiprocessing systems; parallel algorithms; VLSI.
S. Dutt, J.P. Hayes, "Some Practical Issues in the Design of Fault-Tolerant Multiprocessors," IEEE Transactions on Computers, vol. 41, no. 5, pp. 588-598, May 1992, doi:10.1109/12.142685