Fault-Tolerant Design Strategies for High Reliability and Safety
October 1993 (vol. 42 no. 10)
pp. 1195-1206

Several fundamental results related to reliability and safety are analyzed. Modular redundant systems consisting of multiple identical modules and an arbiter are considered. It is shown that, for a given level of redundancy, a large number of implementation alternatives exist with varying degrees of reliability and safety. Strategies are formulated that achieve a maximal combination of reliability and safety. The effect of increasing the number of modules on system reliability and safety is analyzed. It is shown that, when safety is considered in addition to reliability, simply adding modules to the system does not necessarily help. Specifically, increasing the number of modules by just one does not always improve both reliability and safety. When the outputs of the individual modules carry no redundant information (e.g., coding for error detection), at least two additional modules are required to improve reliability and safety simultaneously. However, it is shown that if the modules themselves have built-in error detection capability, the addition of just one module may be sufficient to improve both.
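
The trade-off described above can be illustrated with a small numerical sketch. It is not the paper's analysis, only a simplified model under assumptions made here for illustration: each of n modules is correct independently with probability p, all faulty modules agree on the same wrong value (a worst case), the arbiter is perfect, and it releases a value only when at least k modules agree on it (k > n/2), otherwise signalling a safe shutdown. The names `binom_tail` and `reliability_and_safety` are hypothetical. Reliability is taken as the probability of a correct output, and safety as the probability that no wrong output is released.

```python
from math import comb


def binom_tail(n, p, k):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(max(k, 0), n + 1))


def reliability_and_safety(n, k, p):
    """Reliability and safety of an n-module system with a k-agreement arbiter.

    Assumed model (illustrative, not taken from the paper):
      - modules are correct independently with probability p;
      - all faulty modules produce the same wrong value (worst case);
      - the arbiter outputs a value only if at least k modules agree on it,
        and otherwise signals a safe failure.
    """
    if not (n / 2 < k <= n):
        raise ValueError("k must satisfy n/2 < k <= n")
    # Correct output: at least k modules are correct.
    reliability = binom_tail(n, p, k)
    # Wrong output needs at least k faulty modules, i.e. at most n - k correct.
    # Safety is the complement: at least n - k + 1 modules are correct.
    safety = binom_tail(n, p, n - k + 1)
    return reliability, safety


if __name__ == "__main__":
    p = 0.9  # assumed per-module probability of a correct output
    for n in (3, 4, 5):
        for k in range(n // 2 + 1, n + 1):
            r, s = reliability_and_safety(n, k, p)
            print(f"n={n} k={k} reliability={r:.5f} safety={s:.5f}")
```

Under these assumptions, with p = 0.9, neither threshold available for four modules beats the three-module majority voter on both measures at once: k = 3 raises safety but lowers reliability, and k = 4 lowers reliability further. Five modules with k = 3 improve both, which is consistent with the abstract's statement that two additional modules are needed when module outputs carry no redundant information.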

Index Terms:
fault-tolerant design strategies; high reliability; safety; modular redundant systems; multiple identical modules; arbiter; built-in error detection; computer interfaces; error detection; fault tolerant computing; redundancy.
N.H. Vaidya, D.K. Pradhan, "Fault-Tolerant Design Strategies for High Reliability and Safety," IEEE Transactions on Computers, vol. 42, no. 10, pp. 1195-1206, Oct. 1993, doi:10.1109/12.257706