Design and Analysis of an Optimal Instruction-Retry Policy for TMR Controller Computers
November 1996 (vol. 45 no. 11)
pp. 1217-1225

Abstract: An instruction-retry policy is proposed to enhance the fault tolerance of triple modular redundant (TMR) controller computers by adding time redundancy to them. A TMR failure is said to occur if a TMR system fails to establish a majority among its modules' outputs because of multiple faulty modules or a faulty voter. A dynamic failure, that is, a failure to meet the timing constraints of one or more tasks during a given mission, may result either from multiple consecutive TMR failures whose active period exceeds a certain time limit or from the exhaustion of spares caused by frequent system reconfigurations. An optimal instruction-retry period, applied upon detection of either an error masked by the TMR or a TMR failure, is derived by minimizing the probability of dynamic failure. Using the derived optimal retry period, we also derive the minimum number of spares needed to keep the probability of dynamic failure for a given mission below a prespecified level.
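
The notions of majority voting, TMR failure, and retry used in the abstract can be illustrated with a minimal Python sketch. It is not the paper's policy: the function names (`majority_vote`, `run_with_retry`), the fixed retry budget `max_retries`, and the modelling of modules as plain callables are all assumptions made for illustration. The sketch retries only on a TMR failure, does not model a faulty voter or retry on masked errors, and does not compute the optimal retry period.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence, Tuple

Module = Callable[[int], Hashable]  # hypothetical model of one redundant module


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value agreed on by at least two of the three modules.

    Returns None when no majority exists, which corresponds to a TMR failure.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def run_with_retry(
    modules: Sequence[Module], operand: int, max_retries: int
) -> Tuple[Optional[Hashable], int]:
    """Execute one instruction on all modules, retrying on TMR failure.

    Returns (result, retries_used). A result of None means the retry budget
    (an assumed stand-in for the retry period) was exhausted, at which point
    a real system would fall back to reconfiguration with a spare.
    """
    for attempt in range(max_retries + 1):
        outputs = [module(operand) for module in modules]
        result = majority_vote(outputs)
        if result is not None:
            return result, attempt
    return None, max_retries
```

Under these assumptions, a transient fault that clears within the retry budget is recovered by time redundancy alone, while a longer-lived fault consumes the whole budget and forces reconfiguration, which is the trade-off the optimal retry period is meant to balance.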

Index Terms:
Real-time control systems, controller computer, internal and external faults, common-cause faults, TMR failures and masked errors, retry, reconfiguration, dynamic failure, hard deadlines.
Hagbae Kim, Kang G. Shin, "Design and Analysis of an Optimal Instruction-Retry Policy for TMR Controller Computers," IEEE Transactions on Computers, vol. 45, no. 11, pp. 1217-1225, Nov. 1996, doi:10.1109/12.544478