A Time Redundancy Approach to TMR Failures Using Fault-State Likelihoods
October 1994 (vol. 43 no. 10)
pp. 1151-1162

A TMR failure, that is, the failure to establish a majority among the processing modules of a triple modular redundant (TMR) system, is detected by using two voters and a disagreement detector. Assuming that no more than one module becomes permanently faulty during the execution of a task, Re-execution of the task on the Same HardWare (RSHW) upon detection of a TMR failure becomes a cost-effective recovery method, because 1) the TMR system can mask the effects of one faulty module while RSHW can recover from nonpermanent faults, and 2) system reconfiguration (Replace the faulty HardWare, reload, and Restart, or RHWR) is expensive in both time and hardware. We propose an adaptive recovery method for TMR failures that "optimally" chooses either RSHW or RHWR based on an estimate of the costs involved. We apply Bayes' theorem to update the likelihoods of all possible states of the TMR system with each voting result. Upon detection of a TMR failure, the expected cost of RSHW is derived from these likelihoods and then compared with that of RHWR. RSHW continues either until it recovers from the TMR failure or until the expected cost of RSHW becomes larger than that of RHWR. As the number of unsuccessful RSHW attempts increases, the probability that permanent fault(s) caused the TMR failure increases, which in turn increases the expected cost of RSHW. Our simulation results show that the proposed method outperforms the conventional reconfiguration method, which uses only RHWR, under various conditions.
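
The decision rule described in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's model: the paper tracks likelihoods for all possible states of the TMR system, whereas this sketch assumes only two hypotheses (the TMR failure was caused by a permanent fault or by a nonpermanent one), assumes that a retry always fails under a permanent fault, and uses a one-step expected cost. The function names, the parameter `q_nonperm_fail`, and the cost values are hypothetical and chosen only for illustration.

```python
def update_permanent_likelihood(p_perm, q_nonperm_fail):
    """Bayes update of P(permanent fault) after one more failed RSHW.

    Assumptions (not from the paper): a retry fails with probability 1
    under a permanent fault and with probability q_nonperm_fail under a
    nonpermanent fault. The evidence must be nonzero.
    """
    evidence = p_perm * 1.0 + (1.0 - p_perm) * q_nonperm_fail
    return p_perm / evidence


def expected_rshw_cost(p_perm, q_nonperm_fail, c_rshw, c_rhwr):
    """One-step expected cost of trying RSHW once more.

    Assumption: if the retry fails, we pay for RHWR afterwards.
    """
    p_fail = p_perm + (1.0 - p_perm) * q_nonperm_fail
    return c_rshw + p_fail * c_rhwr


def retries_before_rhwr(p_perm, q_nonperm_fail, c_rshw, c_rhwr, max_retries=100):
    """Count the RSHW attempts made if every attempt fails.

    RSHW is repeated while its expected cost stays below the RHWR cost;
    each failed attempt raises the likelihood of a permanent fault.
    """
    retries = 0
    while retries < max_retries:
        if expected_rshw_cost(p_perm, q_nonperm_fail, c_rshw, c_rhwr) >= c_rhwr:
            break  # RHWR is now expected to be cheaper
        retries += 1
        p_perm = update_permanent_likelihood(p_perm, q_nonperm_fail)
    return retries


if __name__ == "__main__":
    # Hypothetical numbers: 10% prior for a permanent fault, a nonpermanent
    # fault survives a retry 20% of the time, RHWR costs 10x one RSHW.
    print(retries_before_rhwr(0.1, 0.2, c_rshw=1.0, c_rhwr=10.0))
```

Under these assumptions each unsuccessful retry shifts probability mass toward the permanent-fault hypothesis, so the expected cost of a further retry grows until reconfiguration becomes the cheaper choice, which mirrors the qualitative behavior the abstract describes.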

Index Terms:
fault tolerant computing; redundancy; Bayes methods; digital simulation; time redundancy approach; TMR failures; fault-state likelihoods; processing modules; triple modular redundant system; voters; disagreement detector; system reconfiguration; adaptive recovery method; Bayes theorem; simulation results.
K.G. Shin, Hagbae Kim, "A Time Redundancy Approach to TMR Failures Using Fault-State Likelihoods," IEEE Transactions on Computers, vol. 43, no. 10, pp. 1151-1162, Oct. 1994, doi:10.1109/12.324541