Location of a Faulty Module in a Computing System
February 1990 (vol. 39 no. 2)
pp. 182-194

Considering the interplay between different phases of fault tolerance, a new problem of locating a faulty module in a computing system is formulated and solved. First, the probability of each module being faulty, or faulty probability, is calculated using the likelihood principle from the model parameters for fault detection, diagnostics, error propagation, and error detection. Then, based on the faulty probabilities and a given required diagnostic coverage, the order in which modules are to be diagnosed and the maximum time allotted to diagnose each module are determined by minimizing the average total diagnostic time. An example is presented and analyzed to answer the question of whether or not a system should delay the diagnosis upon detection of an error until more errors are detected.
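
The two steps described above can be illustrated with a deliberately simplified sketch. It is not the paper's model: it assumes exactly one faulty module, diagnostics that always find the fault when run on the faulty module (perfect coverage), and a fixed diagnostic time per module, so it omits the required diagnostic coverage and the per-module time limits that the paper optimizes. All names and numbers (`priors`, `likelihoods`, `times`) are hypothetical. Under these assumptions, Bayes' rule gives the faulty probabilities, and diagnosing modules in decreasing order of probability per unit diagnostic time minimizes the average time to locate the fault.

```python
from itertools import permutations


def faulty_probabilities(priors, likelihoods):
    """Posterior probability that each module is the faulty one.

    priors[i]      -- assumed prior probability that module i is faulty
    likelihoods[i] -- assumed probability of the observed error pattern
                      given that module i is the faulty one
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    if total == 0:
        raise ValueError("observation is impossible under every hypothesis")
    return [j / total for j in joint]


def expected_time(order, probs, times):
    """Average time until the fault is located when diagnosing in `order`."""
    elapsed = 0.0
    expected = 0.0
    for i in order:
        elapsed += times[i]            # module i's diagnosis finishes here
        expected += probs[i] * elapsed
    return expected


def diagnosis_order(probs, times):
    """Order modules by decreasing probability per unit diagnostic time."""
    return sorted(range(len(probs)),
                  key=lambda i: probs[i] / times[i],
                  reverse=True)


# Hypothetical three-module system.
priors = [0.2, 0.5, 0.3]
likelihoods = [0.9, 0.1, 0.4]
times = [4.0, 1.0, 2.0]

probs = faulty_probabilities(priors, likelihoods)
order = diagnosis_order(probs, times)

# Sanity check against exhaustive search over all orderings.
brute = min(permutations(range(len(probs))),
            key=lambda o: expected_time(o, probs, times))

print("faulty probabilities:", [round(p, 3) for p in probs])
print("diagnosis order:", order)
print("matches brute force:", list(brute) == order)
```

The ratio rule is used here only because it is provably optimal for this simplified setting; once coverage is imperfect or diagnosis may be cut off after a time limit, the ordering and the time allotments have to be chosen jointly, which is the problem the paper addresses.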

Index Terms:
faulty module; computing system; fault tolerance; probability; likelihood principle; model parameters; error propagation; error detection; fault tolerant computing.
Citation:
T.-H. Lin, K.G. Shin, "Location of a Faulty Module in a Computing System," IEEE Transactions on Computers, vol. 39, no. 2, pp. 182-194, Feb. 1990, doi:10.1109/12.45204