Optimal Diagnosis of Heterogeneous Systems with Random Faults
March 1998 (vol. 47 no. 3)
pp. 298-304

Abstract: We consider the problem of fault diagnosis in multiprocessor systems. Processors perform tests on one another; fault-free testers correctly identify the fault status of tested processors, while faulty testers can give arbitrary test results. Processors fail with arbitrary probabilities, and all failures are independent. The goal is to correctly identify the status of all processors, based on the set of test results. A diagnosis algorithm is optimal if it has the highest probability of correctness (reliability) among all deterministic diagnosis algorithms. We give a fast diagnosis algorithm and prove its optimality for arbitrary values of failure probabilities. This is the first optimal diagnosis given for systems with no assumptions on the behavior of faulty processors or on the values of failure probabilities.
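
The model above can be written down as follows. This is only a sketch of one possible formalization: the symbols $n$, $p_i$, $F$, $T$, and $\sigma$ are notation assumed here for illustration and do not come from the abstract. Processor $i$ fails with probability $p_i$, a configuration is the set $F$ of faulty processors, $T$ is the set of performed tests $(u,v)$, and $\sigma(u,v) \in \{0,1\}$ is the result reported by tester $u$ about $v$ (1 meaning "faulty"). Independence of failures gives the probability of a configuration, and the behavior of fault-free testers gives the condition under which a configuration could have produced a given set of test results.

```latex
\[
  \Pr[F] \;=\; \prod_{i \in F} p_i \prod_{i \notin F} (1 - p_i),
  \qquad F \subseteq \{1, \dots, n\},
\]
\[
  F \text{ is consistent with } \sigma
  \iff
  \text{for all } (u,v) \in T \text{ with } u \notin F:\;
  \sigma(u,v) = 1 \Leftrightarrow v \in F .
\]
```

Tests performed by processors in $F$ impose no constraint, since faulty testers may report anything.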

We also investigate locally optimal diagnosis algorithms: for any set of test results, they return the most probable configuration of faulty and fault-free processors that could yield it. We give a fast diagnosis algorithm that is always locally optimal. If all processors have failure probabilities smaller than $\frac{1}{2}$, a locally optimal diagnosis is proved to be optimal. However, if some processors have failure probabilities exceeding $\frac{1}{2}$, a locally optimal diagnosis need not have the highest reliability. We even give examples in which its reliability becomes arbitrarily small as the number of processors increases, while the optimal reliability remains constant.
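
To illustrate what "locally optimal" means, the following is a minimal brute-force sketch, not the fast algorithm of the paper, which the abstract does not describe. It enumerates all configurations, keeps those that could have produced the observed test results, and returns the most probable one. The function names, the encoding of test results as a dictionary mapping `(tester, tested)` to 0 or 1 (1 meaning "reported faulty"), and the zero-based processor numbering are all assumptions made for this example. The enumeration takes time exponential in the number of processors, so it is only suitable for tiny systems.

```python
from itertools import product


def config_probability(faulty, p):
    """Probability of a configuration under independent failures.

    faulty[i] is True if processor i is faulty; p[i] is its failure probability.
    """
    prob = 1.0
    for is_faulty, p_i in zip(faulty, p):
        prob *= p_i if is_faulty else 1.0 - p_i
    return prob


def is_consistent(faulty, tests):
    """Check whether a configuration could have yielded the test results.

    Fault-free testers must report the true status of the tested processor;
    faulty testers may report anything, so their results are not checked.
    """
    for (tester, tested), result in tests.items():
        if not faulty[tester] and result != int(faulty[tested]):
            return False
    return True


def locally_optimal_diagnosis(p, tests):
    """Return the most probable consistent configuration and its probability.

    Ties are broken in favor of the configuration enumerated first (an
    arbitrary choice made for this sketch).
    """
    best, best_prob = None, -1.0
    for faulty in product((False, True), repeat=len(p)):
        if is_consistent(faulty, tests):
            prob = config_probability(faulty, p)
            if prob > best_prob:
                best, best_prob = faulty, prob
    return best, best_prob


# Hypothetical three-processor system in which every processor tests the others.
p = [0.1, 0.2, 0.3]
tests = {(0, 1): 0, (1, 0): 0, (0, 2): 1, (1, 2): 1, (2, 0): 0, (2, 1): 1}
diagnosis, probability = locally_optimal_diagnosis(p, tests)
```

In this hypothetical example only two configurations are consistent with the results: processor 2 alone faulty (probability about 0.216) and all three faulty (probability about 0.006), so the sketch returns the former. The all-faulty configuration is consistent with every set of test results, so the search always returns some configuration.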

Index Terms:
Fault diagnosis, fault tolerance, random fault, test.
Citation:
Andrzej Pelc, "Optimal Diagnosis of Heterogeneous Systems with Random Faults," IEEE Transactions on Computers, vol. 47, no. 3, pp. 298-304, March 1998, doi:10.1109/12.660165