This Article 
 Bibliographic References 
 Add to: 
Formally Verified On-Line Diagnosis
November 1997 (vol. 23 no. 11)
pp. 684-721

Abstract—A reconfigurable fault tolerant system achieves the attributes of dependability of operations through fault detection, fault isolation and reconfiguration, typically referred to as the FDIR paradigm. Fault diagnosis is a key component of this approach, requiring an accurate determination of the health and state of the system. An imprecise state assessment can lead to catastrophic failure due to an optimistic diagnosis, or conversely, result in underutilization of resources because of a pessimistic diagnosis. Differing from classical testing and other off-line diagnostic approaches, we develop procedures for maximal utilization of the system state information to provide for continual, on-line diagnosis and reconfiguration capabilities as an integral part of the system operations. Our diagnosis approach, unlike existing techniques, does not require administered testing to gather syndrome information but is based on monitoring the system message traffic among redundant system functions. We present comprehensive on-line diagnosis algorithms capable of handling a continuum of faults of varying severity at the node and link level. Not only are the proposed algorithms on-line in nature, but are themselves tolerant to faults in the diagnostic process. Formal analysis is presented for all proposed algorithms. These proofs offer both insight into the algorithm operations and facilitate a rigorous formal verification of the developed algorithms.

[1] M.A. Barborak, M. Malek, and A.T. Dahbura, "The Consensus Problem in Fault-Tolerant Computing," ACM Computer Surveys, vol. 25, pp. 171-220, June 1993.
[2] W. Bevier and W. Young, "Machine-Checked Proofs of a Byzantine Agreement Algorithm," Proc. TR 55, CLI, Austin, Texas, June 1990.
[3] R. Boyer and J. Moore, "MJRTY—A Fast Majority Vote Algorithm," R.S. Boyer, ed., vol. 1, Automated Reasoning Series, pp. 105-117. Kluwer, 1991.
[4] D. Dolev and H. Strong, "Authenticated Algorithms for Byzantine Algorithms," SIAM J. Computing, vol. 12, pp. 656-666, Nov 1983.
[5] L. Gong, P. Lincoln, and J. Rushby, "Byzantine Agreement with Authentication: Observations and Applications in Tolerating Hybrid and Link Faults," Proc. DCCA-5, 1995.
[6] F. Henke et al., "The EHDMVerification Environment: An Overview," Proc. 11th Nat'l Computer Security Conf., pp. 147-155, Oct. 1988.
[7] G.J. Holzmann, Design and Validation of Computer Protocols, Prentice Hall, 1991.
[8] R.M. Kieckhafer,C.J. Walter,A.M. Finn, and P.M. Thambidurai,"The MAFT Architecture for Distributed Fault-Tolerance," IEEE Trans. Computers, vol. 37, no. 4, pp. 398-405, Apr. 1988.
[9] J.G. Kuhl and S.M. Reddy, "Distributed Fault Tolerance for Large Multiprocessor Systems," Proc. 1980 Computer ArchitectureSymp., pp. 222-229, May 1980.
[10] J. Kuhl and S. Reddy, "Fault-Diagnosis in Fully Distributed Systems," Proc. FTCS-11, pp. 100-105, 1981.
[11] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," ACM Trans. Programming Languages and Systems, vol. 4, no. 3, July 1982, pp. 382-401.
[12] L. Lamport and P.M. Melliar-Smith, “Synchronizing Clocks in the Presence of Faults,” J. ACM, vol. 32, no. 1, pp. 52–78, Jan. 1985.
[13] P. Lincoln and J. Rushby, “A Formally Verified Algorithm for Interactive Consistency Under a Hybrid Fault Model,” Proc. Fault Tolerant Computing Symp. 23, pp. 402–411, Toulouse, France, June 1993.
[14] P. Lincoln and J. Rushby, “Formal Verification of an Interactive Consistency Algorithm for the Draper FTP Architecture Under a Hybrid Fault Model,” Proc. Ninth Ann. Conf. Computer Assurance, pp. 107–120, Gaithersburg, Md., June 1994.
[15] S. Mallela and G. Masson, "Diagnosis without Repair for Hybrid Fault Situations," IEEE Trans. Computers, vol. 29, pp. 461-470, June 1980.
[16] F.J. Meyer and D.K. Pradhan, "Consensus With Dual Failure Modes," IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 2, pp. 214-222, Apr. 1991.
[17] S. Owre, J. Rushby, N. Shankar, and F. von Henke, "Formal Verification for Fault-Tolerant Architectures: Prolegomena to the Design of PVS," IEEE Trans. Software Eng., vol. 21, pp. 107-125, Feb. 1995.
[18] S. Owre et al., "The PVS Specification Language," TR CSL-SRI, Feb. 1993.
[19] S. Owre et al., "User Guide to PVS Specification and Verification System," TR CSL-SRI, Feb. 1993.
[20] M. Pease, R. Shostak, and L. Lamport, “Reaching Agreement in the Presence of Faults,” J. ACM, vol. 27, no. 2, pp. 228–234, Apr. 1980.
[21] F. Preparata, G. Metze, and R. Chien, "On the Connection Assignment Problem of Diagnosable Systems," IEEE Trans. Computers, vol. 16, no. 6, pp. 848-854, Dec. 1967.
[22] K. Ramarao and J. Adams, "On the Diagnosis of Byzantine Faults," Seventh Symp. Reliable Distributed Systems, pp. 144-153, 1988.
[23] J. Rushby, "Reconfiguration and Transient Recovery in State m/c Architectures," Proc. FTCS-26, pp. 6-25, 1996.
[24] J. Rushby, "Formal Verification of an Oral Messages Algorithm for Interactive Consistency," TR SRI-CSL-92-1, CSL, SRI 1992.
[25] J. Rushby, “A Formally Verified Algorithm for Clock Synchronization under a Hybrid Fault Model,” Proc. 13th ACM Symp. Principles of Distributed Computing, pp. 304–313, Los Angeles, Calif., Aug. 1994. Also available as NASA Contractor Report 198289
[26] J. Rushby and F. Henke, "Formal Verification of the Interactive Convergence Clock Synchronization Algorithm Using EHDM," TR SRI-CSL-89-3R, CSL, SRI, 1991.
[27] J. Rushby and F. von Henke, "Formal Verification of Algorithms for Critical Systems," IEEE Trans. Software Engineering, vol. 19, no. 1, pp. 13-23, Jan. 1993.
[28] A. Sengupta and A. Dahbura, “On Self-Diagnosable Multiprocessor Systems: Diagnosis by the Comparison Approach,” IEEE Trans. Computers, vol. 41, no. 11, pp. 1386-1396, Nov. 1992.
[29] A. Sengupta et al., "On System Diagnosability in the Presence of Hybrid Faults," IEEE Trans. Computers, vol. 35, pp. 90-93, Jan. 1986.
[30] N. Shankar, “Mechanical Verification of a Generalized Protocol for Byzantine Fault-Tolerant Clock Synchronization,” pp. 217-236, Jan. 1992.
[31] K. Shin and P. Ramanathan, "Diagnosis of Processors with Byzantine Faults in a Distributed Computing System," Proc. FTCS-17, pp. 55-60, 1987.
[32] N. Suri,M. Hugue, and C. Walter,"Fault Classification and Distribution Effects on the Reliability Modeling of Large Fault-Tolerant Systems," Proc. 22nd Fault-Tolerant Computing Symp., pp. 212-220, July 1992.
[33] N. Suri, C. Walter, and M. Hugue, Advances in Ultra-Dependable Distributed Systems. IEEE CS Press, 1995.
[34] P.M. Thambidurai and Y.K. Park,"Interactive Consistency with Multiple Failure Modes," Proc. seventh Reliable Dist. Systems Symp., Oct. 1988.
[35] C.J. Walter, “Identifying the Cause of Detected Errors,” Proc. Int'l Symp. Fault-Tolerant Computing 20, pp. 48-55, 1990.
[36] C. Walter, N. Suri, and M. Hugue, "Continual On-Line Diagnosis of Hybrid Faults," Proc. DCCA-4, Jan. 1994.
[37] C.J. Walter et al., "MAFT: A Multicomputer Architecture for Fault-Tolerance in Real-Time Control Systems," Proc. RTSS, 1985.
[38] C. Yang and G. Masson, "Hybrid Fault Diagnosability with Unreliable Communication Links," Proc. FTCS-16, pp. 226-231, 1986.
[39] C. Yang and G. Masson, "A New Measure for Hybrid Fault Diagnosability," IEEE Trans. Computers, vol. 36, no. 3, pp. 378-382, 1987.
[40] W. Young, "Verifying the Interactive Convergence Clock-Synchronization Algorithm Using the Boyer-Moore Prover," NASA TR 189649, NASA LARC, Apr. 1992.

Index Terms:
Fault diagnosis, on-line, fault handling, formal methods.
Chris J. Walter, Patrick Lincoln, Neeraj Suri, "Formally Verified On-Line Diagnosis," IEEE Transactions on Software Engineering, vol. 23, no. 11, pp. 684-721, Nov. 1997, doi:10.1109/32.637385
Usage of this product signifies your acceptance of the Terms of Use.