Online Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters
October-December 2007 (vol. 4 no. 4)
pp. 295-312
A fault-tolerant system is designed to provide sustained delivery of services despite encountered perturbations. The ability to accurately detect, diagnose, and recover from faults in an online manner (i.e., during system operation) constitutes an important aspect of fault tolerance. This FDIR process has two primary objectives: to consistently identify a faulty node so as to restrict its effect on system operations, and to support system recovery via isolation and reconfiguration of system resources to sustain ongoing system operations. If FDIR is performed as an online procedure, it provides an effective resource management capability, responding promptly to the appearance and disappearance of faults and keeping short the window during which the system is susceptible to subsequent fault accumulation.

Citation:
Marco Serafini, Andrea Bondavalli, Neeraj Suri, "Online Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters," IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 4, pp. 295-312, Oct.-Dec. 2007, doi:10.1109/TDSC.2007.70210