A fault-tolerant system is designed to provide sustained delivery of services despite encountered perturbations. The ability to accurately detect, diagnose, and recover from faults in an on-line manner (i.e., during system operation) constitutes an important aspect of fault tolerance. This FDIR process has two primary objectives: to consistently identify a faulty node so as to restrict its effect on system operations, and to support system recovery via isolation and reconfiguration of system resources to sustain ongoing operations. If FDIR is performed as an on-line procedure, it provides an effective resource-management capability: it responds promptly to the appearance and disappearance of faults, leaving only a short window during which the system is susceptible to subsequent fault accumulation.
Neeraj Suri, Marco Serafini, Andrea Bondavalli, "Online Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters", IEEE Transactions on Dependable and Secure Computing, vol. 4, no. , pp. 295-312, October-December 2007, doi:10.1109/TDSC.2007.70210
