On the Quality of Service of Crash-Recovery Failure Detectors
July-September 2010 (vol. 7 no. 3)
pp. 271-283
Tiejun Ma, Imperial College London, London
Jane Hillston, University of Edinburgh, Edinburgh
Stuart Anderson, University of Edinburgh, Edinburgh
We model the probabilistic behavior of a system comprising a failure detector and a monitored crash-recovery target. We extend failure detectors to take account of failure recovery in the target system. This involves extending the quality of service (QoS) measures to include the recovery detection speed and the proportion of failures detected. We also extend the method for estimating the failure detector parameters needed to achieve a required QoS so that it can be used to configure the crash-recovery failure detector. We investigate the impact of the dependability of the monitored process on the QoS of our failure detector. Our analysis indicates that variation in the MTTF and MTTR of the monitored process can have a significant impact on the QoS of our failure detector. Simulations validate our theoretical results.

Index Terms:
Failure detectors, crash recovery, quality of service, availability, dependability, performance.
Tiejun Ma, Jane Hillston, Stuart Anderson, "On the Quality of Service of Crash-Recovery Failure Detectors," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 3, pp. 271-283, July-Sept. 2010, doi:10.1109/TDSC.2009.35