On the Quality of Service of Failure Detectors
January 2002 (vol. 51 no. 1)
pp. 13-32

Editor's Note: This paper unfortunately contains some errors, which led to it being reprinted in the May 2002 issue. Please see IEEE Transactions on Computers, vol. 51, no. 5, May 2002, pp. 561-580 for the correct paper (available without subscription).

We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies 1) how fast the failure detector detects actual failures and 2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present simulation results showing that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.
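
To make the two kinds of QoS metrics concrete, the following is a minimal simulation sketch, not the algorithm or the metric definitions from the paper. It assumes a simple heartbeat scheme: a monitored process q sends heartbeat i at time i*eta, and the monitoring process p trusts q during [tau_i, tau_{i+1}), with tau_i = i*eta + delta, only if it has received some heartbeat j >= i. The function name, the parameter names (eta, delta, loss_prob, mean_delay), the exponential delay model, and the reported quantities are all assumptions chosen for illustration. Since q never crashes in the simulation, every suspicion is a false detection.

```python
import math
import random


def simulate_detector(eta, delta, loss_prob, mean_delay, n=10_000, seed=0):
    """Simulate a heartbeat failure detector monitoring a process that never crashes.

    eta        -- interval between heartbeats (assumed parameter name)
    delta      -- shift of each freshness deadline past the send time (assumed)
    loss_prob  -- probability that a heartbeat is lost
    mean_delay -- mean of the exponential message delay (0 means no delay)
    """
    rng = random.Random(seed)

    # arrival[j]: time heartbeat j reaches p, or infinity if it is lost.
    arrival = []
    for j in range(n):
        if rng.random() < loss_prob:
            arrival.append(math.inf)
        else:
            delay = rng.expovariate(1.0 / mean_delay) if mean_delay > 0 else 0.0
            arrival.append(j * eta + delay)

    # Only heartbeats i .. i+window-1 can arrive before the next deadline.
    window = int(delta / eta) + 2
    intervals = n - window

    suspect_time = 0.0
    mistakes = 0
    trusting = True  # state of p just before the current deadline
    for i in range(intervals):
        tau = i * eta + delta
        first_fresh = min(arrival[i:i + window])
        # p suspects q from tau until a fresh heartbeat arrives,
        # or until the next deadline, whichever comes first.
        suspected_for = min(max(first_fresh - tau, 0.0), eta)
        if suspected_for > 0.0:
            if trusting:
                mistakes += 1  # a new false suspicion starts at this deadline
            suspect_time += suspected_for
        trusting = suspected_for < eta

    total = intervals * eta
    return {
        # Accuracy side: how well false detections are avoided.
        "query_accuracy": 1.0 - suspect_time / total,
        "mistake_rate": mistakes / total,
        # Speed side: if q crashes right after sending heartbeat i,
        # p suspects it by deadline tau_{i+1}, i.e. within eta + delta.
        "max_detection_time": eta + delta,
    }


if __name__ == "__main__":
    for delta in (0.5, 1.0, 2.0):
        print(delta, simulate_detector(eta=1.0, delta=delta,
                                       loss_prob=0.05, mean_delay=0.2))
```

In this sketch, increasing delta trades a slower worst-case detection time for fewer false suspicions, which is the kind of tension a QoS specification is meant to quantify.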

Index Terms:
Failure detectors, quality of service, fault tolerance, distributed algorithm, probabilistic analysis.
Wei Chen, Sam Toueg, Marcos Kawazoe Aguilera, "On the Quality of Service of Failure Detectors," IEEE Transactions on Computers, vol. 51, no. 1, pp. 13-32, Jan. 2002, doi:10.1109/12.980014