On the Implementation of Unreliable Failure Detectors in Partially Synchronous Systems
July 2004 (vol. 53 no. 7)
pp. 815-828

Abstract: Unreliable failure detectors were proposed by Chandra and Toueg as mechanisms that provide information about process failures. Chandra and Toueg defined eight classes of failure detectors, depending on how accurate this information is, and presented an algorithm implementing a failure detector of one of these classes in a partially synchronous system. This algorithm is based on all-to-all communication and periodically exchanges a number of messages that is quadratic in the number of processes. In this paper, we study the implementability of different classes of failure detectors in several models of partial synchrony. We first show that no failure detector with perpetual accuracy (namely, $\mathcal{P}$, $\mathcal{Q}$, $\mathcal{S}$, and $\mathcal{W}$) can be implemented in these models in systems with even a single failure. We also show that, in these models of partial synchrony, a majority of correct processes is necessary to implement a failure detector of the class $\Theta$ proposed by Aguilera et al. Then, we present a family of distributed algorithms that implement the four classes of unreliable failure detectors with eventual accuracy (namely, $\Diamond\mathcal{P}$, $\Diamond\mathcal{Q}$, $\Diamond\mathcal{S}$, and $\Diamond\mathcal{W}$). Our algorithms are based on a logical ring arrangement of the processes, which defines the monitoring and failure information propagation pattern. The resulting algorithms periodically exchange at most a linear number of messages.
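To make the message-complexity contrast in the abstract concrete, the following Python sketch compares an all-to-all heartbeat scheme, in which each of the n processes sends to every other process once per period, with a ring pattern in which each process monitors a single target. It is an illustration under stated assumptions, not the authors' algorithm: the helper names (`all_to_all_messages`, `ring_target`, `ring_messages`), the choice of "the next unsuspected process in increasing index order" as the monitoring target, and the count of one message per monitoring pair per period are hypothetical simplifications. The paper's algorithms also propagate failure information around the ring, which is not modeled here.

```python
from typing import Optional, Set


def all_to_all_messages(n: int) -> int:
    """Messages per period when every process sends a heartbeat to every other."""
    return n * (n - 1)


def ring_target(i: int, n: int, suspected: Set[int]) -> Optional[int]:
    """Hypothetical helper: the first process after i in the logical ring
    (i+1, i+2, ... modulo n) that process i does not currently suspect.
    Returns None if i suspects every other process."""
    for step in range(1, n):
        j = (i + step) % n
        if j not in suspected:
            return j
    return None


def ring_messages(n: int, crashed: Set[int]) -> int:
    """Messages per period if each live process sends one message to its ring
    target. Assumption for illustration: every live process suspects exactly
    the crashed processes (i.e., suspicions are accurate and up to date)."""
    count = 0
    for i in range(n):
        if i in crashed:
            continue  # crashed processes send nothing
        if ring_target(i, n, crashed) is not None:
            count += 1  # one message from i to its target
    return count


if __name__ == "__main__":
    n = 8
    print("all-to-all:", all_to_all_messages(n))
    print("ring, no crashes:", ring_messages(n, set()))
    print("ring, {3, 4} crashed:", ring_messages(n, {3, 4}))
    print("target of 2 with {3, 4} suspected:", ring_target(2, n, {3, 4}))
```

With 8 processes and no crashes, the ring pattern sends 8 messages per period against 56 for all-to-all; if processes 3 and 4 have crashed and are suspected, process 2 skips them and monitors process 5. A request/reply variant would double the ring count, which is still linear in n.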

[1] M. Aguilera, C. Delporte-Gallet, H. Fauconnier, and S. Toueg, "Stable Leader Election," Proc. 15th Int'l Symp. Distributed Computing (DISC '2001), pp. 108-122, Oct. 2001.
[2] M. Aguilera, S. Toueg, and B. Deianov, "Revisiting the Weakest Failure Detector for Uniform Reliable Broadcast," Proc. 13th Int'l Symp. Distributed Computing (DISC '99), pp. 19-33, Sept. 1999.
[3] E. Anceaume, A. Fernández, A. Mostefaoui, G. Neiger, and M. Raynal, "A Necessary and Sufficient Condition for Transforming Limited Accuracy Failure Detectors," J. Computer and System Sciences, vol. 68, no. 1, pp. 123-133, Feb. 2004.
[4] M. Bertier, O. Marin, and P. Sens, "Implementation and Performance Evaluation of an Adaptable Failure Detector," Proc. IEEE Int'l Conf. Dependable Systems and Networks (DSN '2002), pp. 354-363, June 2002.
[5] T.D. Chandra, V. Hadzilacos, and S. Toueg, "The Weakest Failure Detector for Solving Consensus," J. ACM, vol. 43, no. 4, pp. 685-722, July 1996.
[6] T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, pp. 225-267, Mar. 1996.
[7] W. Chen, S. Toueg, and M.K. Aguilera, "On the Quality of Service of Failure Detectors," IEEE Trans. Computers, vol. 51, no. 5, pp. 561-580, 2002.
[8] C. Delporte-Gallet, H. Fauconnier, and R. Guerraoui, "A Realistic Look at Failure Detectors," Proc. IEEE Int'l Conf. Dependable Systems and Networks (DSN '2002), pp. 345-352, June 2002.
[9] D. Dolev, C. Dwork, and L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus," J. ACM, vol. 34, no. 1, pp. 77-98, Jan. 1987.
[10] C. Dwork, N. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," J. ACM, vol. 35, no. 2, pp. 288-323, Apr. 1988.
[11] C. Fetzer, M. Raynal, and F. Tronel, "An Adaptive Failure Detection Protocol," Proc. Eighth IEEE Pacific Rim Int'l Symp. Dependable Computing (PRDC '2001), pp. 146-153, Dec. 2001.
[12] M. Fischer, N. Lynch, and M. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.
[13] R. Guerraoui, M. Larrea, and A. Schiper, "Non-Blocking Atomic Commitment with an Unreliable Failure Detector," Proc. 14th Symp. Reliable Distributed Systems (SRDS '95), pp. 41-51, Sept. 1995.
[14] R. Guerraoui and A. Schiper, "$\Gamma$-Accurate Failure Detectors," Proc. 10th Int'l Workshop Distributed Algorithms (WDAG '96), pp. 269-286, Oct. 1996.
[15] I. Gupta, T.D. Chandra, and G. Goldszmidt, "On Scalable and Efficient Distributed Failure Detectors," Proc. 20th Ann. ACM Symp. Principles of Distributed Computing (PODC '2001), pp. 170-179, Aug. 2001.
[16] M. Larrea, S. Arévalo, and A. Fernández, "Efficient Algorithms to Implement Unreliable Failure Detectors in Partially Synchronous Systems," Proc. 13th Int'l Symp. Distributed Computing (DISC '99), pp. 34-48, Sept. 1999.
[17] M. Larrea, A. Fernández, and S. Arévalo, "Optimal Implementation of the Weakest Failure Detector for Solving Consensus," Proc. 19th IEEE Symp. Reliable Distributed Systems (SRDS '2000), pp. 52-59, Oct. 2000.
[18] M. Larrea, A. Fernández, and S. Arévalo, "Eventually Consistent Failure Detectors," Proc. 10th Euromicro Workshop Parallel, Distributed, and Network-Based Processing (PDP '2002), pp. 91-98, Jan. 2002. Also in Brief Announcements of the 14th Int'l Symp. Distributed Computing (DISC '2000), Oct. 2000.
[19] A. Mostefaoui, E. Mourgaya, and M. Raynal, "Asynchronous Implementation of Failure Detectors," Technical Report 1484, Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Sept. 2002.
[20] A. Mostefaoui and M. Raynal, "Unreliable Failure Detectors with Limited Scope Accuracy and an Application to Consensus," Proc. 19th Int'l Conf. Foundations of Software Technology and Theoretical Computer Science (FST&TCS '99), pp. 329-340, Dec. 1999.
[21] A. Mostefaoui and M. Raynal, "k-Set Agreement and Limited Accuracy Failure Detectors," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC '2000), pp. 143-152, July 2000.
[22] M. Pease, R. Shostak, and L. Lamport, "Reaching Agreement in the Presence of Faults," J. ACM, vol. 27, no. 2, pp. 228-234, Apr. 1980.
[23] M. Raynal and F. Tronel, "Group Membership Failure Detection: A Simple Protocol and Its Probabilistic Analysis," Distributed Systems Eng. J., vol. 6, no. 3, pp. 95-102, 1999.
[24] M. Raynal and F. Tronel, "Restricted Failure Detectors: Definition and Reduction Protocols," Information Processing Letters, vol. 72, pp. 91-97, 1999.
[25] J. Yang, G. Neiger, and E. Gafni, "Structured Derivations of Consensus Algorithms for Failure Detectors," Proc. 17th Ann. ACM Symp. Principles of Distributed Computing (PODC '98), pp. 297-308, July 1998.

Index Terms:
Consensus problem, crash failures, distributed systems, failure detection, partial synchrony, unreliable failure detectors.
Citation:
Mikel Larrea, Antonio Fernández, Sergio Arévalo, "On the Implementation of Unreliable Failure Detectors in Partially Synchronous Systems," IEEE Transactions on Computers, vol. 53, no. 7, pp. 815-828, July 2004, doi:10.1109/TC.2004.33