Perfect Failure Detection in Timed Asynchronous Systems
February 2003 (vol. 52 no. 2)
pp. 99-112

Abstract: Perfect failure detectors can correctly decide whether a computer has crashed. However, it is impossible to implement a perfect failure detector in purely asynchronous systems. We show how to enforce perfect failure detection in timed asynchronous systems with hardware watchdogs. The two main system model assumptions are that 1) each computer can measure time intervals with a known maximum error and 2) each computer has a watchdog that crashes the computer unless the watchdog is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware. We also show that, in some systems, a hardware watchdog is not actually necessary to implement a perfect failure detector for process crash failures.
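
The two assumptions in the abstract can be combined in a lease-style scheme. The following is a minimal sketch of that idea, not the protocol from the paper: the names `safe_wait`, `Watchdog`, and `Monitor`, the drift bound `rho`, and the lease length are all hypothetical. It assumes every clock, including the watchdog timer, runs at a rate within [1 - rho, 1 + rho] of real time, and that the monitored computer refreshes its watchdog only when the monitor grants a lease.

```python
"""Lease-based crash detection sketch; all names here are hypothetical."""


def safe_wait(lease: float, rho: float) -> float:
    """Monitor-clock time after which a granted lease has surely expired.

    A lease of `lease` watchdog-clock units lasts at most lease / (1 - rho)
    in real time (slow watchdog clock). A fast monitor clock advances up to
    (1 + rho) units per real time unit, so it must count this many units.
    """
    if not 0 <= rho < 1:
        raise ValueError("rho must be in [0, 1)")
    return lease * (1 + rho) / (1 - rho)


class Watchdog:
    """Models a watchdog that crashes its computer unless it is refreshed."""

    def __init__(self) -> None:
        self.deadline = None  # local time at which the watchdog fires

    def refresh(self, request_sent_at: float, lease: float) -> None:
        # The lease is counted from the local time the request was SENT,
        # which is never later than the time the monitor received it.
        self.deadline = request_sent_at + lease

    def expired(self, now: float) -> bool:
        return self.deadline is not None and now >= self.deadline


class Monitor:
    """Reports a crash only once the last granted lease must have run out."""

    def __init__(self, lease: float, rho: float) -> None:
        self.lease = lease
        self.rho = rho
        self.sure_after = None  # local time after which a crash is certain

    def on_request(self, now: float) -> float:
        """Grant a lease for a request received at local time `now`."""
        self.sure_after = now + safe_wait(self.lease, self.rho)
        return self.lease

    def crashed(self, now: float) -> bool:
        # Before any lease is granted, nothing can be concluded.
        return self.sure_after is not None and now > self.sure_after
```

Under these assumptions the monitor never reports a crash while the monitored computer is still running: if a lease renewal is lost or delayed, the watchdog crashes the computer before the monitor's wait ends, which is how detection is enforced rather than merely guessed.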

Index Terms:
Perfect failure detection, crash failures, asynchronous distributed systems, timed asynchronous system model.
Citation:
Christof Fetzer, "Perfect Failure Detection in Timed Asynchronous Systems," IEEE Transactions on Computers, vol. 52, no. 2, pp. 99-112, Feb. 2003, doi:10.1109/TC.2003.1176979