Pacific Rim International Symposium on Dependable Computing, IEEE (2001)
Dec. 17, 2001 to Dec. 19, 2001
The detection of process failures is a crucial problem that system designers must cope with in order to build fault-tolerant distributed platforms. Unfortunately, in a purely asynchronous distributed system it is impossible to distinguish with certainty a crashed process from a very slow one. This prevents some problems from being solved in such systems, which is why failure detector oracles have been introduced to circumvent these impossibility results. This paper presents a relatively simple protocol that allows a process to "monitor" another process and, consequently, to detect its crash. The protocol has the nice property of relying as much as possible on application messages to do this monitoring. Unlike previous process crash detection protocols, it uses control messages only when no application message is sent by the monitoring process to the observed process. The protocol has noteworthy features. When the underlying system satisfies the partial synchrony assumption, it actually implements an eventually perfect failure detector (i.e., a failure detector of the class usually denoted ◇P). Moreover, if the average observed transmission delay is finite and the upper-layer application terminates within a bounded number of steps for any failure detector in ◇P after the failure detector becomes "perfect", then it also terminates correctly when run with the proposed protocol. These properties make the protocol attractive: it is inexpensive, implementable, and powerful. The paper also describes performance measurements of an implementation of the protocol.
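The monitoring idea described in the abstract can be illustrated with a minimal sketch. This is not the protocol from the paper: the class name `Monitor`, its fields, the initial timeout, the idle period, and the rule of raising the timeout to the observed round-trip delay are all assumptions chosen for illustration. It only shows the general shape: application messages double as probes, a control message is sent only after a period of silence, and a late acknowledgement both clears the suspicion and enlarges the timeout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Monitor:
    """Hypothetical monitor of one observed process (illustrative only)."""
    timeout: float = 1.0       # assumed initial bound on the round-trip delay
    idle_period: float = 0.5   # assumed silence after which a control message is sent
    suspected: bool = False
    control_sent: int = 0
    _next_seq: int = field(default=0, init=False)
    _last_send: float = field(default=0.0, init=False)
    _pending: Dict[int, float] = field(default_factory=dict, init=False)  # seq -> send time

    def _send(self, now: float) -> int:
        # Record an outgoing message that the observed process must acknowledge.
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = now
        self._last_send = now
        return seq

    def send_application(self, now: float) -> int:
        """An application message to the observed process doubles as a probe."""
        return self._send(now)

    def tick(self, now: float) -> Optional[int]:
        """Periodic step: update the suspicion, probe only if the application is silent."""
        if any(now - sent > self.timeout for sent in self._pending.values()):
            self.suspected = True
        if now - self._last_send >= self.idle_period:
            self.control_sent += 1
            return self._send(now)  # control message, sequence number returned
        return None

    def on_ack(self, seq: int, now: float) -> None:
        """Any acknowledgement shows the observed process was alive when it replied."""
        sent = self._pending.pop(seq, None)
        if sent is None:
            return
        if now - sent > self.timeout:
            # The suspicion was premature: adapt the timeout (assumed rule).
            self.timeout = now - sent
        # Assumption: an ack for seq also covers all earlier pending messages.
        self._pending = {s: t for s, t in self._pending.items() if s > seq}
        self.suspected = False
```

In this sketch the clock is passed in explicitly (`now`) so the behaviour is deterministic and easy to test. A crashed process never acknowledges, so it stays suspected, while a slow but correct process causes the timeout to grow until it is no longer wrongly suspected, which is the intuition behind an eventually perfect detector under partial synchrony.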
Christof Fetzer, Michel Raynal, Frederic Tronel, "An Adaptive Failure Detection Protocol", Pacific Rim International Symposium on Dependable Computing, IEEE, vol. 00, no. , pp. 146, 2001, doi:10.1109/PRDC.2001.992691