Issue No.06 - June (2009 vol.20)
pp: 778-787
Achour Mostefaoui, IRISA, Université de Rennes, Campus de Beaulieu, France
Michel Raynal, IRISA, Université de Rennes, Campus de Beaulieu, France
Gilles Tredan, IRISA, Université de Rennes, Campus de Beaulieu, France
ABSTRACT
It is well known that in an asynchronous system where processes are prone to crash, it is impossible to design a protocol that provides each process with the set of processes that are currently alive. This follows from the fact that a crashed process cannot be distinguished from a process that is very slow or whose communication channels are very slow. Nevertheless, designing protocols that provide the processes with good approximations of the set of processes that are currently alive remains a real challenge in fault-tolerant distributed computing. This paper proposes such a protocol, plus a second protocol that copes with heterogeneous communication networks. These protocols consider a realistic computation model where the processes are provided with nonsynchronized local clocks and a function $\alpha()$ that takes a local duration $\Delta$ as a parameter and returns an integer that is an estimate of the number of processes that could have crashed during that duration $\Delta$. A simulation-based experimental evaluation of the proposed protocols is also presented. These experiments show that the protocols are practically relevant.
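To make the role of the estimation function concrete, the following is a minimal sketch of one way a function taking a local duration and returning an integer crash estimate could be built. It is not the paper's definition: the factory name `make_alpha`, the parameters `n` and `crash_rate`, and the independent, constant-rate crash model are all assumptions introduced here for illustration.

```python
import math
from typing import Callable


def make_alpha(n: int, crash_rate: float) -> Callable[[float], int]:
    """Build a hypothetical alpha(delta) for a system of n processes.

    Assumption (not from the paper): each process crashes independently
    at a constant rate `crash_rate` (crashes per local time unit), so the
    probability that a given process crashes within a duration delta is
    1 - exp(-crash_rate * delta).
    """
    if n < 0 or crash_rate < 0:
        raise ValueError("n and crash_rate must be non-negative")

    def alpha(delta: float) -> int:
        """Estimate how many processes could have crashed during delta."""
        if delta < 0:
            raise ValueError("delta must be non-negative")
        p_crash = 1.0 - math.exp(-crash_rate * delta)
        # Round up so the estimate errs on the side of more crashes,
        # and never exceed the total number of processes.
        return min(n, math.ceil(n * p_crash))

    return alpha


# Usage: alpha depends only on a local duration, so nonsynchronized
# local clocks are sufficient to supply its argument.
alpha = make_alpha(n=100, crash_rate=0.01)
estimate = alpha(10.0)
```

Under this assumed model the estimate is nondecreasing in the duration and bounded by the number of processes. A deployment would replace the crash model with whatever coverage assumption fits the target system.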
INDEX TERMS
Approximation protocol, asynchronous system, coverage assumption, crash failure, crash detection, fault-tolerance, message passing, nonsynchronized local clocks.
CITATION
Achour Mostefaoui, Michel Raynal, Gilles Tredan, "On the Fly Estimation of the Processes that Are Alive in an Asynchronous Message-Passing System", IEEE Transactions on Parallel & Distributed Systems, vol.20, no. 6, pp. 778-787, June 2009, doi:10.1109/TPDS.2009.12