loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Distributed Diagnosis in Dynamic Fault Environments
May 2004 (vol. 15 no. 5)
pp. 453-467

Abstract—The problem of distributed diagnosis in the presence of dynamic failures and repairs is considered. To address this problem, the notion of bounded correctness is defined. Bounded correctness is made up of three properties: bounded diagnostic latency, which ensures that information about state changes of nodes in the system reaches working nodes with a bounded delay, bounded start-up time, which guarantees that working nodes determine valid states for every other node in the system within bounded time after their recovery, and accuracy, which ensures that no spurious events are recorded by working nodes. It is shown that, in order to achieve bounded correctness, the rate at which nodes fail and are repaired must be limited. This requirement is quantified by defining a minimum state holding time in the system. Algorithm HeartbeatComplete is presented and it is proven that this algorithm achieves bounded correctness in fully-connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. A diagnosis algorithm for arbitrary topologies, known as Algorithm ForwardHeartbeat, is also presented. ForwardHeartbeat is shown to produce significantly shorter latency and state holding time than prior algorithms, which focused primarily on minimizing the number of tests at the expense of latency.

[1] M.K. Aguilera, W. Chen, and S. Toueg, Failure Detection and Consensus in the Crash-Recovery Model Distributed Computing, vol. 13, no. 2, pp. 99-125, 2000.
[2] A. Bagchi and S.L. Hakimi, "An Optimal Algorithm for Distributed System Level Diagnosis," Proc. IEEE CS 21st Int'l Symp. Fault-Tolerant Computing, pp. 214-221, 1991.
[3] M. Barborak, M. Malek, and A.T. Dahbura, The Consensus Problem in Fault-Tolerant Computing ACM Computing Surveys, vol. 25, pp. 171-220, June 1993.
[4] D. Blough and H. Brown, The Broadcast Comparison Model for On-Line Fault Diagnosis in Multicomputer Systems: Theory and Implementation IEEE Trans. Computers, vol. 48, pp. 470-493, May 1999.
[5] R. Bianchini and R. Buskens, “Implementation of On-Line Distributed System-Level Diagnosis Theory,” IEEE Trans. Computers, vol. 41, no. 5, pp. 616-626, May 1992.
[6] M. Clegg and K. Marzullo, A Low-Cost Processor Group Membership Protocol for a Hard Real-Time Distributed System Proc. 18th Real-Time Systems Symp., pp. 90-98, 1997.
[7] F. Cristian, Reaching Agreement on Processor-Group Membership in Synchronous Distributed Systems Distributed Computing, pp. 175-187, 1991.
[8] E.P. Duarte Jr., A. Brawerman, and L.C.P. Albini, An Algorithm for Distributed Hierarchical Diagnosis of Dynamic Fault and Repair Events Proc. Seventh Int'l Conf. Parallel and Distributed Systems, pp. 299-306, 2000.
[9] E.P. Duarte Jr. and T. Nanya, A Hierarchical Adaptive Distributed System-Level Diagnosis Algorithm IEEE Trans. Computers, vol. 47, pp. 34-45, Jan. 1998.
[10] S. Even, Graph Algorithms. Computer Science Press, 1979.
[11] P.D. Ezhilchelvan and R. de Lemos, “A Robust Group Membership Algorithm for Distributed Real-Time Systems,” Proc. Real-Time Systems Symp., 1990.
[12] M. Hiltunen, “Membership and System Diagnosis,” Proc. 14th Symp. Reliable Distributed Systems, pp. 208-217, 1995.
[13] S. Hosseini, J. Kuhl, and S. Reddy, A Diagnosis Algorithm for Distributed Comp. Systems with Dynamic Failure and Repair IEEE Trans. Computers, vol. 33, pp. 223-233, Mar. 1984.
[14] W. Hurwood, Ongoing Fault Diagnosis Proc. 15th Symp. Reliable Distributed Systems, pp. 108-117, 1996.
[15] K.H. Kim et al., An Efficient Decentralized Approach to Processor-Group Membership Maintenance in Real-Time LAN Systems: The PRHB/ED Scheme Proc. 11th Symp. Real-Time Distributed Systems, pp. 74-83, 1992.
[16] H. Kopetz and G. Grünsteidl, "TTP: A Time-Triggered Protocol for Fault-Tolerant Real-Time Systems," Computer, vol. 24, no. 1, Jan. 1994, pp. 14-23.
[17] P. Maestrini and P. Santi, “Self-Diagnosis of Processor Arrays Using a Comparison Model,” Proc. Symp. Reliable and Distributed Systems (SRDS-14), pp. 218-228, Sept. 1995.
[18] A. Pelc, "Undirected Graph Models for System-Level Fault Diagnosis," IEEE Trans. Computers, vol. 40, pp. 1,271-1,276, 1991.
[19] S. Rangarajan, A.T. Dahbura, and E.A. Ziegler, "A Distributed System-Level Diagnosis Algorithm for Arbitrary Network Topologies," IEEE Trans. Computers, vol. 44, pp. 312-333, 1995.
[20] S. Rangarajan and D. Fussell, “Diagnosing Arbitrarily Connected Parallel Computers with High Probability,” IEEE Trans. Computers, vol. 41, pp. 606-615, 1992.
[21] J. Rushby, A Formally Verified Algorithm for Clock Synchronization under a Hybrid Fault Model Proc. 13th ACM Symp. Principles of Distributed Computing, pp. 304-313, 1994.
[22] R.D. Schlichting and F.B. Schneider, Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems ACM Trans. Computer Systems, vol. 1, no. 3, pp. 222-238, Aug. 1983.
[23] A. Sengupta and A. Dahbura, “On Self-Diagnosable Multiprocessor Systems: Diagnosis by the Comparison Approach,” IEEE Trans. Computers, vol. 41, no. 11, pp. 1386-1396, Nov. 1992.
[24] M. Stahl, R. Buskens, and R. Bianchini, On-Line Diagnosis in General Topology Networks Proc. Workshop Fault Tolerant Parallel and Distributed Systems, 1992.
[25] A. Subbiah, Design and Evaluation of a Distributed Diagnosis Algorithm for Arbitrary Network Topologies in Dynamic Fault Environments MS thesis, Georgia Inst. of Technology,http://www.ece.gatech.edu/~arunms_thesis.pdf , 2001.
[26] P.M. Thambidurai and Y.K. Park,"Interactive Consistency with Multiple Failure Modes," Proc. seventh Reliable Dist. Systems Symp., Oct. 1988.

Index Terms:
Distributed diagnosis, dynamic failures, fault tolerance, synchronous systems.
Citation:
Arun Subbiah, Douglas M. Blough, "Distributed Diagnosis in Dynamic Fault Environments," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 5, pp. 453-467, May 2004, doi:10.1109/TPDS.2004.1278102
Usage of this product signifies your acceptance of the Terms of Use.