Issue No. 02 - February (1995 vol. 44)

ISSN: 0018-9340

pp: 312-334

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.364542

ABSTRACT

In this paper, a distributed algorithm is described for detecting and diagnosing faulty processors in an arbitrary network. Fault-free processors perform simple periodic tests on one another; when a fault is detected or a newly-repaired processor joins the network, this new information is disseminated *in parallel* throughout the network. It is formally proven that the algorithm is correct, and it is also shown that the algorithm is optimal in terms of the time required for all of the fault-free processors in the network to learn of a new event. Simulation results are given for arbitrary network topologies.

INDEX TERMS

Computer fault diagnosis, computer fault tolerance, computer networks, distributed computing, system-level fault diagnosis, distributed algorithm, fault detection.

CITATION

Eric A. Ziegler, Sampath Rangarajan, Anton T. Dahbura, "A Distributed System-Level Diagnosis Algorithm for Arbitrary Network Topologies," *IEEE Transactions on Computers*, vol. 44, no. 2, pp. 312-334, February 1995, doi:10.1109/12.364542