Issue No. 08 - Aug. (2012 vol. 23)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2011.284
Elias P. Duarte, Federal University of Paraná, Curitiba
Andréa Weber, Federal University of Paraná, Curitiba
Keiko V. Ono Fonseca, Federal University of Paraná, Curitiba
This work introduces the Distributed Network Reachability (DNR) algorithm, a distributed system-level diagnosis algorithm that allows every node of a partitionable arbitrary topology network to determine which portions of the network are reachable and which are unreachable. DNR is the first distributed diagnosis algorithm that works in the presence of network partitions and healings caused by dynamic fault and repair events. Both crash and timing faults are assumed, and a faulty node is indistinguishable from a network partition. Every link is tested alternately by one of its adjacent nodes in successive testing intervals. Upon detecting a new event, a node disseminates the new diagnostic information to the nodes it can reach. New events can occur before the dissemination completes. Whenever a working node detects a new event or is informed of one, it may compute the network reachability using its local diagnostic information. The bounded correctness of DNR is proved, including bounded diagnostic latency, bounded startup, and accuracy. Simulation results are presented for several random and regular topologies, showing the performance of the algorithm under highly dynamic fault situations.
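The reachability computation mentioned in the abstract can be illustrated with a minimal sketch. This is not the DNR algorithm as specified in the paper: it only shows how a node might derive the reachable and unreachable portions of the network from its local view of link states, using a breadth-first search. The function name `reachable_nodes`, the state labels "working" and "unresponsive", and the choice to treat links with unknown state as not traversable are all assumptions made for illustration.

```python
from collections import deque


def reachable_nodes(source, links, link_state):
    """Return the set of nodes reachable from `source`.

    `links` is an iterable of (u, v) pairs describing the known topology.
    `link_state` maps frozenset({u, v}) to "working" or "unresponsive"
    (hypothetical labels). Links with no recorded state are treated as
    not traversable, which is an assumption of this sketch.
    """
    # Build an adjacency map that contains only links believed to be working.
    adjacency = {}
    for u, v in links:
        if link_state.get(frozenset((u, v))) == "working":
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)

    # Breadth-first search over the working links.
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Example: a 4-node ring (0-1-2-3-0) with node 4 attached to node 2.
links = [(0, 1), (1, 2), (2, 3), (3, 0), (2, 4)]
state = {frozenset(link): "working" for link in links}

# Two link tests fail, cutting nodes 2 and 4 off from node 0's view.
state[frozenset((1, 2))] = "unresponsive"
state[frozenset((2, 3))] = "unresponsive"

all_nodes = {n for link in links for n in link}
reachable = reachable_nodes(0, links, state)
unreachable = all_nodes - reachable
print(sorted(reachable), sorted(unreachable))
```

In this sketch node 0 cannot tell whether node 2 has crashed or whether the network has partitioned around it; both cases yield the same unreachable set, which matches the abstract's statement that a faulty node is indistinguishable from a network partition.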
Network reachability, distributed diagnosis, multiprocessor systems, dynamic fault diagnosis, bounded correctness.
E. P. Duarte, K. V. Fonseca and A. Weber, "Distributed Diagnosis of Dynamic Events in Partitionable Arbitrary Topology Networks," in IEEE Transactions on Parallel & Distributed Systems, vol. 23, no. 8, pp. 1415-1426, 2012.