A Distributed System-Level Diagnosis Algorithm for Arbitrary Network Topologies
February 1995 (vol. 44 no. 2)
pp. 312-334

Abstract—In this paper, a distributed algorithm is described for detecting and diagnosing faulty processors in an arbitrary network. Fault-free processors perform simple periodic tests on one another; when a fault is detected or a newly repaired processor joins the network, this new information is disseminated in parallel throughout the network. It is formally proven that the algorithm is correct, and it is also shown that the algorithm is optimal in terms of the time required for all of the fault-free processors in the network to learn of a new event. Simulation results are given for arbitrary network topologies.
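
The parallel dissemination step can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a plain breadth-first flood over the fault-free processors, under the assumptions that faulty processors neither receive nor forward messages and that every link takes one time unit. The function name `disseminate`, the adjacency-list representation, and the five-node ring are all hypothetical choices made for illustration.

```python
from collections import deque


def disseminate(adjacency, faulty, origin):
    """Flood an event from `origin` and return {node: round it learned the event}.

    Assumptions (illustrative only): faulty nodes drop all messages, and each
    hop between neighbouring fault-free nodes costs exactly one round.
    """
    if origin in faulty:
        raise ValueError("origin must be a fault-free processor")

    learned = {origin: 0}          # the detecting node knows the event at round 0
    queue = deque([origin])

    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # Skip faulty neighbours and nodes that already know the event.
            if neighbor in faulty or neighbor in learned:
                continue
            learned[neighbor] = learned[node] + 1
            queue.append(neighbor)

    return learned


# Hypothetical five-node ring in which node 2 has failed and node 1 detects it.
ring = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
rounds = disseminate(ring, faulty={2}, origin=1)
print(rounds)
```

Under these assumptions, the round recorded for each node equals its shortest-path distance from the detecting node through fault-free processors, which is a lower bound on how soon any flooding scheme could inform it. Fault-free nodes that are cut off from the origin by faulty processors simply do not appear in the result.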

Index Terms—Computer fault diagnosis, computer fault tolerance, computer networks, distributed computing, system-level fault diagnosis, distributed algorithm, fault detection.

Sampath Rangarajan, Anton T. Dahbura, Eric A. Ziegler, "A Distributed System-Level Diagnosis Algorithm for Arbitrary Network Topologies," IEEE Transactions on Computers, vol. 44, no. 2, pp. 312-334, Feb. 1995, doi:10.1109/12.364542