The Broadcast Comparison Model for On-Line Fault Diagnosis in Multicomputer Systems: Theory and Implementation
May 1999 (vol. 48 no. 5)
pp. 470-493
DOI Bookmark:
http://doi.ieeecomputersociety.org/10.1109/12.769431
Abstract—This paper describes a new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability. The classical problems of diagnosability and diagnosis are both considered under this broadcast comparison model. A characterization of diagnosable systems is given, which leads to a polynomial-time diagnosability algorithm. A polynomial-time diagnosis algorithm for [1] F.J. Allan, T. Kameda, and S. Toida, “An Approach to the Diagnosability Analysis of a System,” IEEE Trans. Computers, vol. 23, no. 10, pp. 1,040-1,042, Oct. 1975.[2] A. Bagchi and S.L. Hakimi, "An Optimal Algorithm for Distributed System Level Diagnosis," Proc. IEEE CS 21st Int'l Symp. Fault-Tolerant Computing, pp. 214-221, 1991.[3] F. Bao and Y. Igarashi, “Reliable Broadcasting in Product Networks with Byzantine Faults,” Digest 26th Int'l Symp. Fault-Tolerant Computing, pp. 262-271, 1996.[4] F. Barsi, F. Grandoni, and P. Maestrini, “A Theory of Diagnosability of Digital Systems,” IEEE Trans. Computers, vol. 25, no. 6, pp. 585-593, June 1976.[5] R. Bianchini and R. Buskens, “Implementation of On-Line Distributed System-Level Diagnosis Theory,” IEEE Trans. Computers, vol. 41, no. 5, pp. 616-626, May 1992.[6] R. Bianchini, K. Goodwin, and D. Nydick, “Practical Application and Implementation of Distributed System-Level Diagnosis Theory,” Digest 20th Int'l Symp. Fault-Tolerant Computing, pp.332-339, 1990.[7] K.Y. Chwa and S.L. Hakimi, “Schemes for Fault-Tolerant Computing: A Comparison of Modularly Redundant and$t$-Diagnosable Systems,” Information and Control, vol. 49, pp. 212-238, 1981.[8] F. Cristian, H. Aghili, and R. Strong, “Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement,” Digest 15th Int'l Symp. Fault-Tolerant Computing, pp. 200-206, 1985.[9] D. Cummings and L. Alkalaj, “Checkpoint/Rollback in a Distributed System using Coarse-Grained Dataflow,” Digest 24th Int'l Symp. Fault-Tolerant Computing, pp. 424-433, 1994.[10] A.T. Dahbura, “System-Level Diagnosis: A Perspective for The Third Decade,” Concurrent Computation: Algorithms, Architectures, Technologies. Plenum 1988.[11] A.T. Dahbura and G.M. Masson, “An$O(n^{2. 5})$Fault Identification Algorithm for Diagnosable Systems,” IEEE Trans. Computers, vol.33, no. 6, pp. 486-492, June 1984.[12] D. Dolev, “The Byzantine Generals Strike Again,” J. Algorithms, vol. 3, pp. 14-30, 1982.[13] V. Hadzilacos and S. Toueg, "Fault-Tolerant Broadcasts and Related Problems," in Distributed Systems, S. Mullender, ed., ACM Press, New York, 1993, pp. 97-138.[14] M. Hiltunen, “Membership and System Diagnosis,” Proc. 14th Symp. Reliable Distributed Systems, pp. 208-217, 1995.[15] S. Hosseini, J. Kuhl, and S. Reddy, “A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair,” IEEE Trans. Computers, vol. 33, no. 3, pp. 223-233, Mar. 1984.[16] J.G. Kuhl and S.M. Reddy, "Distributed Fault Tolerance for Large Multiprocessor Systems," Proc. 1980 Computer ArchitectureSymp., pp. 222-229, May 1980.[17] B.F. Lewis et al. “COSMOS Multicomputer Operating System and Development Environment Functional Specification,” NASA Technical Memorandum, Caltech, JPL, Aug. 1992.[18] B.F. Lewis and R.L. Bunker, “MAX: An Advanced Parallel Computer for Space Applications,” Proc. Second Int'l Symp. Space Information Systems, Sept. 1990.[19] J. Maeng and M. Malek, “A Comparison Connection Assignment for Self-Diagnosis of Multiprocessor Systems,” Digest 11th Int'l Symp. Fault Tolerant Computing, pp. 173-175, 1981.[20] M. Malek, “A Comparison Connection Assignment for Diagnosis of Multiprocessor Systems,” Proc. Seventh Int'l Symp. Computer Architecture, pp. 31-35, 1980.[21] G.G.L. Meyer, “A Diagnosis Algorithm for the BGM System Level Fault Model,” IEEE Trans. Computers vol. 33, no. 8, pp. 756-758, Aug. 1984.[22] J. Narasimhan and K. Nakajima, “An Algorithm for Determining the Fault Diagnosability of a System,” IEEE Trans. Computers, vol. 35, no. 11, pp. 1,004-1,008, Nov. 1986.[23] M. Pfluegl and D. Blough, “Communication Protocols for Fault-Tolerant Clock Synchronization in Not-Completely-Connected Networks,” Proc. 11th Symp. Reliable Distributed Systems, pp. 130-137, 1992.[24] D. Powell, G. Bonn, D. Seaton, P. Verissimo, and F. Waeselynck, The Delta-4 Approach to Dependability in Open Distributed Computing Systems Proc. 18th IEEE Int'l Symp. Fault-Tolerant Computing (FTCS-18), pp. 246-251, June 1988.[25] F.P. Preparata, G. Metze, and R.T. Chien, “On the Connection Assignment Problem of Diagnosable Systems,” IEEE Trans. Electronic Computers, vol. 16, pp. 848-854, Dec. 1967.[26] R. D. Schlichting and F. B. Schneider,“Fail-stop processors: An approach to designing fault-tolerant computing systems,”ACM Trans. Comput. Syst., vol. 1, no. 3, pp. 222–238, Aug. 1983. [27] A. Sengupta and A. Dahbura, “On Self-Diagnosable Multiprocessor Systems: Diagnosis by the Comparison Approach,” IEEE Trans. Computers, vol. 41, no. 11, pp. 1386-1396, Nov. 1992.[28] G.F. Sullivan, "A Polynomial Time Algorithm for Fault Diagnosability," Proc. 25th Symp. Foundations of Computer Science, pp. 148-156, Oct. 1984.[29] C. Walter, N. Suri, and M. Hugue, “Continual On-Line Diagnosis of Hybrid Faults,” Proc. Fourth IFIP Working Conf. Dependable Computing for Critical Applications, pp. 233-249, 1994.[30] H. Wang, “Practical Comparison-Based Fault Diagnosis in Multiprocessor Systems,” PhD dissertation, Dept. of Electrical and Computer Eng., Univ. of California, Irvine, June 1995.[31] H. Wang, D. Blough, and L. Alkalaj, “Analysis and Experimental Evaluation of Comparison-Based System-Level Diagnosis for Multiprocessor Systems,” Digest 24th IEEE Int'l Symp. Fault-Tolerant Computing, pp. 55-64, 1994.
Index Terms:
Diagnosability, distributed algorithms, fault diagnosis, multicomputer systems.
Citation:
Douglas M. Blough, Hongying W. Brown, "The Broadcast Comparison Model for On-Line Fault Diagnosis in Multicomputer Systems: Theory and Implementation," IEEE Transactions on Computers, vol. 48, no. 5, pp. 470-493, May 1999, doi:10.1109/12.769431
Usage of this product signifies your acceptance of the
Terms of Use.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||