This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
The Broadcast Comparison Model for On-Line Fault Diagnosis in Multicomputer Systems: Theory and Implementation
May 1999 (vol. 48 no. 5)
pp. 470-493

Abstract—This paper describes a new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability. The classical problems of diagnosability and diagnosis are both considered under this broadcast comparison model. A characterization of diagnosable systems is given, which leads to a polynomial-time diagnosability algorithm. A polynomial-time diagnosis algorithm for $t$-diagnosable systems is also given. A variation of this algorithm, which allows dynamic fault occurrence and incomplete diagnostic information, has been implemented in the COmmon Spaceborne Multicomputer Operating System (COSMOS). Results produced using a simulator for the JPL MAX multicomputer system running COSMOS show that the algorithm diagnoses all fault situations with low latency and very little overhead. These simulations demonstrate the practicality of the proposed diagnosis model and algorithm for multicomputer systems having weak reliable broadcast. This includes systems with fault-tolerant hardware for broadcast, as well as those where reliable broadcast is implemented in software.

[1] F.J. Allan, T. Kameda, and S. Toida, “An Approach to the Diagnosability Analysis of a System,” IEEE Trans. Computers, vol. 23, no. 10, pp. 1,040-1,042, Oct. 1975.
[2] A. Bagchi and S.L. Hakimi, "An Optimal Algorithm for Distributed System Level Diagnosis," Proc. IEEE CS 21st Int'l Symp. Fault-Tolerant Computing, pp. 214-221, 1991.
[3] F. Bao and Y. Igarashi, “Reliable Broadcasting in Product Networks with Byzantine Faults,” Digest 26th Int'l Symp. Fault-Tolerant Computing, pp. 262-271, 1996.
[4] F. Barsi, F. Grandoni, and P. Maestrini, “A Theory of Diagnosability of Digital Systems,” IEEE Trans. Computers, vol. 25, no. 6, pp. 585-593, June 1976.
[5] R. Bianchini and R. Buskens, “Implementation of On-Line Distributed System-Level Diagnosis Theory,” IEEE Trans. Computers, vol. 41, no. 5, pp. 616-626, May 1992.
[6] R. Bianchini, K. Goodwin, and D. Nydick, “Practical Application and Implementation of Distributed System-Level Diagnosis Theory,” Digest 20th Int'l Symp. Fault-Tolerant Computing, pp.332-339, 1990.
[7] K.Y. Chwa and S.L. Hakimi, “Schemes for Fault-Tolerant Computing: A Comparison of Modularly Redundant and$t$-Diagnosable Systems,” Information and Control, vol. 49, pp. 212-238, 1981.
[8] F. Cristian, H. Aghili, and R. Strong, “Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement,” Digest 15th Int'l Symp. Fault-Tolerant Computing, pp. 200-206, 1985.
[9] D. Cummings and L. Alkalaj, “Checkpoint/Rollback in a Distributed System using Coarse-Grained Dataflow,” Digest 24th Int'l Symp. Fault-Tolerant Computing, pp. 424-433, 1994.
[10] A.T. Dahbura, “System-Level Diagnosis: A Perspective for The Third Decade,” Concurrent Computation: Algorithms, Architectures, Technologies. Plenum 1988.
[11] A.T. Dahbura and G.M. Masson, “An$O(n^{2. 5})$Fault Identification Algorithm for Diagnosable Systems,” IEEE Trans. Computers, vol.33, no. 6, pp. 486-492, June 1984.
[12] D. Dolev, “The Byzantine Generals Strike Again,” J. Algorithms, vol. 3, pp. 14-30, 1982.
[13] V. Hadzilacos and S. Toueg, "Fault-Tolerant Broadcasts and Related Problems," in Distributed Systems, S. Mullender, ed., ACM Press, New York, 1993, pp. 97-138.
[14] M. Hiltunen, “Membership and System Diagnosis,” Proc. 14th Symp. Reliable Distributed Systems, pp. 208-217, 1995.
[15] S. Hosseini, J. Kuhl, and S. Reddy, “A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair,” IEEE Trans. Computers, vol. 33, no. 3, pp. 223-233, Mar. 1984.
[16] J.G. Kuhl and S.M. Reddy, "Distributed Fault Tolerance for Large Multiprocessor Systems," Proc. 1980 Computer ArchitectureSymp., pp. 222-229, May 1980.
[17] B.F. Lewis et al. “COSMOS Multicomputer Operating System and Development Environment Functional Specification,” NASA Technical Memorandum, Caltech, JPL, Aug. 1992.
[18] B.F. Lewis and R.L. Bunker, “MAX: An Advanced Parallel Computer for Space Applications,” Proc. Second Int'l Symp. Space Information Systems, Sept. 1990.
[19] J. Maeng and M. Malek, “A Comparison Connection Assignment for Self-Diagnosis of Multiprocessor Systems,” Digest 11th Int'l Symp. Fault Tolerant Computing, pp. 173-175, 1981.
[20] M. Malek, “A Comparison Connection Assignment for Diagnosis of Multiprocessor Systems,” Proc. Seventh Int'l Symp. Computer Architecture, pp. 31-35, 1980.
[21] G.G.L. Meyer, “A Diagnosis Algorithm for the BGM System Level Fault Model,” IEEE Trans. Computers vol. 33, no. 8, pp. 756-758, Aug. 1984.
[22] J. Narasimhan and K. Nakajima, “An Algorithm for Determining the Fault Diagnosability of a System,” IEEE Trans. Computers, vol. 35, no. 11, pp. 1,004-1,008, Nov. 1986.
[23] M. Pfluegl and D. Blough, “Communication Protocols for Fault-Tolerant Clock Synchronization in Not-Completely-Connected Networks,” Proc. 11th Symp. Reliable Distributed Systems, pp. 130-137, 1992.
[24] D. Powell, G. Bonn, D. Seaton, P. Verissimo, and F. Waeselynck, The Delta-4 Approach to Dependability in Open Distributed Computing Systems Proc. 18th IEEE Int'l Symp. Fault-Tolerant Computing (FTCS-18), pp. 246-251, June 1988.
[25] F.P. Preparata, G. Metze, and R.T. Chien, “On the Connection Assignment Problem of Diagnosable Systems,” IEEE Trans. Electronic Computers, vol. 16, pp. 848-854, Dec. 1967.
[26] R. D. Schlichting and F. B. Schneider,“Fail-stop processors: An approach to designing fault-tolerant computing systems,”ACM Trans. Comput. Syst., vol. 1, no. 3, pp. 222–238, Aug. 1983.
[27] A. Sengupta and A. Dahbura, “On Self-Diagnosable Multiprocessor Systems: Diagnosis by the Comparison Approach,” IEEE Trans. Computers, vol. 41, no. 11, pp. 1386-1396, Nov. 1992.
[28] G.F. Sullivan, "A Polynomial Time Algorithm for Fault Diagnosability," Proc. 25th Symp. Foundations of Computer Science, pp. 148-156, Oct. 1984.
[29] C. Walter, N. Suri, and M. Hugue, “Continual On-Line Diagnosis of Hybrid Faults,” Proc. Fourth IFIP Working Conf. Dependable Computing for Critical Applications, pp. 233-249, 1994.
[30] H. Wang, “Practical Comparison-Based Fault Diagnosis in Multiprocessor Systems,” PhD dissertation, Dept. of Electrical and Computer Eng., Univ. of California, Irvine, June 1995.
[31] H. Wang, D. Blough, and L. Alkalaj, “Analysis and Experimental Evaluation of Comparison-Based System-Level Diagnosis for Multiprocessor Systems,” Digest 24th IEEE Int'l Symp. Fault-Tolerant Computing, pp. 55-64, 1994.

Index Terms:
Diagnosability, distributed algorithms, fault diagnosis, multicomputer systems.
Citation:
Douglas M. Blough, Hongying W. Brown, "The Broadcast Comparison Model for On-Line Fault Diagnosis in Multicomputer Systems: Theory and Implementation," IEEE Transactions on Computers, vol. 48, no. 5, pp. 470-493, May 1999, doi:10.1109/12.769431
Usage of this product signifies your acceptance of the Terms of Use.