Closure and Convergence: A Foundation of Fault-Tolerant Computing
November 1993 (vol. 19 no. 11)
pp. 1015-1027

The authors formally define what it means for a system to tolerate a class of faults. The definition consists of two conditions. The first is that if a fault occurs when the system state is within the set of legal states, the resulting state is within some larger set and, if faults continue to occur, the system state remains within that larger set (closure). The second is that if faults stop occurring, the system eventually reaches a state within the legal set (convergence). The applicability of the definition for specifying and verifying the fault-tolerance properties of a variety of digital and computer systems is demonstrated. Using the definition, the authors obtain a simple classification of fault-tolerant systems. Methods for the systematic design of such systems are discussed.
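The two conditions can be checked mechanically on a small finite-state model. The sketch below is only an illustration and is not taken from the paper: the toy counter system, the names `is_closed`, `converges`, `LEGAL`, `SPAN`, `step_down`, `corrupt`, and `stuck`, and the choice to model each action as a function from a state to its set of successor states are all assumptions. Convergence is checked in the worst case, with no fairness assumption: every program-only computation from the larger set must reach the legal set.

```python
def is_closed(states, actions):
    """Closure: every action maps every state in `states` back into `states`."""
    return all(nxt in states
               for s in states
               for act in actions
               for nxt in act(s))


def converges(span, legal, program):
    """Convergence: with faults stopped, every program computation that
    starts in `span` eventually reaches `legal` (worst case, no fairness)."""
    reached = set(legal)  # states already known to lead into `legal`
    changed = True
    while changed:
        changed = False
        for s in span - reached:
            succ = {n for act in program for n in act(s)}
            # s is safe if it can move and every move lands in a safe state
            if succ and succ <= reached:
                reached.add(s)
                changed = True
    return reached >= span


# Hypothetical toy system: a counter whose only legal value is 0.
LEGAL = {0}
SPAN = {0, 1, 2, 3}          # the "larger set" that faults may reach


def step_down(x):
    """Program action: decrement toward 0, then stay at 0."""
    return {x - 1} if x > 0 else {x}


def corrupt(x):
    """Fault action: overwrite the counter with any value in 0..3."""
    return {0, 1, 2, 3}


def stuck(x):
    """A broken program action that never makes progress."""
    return {x}


closure_ok = (is_closed(LEGAL, [step_down])
              and is_closed(SPAN, [step_down, corrupt]))
convergence_ok = converges(SPAN, LEGAL, [step_down])
broken_convergence = converges(SPAN, LEGAL, [stuck])
```

In this toy model both conditions hold for `step_down`, while replacing it with `stuck` keeps the state inside the larger set but never returns it to the legal set, so convergence fails.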

Index Terms:
fault-tolerant computing; legal states; convergence; closure; verification; formal verification
A. Arora, M. Gouda, "Closure and Convergence: A Foundation of Fault-Tolerant Computing," IEEE Transactions on Software Engineering, vol. 19, no. 11, pp. 1015-1027, Nov. 1993, doi:10.1109/32.256850