Programming Language Support for Writing Fault-Tolerant Distributed Software
February 1995 (vol. 44 no. 2)
pp. 203-212

Abstract: Good programming language support can simplify the task of writing fault-tolerant distributed software. Here, an approach to providing such support is described in which a general high-level distributed programming language is augmented with mechanisms for fault tolerance. Unlike approaches based on sequential languages or specialized languages oriented towards a given fault-tolerance technique, this approach gives the programmer a high level of abstraction, while still maintaining flexibility and execution efficiency. The paper first describes a programming model that captures the important characteristics that should be supported by a programming language of this type. It then presents a realization of this approach in the form of FT-SR, a programming language that augments the SR distributed programming language with features for replication, recovery, and failure notification. In addition to outlining these extensions, an example program consisting of a data manager and its associated stable storage is given. Finally, an implementation of the language that uses the x-kernel and runs standalone on a network of Sun workstations is discussed. The overall structure and several of the algorithms used in the runtime are interesting in their own right.

Richard D. Schlichting, Vicraj T. Thomas, "Programming Language Support for Writing Fault-Tolerant Distributed Software," IEEE Transactions on Computers, vol. 44, no. 2, pp. 203-212, Feb. 1995, doi:10.1109/12.364532