23rd IEEE International Conference on Distributed Computing Systems (ICDCS'03)
Enhancing The Fault-Tolerance of Nonmasking Programs
Providence, Rhode Island
May 19-May 22
ISBN: 0-7695-1920-2
In this paper, we focus on automated techniques to enhance the fault-tolerance of a nonmasking fault-tolerant program to masking. A masking program continually satisfies its specification even if faults occur. By contrast, a nonmasking program merely guarantees that after faults stop occurring, the program recovers to states from which it continually satisfies its specification. Until the recovery is complete, however, a nonmasking program can violate its (safety) specification. Thus, the problem of enhancing fault-tolerance from nonmasking to masking requires that safety be added and recovery be preserved. We focus on this enhancement problem for high atomicity programs, where each process can read all variables, and for distributed programs, where restrictions are imposed on what processes can read and write. We present a sound and complete algorithm for high atomicity programs and a sound algorithm for distributed programs. We also argue that our algorithms are simpler than previous algorithms that add masking fault-tolerance to a fault-intolerant program. Hence, these algorithms can partially reap the benefits of automation when the cost of adding masking fault-tolerance to a fault-intolerant program is high. To illustrate these algorithms, we show how the masking fault-tolerant programs for triple modular redundancy and Byzantine agreement can be obtained by enhancing the fault-tolerance of the corresponding nonmasking versions. We also discuss how the derivation of these programs is simplified when we begin with a nonmasking fault-tolerant program.
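As a rough illustration of the problem statement (adding safety while preserving recovery), the following Python sketch works on an explicit finite state graph in the high atomicity model. It is not the authors' algorithm. It assumes that transitions are (state, state) pairs, that the safety specification is given as a set of bad transitions, and that the nonmasking program's fault-span is closed under faults; the names add_masking, backward_closure and recovery_ranks are hypothetical. The sketch only removes states and transitions: it drops states from which faults alone can violate safety, discards unsafe program transitions, removes states that can no longer recover, and keeps only transitions outside the invariant that decrease the distance to it, so recovery cannot cycle. It checks deadlock freedom inside the invariant but no other liveness properties.

```python
from collections import deque


def backward_closure(seeds, transitions, universe):
    """States that can reach `seeds` via `transitions`, exploring only `universe`."""
    preds = {}
    for (s, t) in transitions:
        preds.setdefault(t, set()).add(s)
    closed = set(seeds)
    work = deque(seeds)
    while work:
        t = work.popleft()
        for s in preds.get(t, ()):
            if s in universe and s not in closed:
                closed.add(s)
                work.append(s)
    return closed


def recovery_ranks(targets, transitions, universe):
    """Shortest distance (in program steps) from each state in `universe` to `targets`."""
    preds = {}
    for (s, t) in transitions:
        preds.setdefault(t, set()).add(s)
    rank = {s: 0 for s in targets}
    work = deque(targets)
    while work:
        t = work.popleft()
        for s in preds.get(t, ()):
            if s in universe and s not in rank:
                rank[s] = rank[t] + 1
                work.append(s)
    return rank


def add_masking(prog, faults, inv, span, bad):
    """Hypothetical sketch: return (new_prog, new_inv, new_span) or None if it fails."""
    # States from which a sequence of faults alone executes a bad transition.
    removed = backward_closure({s for (s, t) in faults if (s, t) in bad}, faults, span)
    while True:
        t_span = span - removed
        s_inv = inv - removed
        # Keep safe program transitions that stay inside the new fault-span,
        # and keep the invariant closed under the program.
        p = {(s, t) for (s, t) in prog
             if (s, t) not in bad and s in t_span and t in t_span
             and (s not in s_inv or t in s_inv)}
        # Deadlocked invariant states and states that cannot recover get no rank.
        rank = recovery_ranks(
            {s for s in s_inv if any(a == s for (a, _) in p)}, p, t_span)
        stuck = t_span - rank.keys()
        if not stuck:
            break
        # Keep the fault-span closed under faults: also drop states from which
        # a fault can lead into a stuck state.
        removed |= backward_closure(stuck, faults, t_span)
    if not s_inv:
        return None
    # Outside the invariant, keep only rank-decreasing transitions.
    new_prog = {(s, t) for (s, t) in p if s in s_inv or rank[t] < rank[s]}
    return new_prog, s_inv, t_span
```

For example, with invariant {0}, faults 0 -> 1 and 0 -> 2, and a nonmasking program whose recovery from state 2 may take the unsafe step 2 -> 3, the sketch keeps the alternative recovery step 2 -> 0 and drops 2 -> 3. If 2 -> 0 is absent, state 2 cannot recover safely, the fault from state 0 forces state 0 out as well, and the sketch returns None.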
Index Terms:
Automatic addition of fault-tolerance, Formal methods, Fault-tolerance, Program synthesis, Program transformation, Distributed programs
Citation:
Sandeep S. Kulkarni, Ali Ebnenasir, "Enhancing The Fault-Tolerance of Nonmasking Programs," ICDCS, p. 441, 23rd IEEE International Conference on Distributed Computing Systems (ICDCS'03), 2003