Support for Software Interrupts in Log-Based Rollback-Recovery
October 1998 (vol. 47 no. 10)
pp. 1113-1123

Abstract—The piecewise deterministic execution model is a fundamental assumption in many log-based rollback-recovery protocols. Process execution in this model consists of intervals, each starting with the receipt of a message at an application-defined execution point. Execution within each interval is deterministic and messages are the only source of nondeterminism that affects the computation. This simple model excludes the nondeterminism that results when asynchronous signals or interrupts occur at arbitrary execution points. As a result, a wide range of applications cannot use log-based rollback-recovery in practice.

We present a solution that removes this restriction and allows applications to replay interrupts at the same execution points during recovery. The solution relies on using a software counter to compute the number of instructions between the asynchronous signals during normal operation. Should a failure occur, the instruction counts are used to force the replay of these signals at the same execution points. The execution of the application thus can be replayed to recreate the prefailure state while accommodating nondeterminism due to asynchronous signals. We then use the deterministic replay of interrupts to solve another problem, namely tracking nondeterminism due to interleaved shared memory access in multithreaded applications on a single processor. We use the instruction counter solution to implement a user-level thread package in which thread scheduling decisions can be replayed if a failure occurs. By repeating the scheduling decisions during an execution replay, threads access the shared memory in the same order, which allows the execution to be reconstructed. This technique allows multithreaded applications to use log-based rollback-recovery with low overhead, which was not previously possible. Two prototype implementations show that the overhead is no more than a 6 percent slowdown in application execution on the DEC Alpha, and from 6 percent to 18 percent on the Intel Pentium. Thus, restrictions of the piecewise deterministic execution model can be lifted at a reasonable cost.
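The log-and-replay idea described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper's counter instruments real machine code, whereas here a loop index stands in for the instruction count and the absolute count is logged rather than the interval between signals. The functions `step`, `handler`, `run_and_log`, and `replay` are hypothetical names introduced only for this illustration.

```python
import random

MOD = 1_000_003

def step(state, i):
    # One deterministic "instruction": mixes the step index into the state.
    return (state * 31 + i) % MOD

def handler(state, signal):
    # Signal handler whose effect depends on the state at delivery time,
    # so the point of delivery matters for the final result.
    return (state ^ (signal * 7919)) % MOD

def run_and_log(n_steps, rng):
    """Normal operation: signals arrive at arbitrary points and are logged."""
    state, log = 1, []
    for count in range(n_steps):
        if rng.random() < 0.01:           # an asynchronous signal arrives
            signal = rng.randint(1, 5)
            log.append((count, signal))   # record the instruction count
            state = handler(state, signal)
        state = step(state, count)
    return state, log

def replay(n_steps, log):
    """Recovery: deliver each logged signal at the same instruction count."""
    state, pending = 1, dict(log)         # at most one signal per count here
    for count in range(n_steps):
        if count in pending:
            state = handler(state, pending[count])
        state = step(state, count)
    return state

final_state, signal_log = run_and_log(10_000, random.Random(42))
recovered_state = replay(10_000, signal_log)
```

In this sketch `recovered_state` equals `final_state` because the replay needs no random source: every signal is delivered at its logged count, and everything between signals is deterministic.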

Index Terms:
Checkpointing, distributed systems, instruction counters, message logging, rollback-recovery.
J. Hamilton Slye, E.N. Elnozahy, "Support for Software Interrupts in Log-Based Rollback-Recovery," IEEE Transactions on Computers, vol. 47, no. 10, pp. 1113-1123, Oct. 1998, doi:10.1109/12.729794