Issue No. 10 - October (1998 vol. 47)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.729794
<p><b>Abstract</b>—The piecewise deterministic execution model is a fundamental assumption in many log-based rollback-recovery protocols. Process execution in this model consists of intervals, each starting with the receipt of a message at an application-defined execution point. Execution within each interval is deterministic and messages are the only source of nondeterminism that affects the computation. This simple model excludes the nondeterminism that results when asynchronous signals or interrupts occur at arbitrary execution points. As a result, a wide range of applications cannot use log-based rollback-recovery in practice.</p><p>We present a solution that removes this restriction and allows applications to replay interrupts at the same execution points during recovery. The solution relies on using a software counter to compute the number of instructions between the asynchronous signals during normal operation. Should a failure occur, the instruction counts are used to force the replay of these signals at the same execution points. The execution of the application thus can be replayed to recreate the prefailure state while accommodating nondeterminism due to asynchronous signals. We then use the deterministic replay of interrupts to solve another problem, namely tracking nondeterminism due to interleaved shared memory access in multithreaded applications on a single processor. We use the instruction counter solution to implement a user-level thread package in which thread scheduling decisions can be replayed if a failure occurs. By repeating the scheduling decisions during an execution replay, threads access the shared memory in the same order and the execution to be reconstructed. This technique allows multithreaded applications to use log-based rollback-recovery with low overhead, which was not previously possible. We carried out two prototype implementations that have shown the overhead is no more than a 6 percent slowdown in application execution on the DEC Alpha, and from 6 percent to 18 percent on the Intel Pentium. Thus, restrictions of the piecewise deterministic execution model can be lifted at a reasonable cost.</p>
Checkpointing, distributed systems, instruction counters, message logging, rollback-recovery.
J. Hamilton Slye, E.n. Elnozahy, "Support for Software Interrupts in Log-Based Rollback-Recovery", IEEE Transactions on Computers, vol. 47, no. , pp. 1113-1123, October 1998, doi:10.1109/12.729794