Issue No. 06 - June (1998 vol. 47)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.689645
<p><b>Abstract</b>—Long-running applications are often subject to failures. Failures can result in significant loss of computation. Therefore, it is necessary to use a failure recovery scheme to minimize performance overhead in the presence of failures. In this paper, we argue that it is often advantageous to use "two-level" recovery schemes. A <i>two-level</i> recovery scheme tolerates the <i>more probable</i> failures with low performance overhead, while the less probable failures may incur a higher overhead. By minimizing overhead for the more frequently occurring failure scenarios, the two-level approach can achieve lower performance overhead (on average) than existing recovery schemes.</p><p>The paper describes two two-level recovery schemes. Performance analysis using a Markov chain shows that, in practice, a two-level scheme can perform better than its "one-level" counterpart. While the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance and achieve better performance than existing recovery schemes. The paper presents an analytical approach for evaluating the performance of two-level schemes and shows that such schemes are hard to optimize analytically.</p>
Failure recovery, performance analysis, checkpointing and rollback, recovery overhead, Markov chains.
N. H. Vaidya, "A Case for Two-Level Recovery Schemes," in IEEE Transactions on Computers, vol. 47, no. 6, pp. 656-666, June 1998.