A Case for Two-Level Recovery Schemes
June 1998 (vol. 47 no. 6)
pp. 656-666

Abstract: Long-running applications are often subject to failures, and a failure can cause a significant loss of computation. A failure recovery scheme is therefore needed to minimize performance overhead in the presence of failures. In this paper, we argue that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may incur a higher overhead. By minimizing overhead for the more frequently occurring failure scenarios, the two-level approach can achieve lower average performance overhead than existing recovery schemes.

The paper describes two two-level recovery schemes. Performance analysis using a Markov chain shows that, in practice, a two-level scheme can perform better than its "one-level" counterpart. While the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that provide multiple levels of fault tolerance and achieve better performance than existing recovery schemes. The paper presents an analytical approach for evaluating the performance of two-level schemes and shows that such schemes are hard to optimize analytically.

Index Terms:
Failure recovery, performance analysis, checkpointing and rollback, recovery overhead, Markov chains.
Nitin H. Vaidya, "A Case for Two-Level Recovery Schemes," IEEE Transactions on Computers, vol. 47, no. 6, pp. 656-666, June 1998, doi:10.1109/12.689645