A Variational Calculus Approach to Optimal Checkpoint Placement
July 2001 (vol. 50 no. 7)
pp. 699-708

Abstract: Checkpointing is an effective fault-tolerance technique for improving system availability and reliability. However, blind checkpoint placement can result in either performance degradation or expensive recovery cost. By means of the calculus of variations, we derive an explicit formula that links the optimal checkpointing frequency to a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery. The theoretical result shows that the optimal checkpointing frequency is proportional to the square root of the failure rate and is uniquely determined by the failure rate (time-varying or constant) if the recovery function is strictly increasing and the failure rate satisfies $\lambda (\infty ) > 0$. Bruno and Coffman [2] suggest that optimal checkpointing is by nature a function of the system failure rate, i.e., a time-varying failure rate demands time-varying checkpointing in order to meet certain optimality criteria. The results obtained in this paper agree with their viewpoint.
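
The square-root relationship can be illustrated with a minimal Python sketch. It is not the formula derived in the paper, which the abstract does not state. As an assumption, it borrows the constant from Young's first-order approximation [25], where the interval between checkpoints is $\sqrt{2C/\lambda}$ for checkpoint cost $C$ and failure rate $\lambda$, and applies it pointwise to a time-varying rate. The function names, the numerical integration step, and the example values are all hypothetical.

```python
import math


def checkpoint_frequency(checkpoint_cost, failure_rate):
    """Checkpoints per unit time, proportional to sqrt(failure_rate).

    Assumption: the constant 1 / (2 * C) comes from Young's first-order
    approximation, not from the paper's own formula.
    """
    return math.sqrt(failure_rate / (2.0 * checkpoint_cost))


def checkpoint_interval(checkpoint_cost, failure_rate):
    """Time between checkpoints for a constant, positive failure rate."""
    return 1.0 / checkpoint_frequency(checkpoint_cost, failure_rate)


def place_checkpoints(rate_fn, checkpoint_cost, horizon, dt=0.01):
    """Place checkpoints for a time-varying failure rate rate_fn(t).

    The frequency is integrated numerically (midpoint rule); a checkpoint
    is emitted each time the running integral passes the next integer.
    """
    times = []
    accumulated = 0.0
    steps = int(round(horizon / dt))
    for i in range(steps):
        t_mid = (i + 0.5) * dt
        accumulated += checkpoint_frequency(checkpoint_cost, rate_fn(t_mid)) * dt
        if accumulated >= len(times) + 1:
            times.append((i + 1) * dt)
    return times
```

Under these assumptions, a constant failure rate yields evenly spaced (periodic) checkpoints, while an increasing failure rate yields intervals that shrink over time (aperiodic checkpointing), which matches the qualitative claim in the abstract.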

[1] L.B. Boguslavsky, E.G. Coffman, E.N. Gilbert, and A.Y. Kreinin, “Scheduling Checks and Saves,” ORSA J. Computing, vol. 4, no. 1, pp. 60-69, 1992.
[2] J.L. Bruno and E.G. Coffman, “Optimal Fault-Tolerant Computing on Multiprocessor Systems,” Acta Informatica, vol. 34, pp. 881-904, 1997.
[3] J.L. Bruno, E.G. Coffman, J.C. Lagarias, T.J. Richardson, and P.W. Shor, “Processor Shadowing: Maximizing Expected Throughput in Fault-Tolerant Systems,” Math. Operations Research, vol. 24, no. 2, pp. 362-382, May 1999.
[4] K.M. Chandy, J.C. Browne, C.W. Dissly, and W.R. Uhrig, “Analytic Models for Rollback and Recovery Strategies in Database Systems,” IEEE Trans. Software Eng., vol. 1, no. 1, pp. 100-110, Mar. 1975.
[5] K.M. Chandy, “A Survey of Analytic Models for Rollback and Recovery Strategies,” Computer, vol. 8, no. 5, pp. 40-47, 1975.
[6] E.G. Coffman and E.N. Gilbert, “Optimal Strategies for Scheduling Checkpoints and Preventive Maintenance,” IEEE Trans. Reliability, vol. 39, pp. 9-18, Apr. 1990.
[7] E.G. Coffman, L. Flatto, and P.E. Wright, “A Stochastic Checkpoint Optimization Problem,” SIAM J. Computing, vol. 22, no. 3, pp. 650-659, June 1993.
[8] A. Duda, “The Effects of Checkpointing on Program Execution Time,” Information Processing Letters, vol. 16, no. 5, pp. 221-229, June 1983.
[9] P. L’Ecuyer and J. Malenfant, “Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems,” IEEE Trans. Computers, vol. 37, no. 4, pp. 491-496, 1988.
[10] E. Gelenbe and M. Hernandez, “Optimum Checkpoints with Age Dependent Failures,” Acta Informatica, vol. 27, pp. 519-531, 1990.
[11] A. Goyal, V.F. Nicola, A. Tantawi, and K. Trivedi, “Reliability of System with Limited Repairs,” IEEE Trans. Reliability, vol. 36, no. 2, pp. 202-207, 1987.
[12] V. Grassi, L. Donatiello, and S. Tucci, “On the Optimal Checkpointing of Critical Tasks and Transaction-Oriented Systems,” IEEE Trans. Software Eng., vol. 18, no. 1, pp. 72-77, 1992.
[13] C.M. Krishna, K.G. Shin, and Y.H. Lee, “Optimization Criteria for Checkpoint Placements,” Comm. ACM, vol. 27, no. 10, pp. 1008-1012, Oct. 1984.
[14] C.H.C. Leung and Q.H. Choo, “On the Execution of Large Batch Programs in Unreliable Computing Systems,” IEEE Trans. Software Eng., vol. 10, no. 4, pp. 444-450, July 1984.
[15] J. Mi, “Interval Estimation of Availability of a Series System,” IEEE Trans. Reliability, vol. 40, pp. 541-546, 1991.
[16] V.F. Nicola and J.M. van Spanje, “Comparative Analysis of Different Models of Checkpointing and Recovery,” IEEE Trans. Software Eng., vol. 16, no. 8, pp. 807-821, Aug. 1990.
[17] V.F. Nicola, “Checkpointing and the Modeling of Program Execution Time,” Software Fault Tolerance, M.R. Lyu, ed., pp. 167-188, John Wiley & Sons, 1995.
[18] J.S. Plank, K. Li, and M.A. Puening, “Diskless Checkpointing,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
[19] S.M. Ross, Stochastic Processes. New York: Wiley, 1996.
[20] K. Shin, T.-H. Lin, and Y.-H. Lee, “Optimal Checkpointing of Real-Time Tasks,” IEEE Trans. Computers, vol. 36, no. 11, pp. 1328-1341, Nov. 1987.
[21] E. de Souza e Silva and H.R. Gail, “Calculating Cumulative Operational Time Distributions of Repairable Computer Systems,” IEEE Trans. Computers, vol. 35, no. 4, pp. 322-332, Apr. 1986.
[22] U. Sumita, N. Kaio, and P.B. Goes, “Analysis of Effective Service Time with Age Dependent Interruptions and Its Application to Optimal Rollback Policy for Database Management,” Queueing Systems: Theory and Applications, vol. 4, pp. 193-212, 1989.
[23] A.N. Tantawi and M. Ruschitzka, “Performance Analysis of Checkpointing Strategies,” ACM Trans. Computer Systems, vol. 2, pp. 123-144, May 1984.
[24] S. Toueg and Ö. Babaoglu, “On the Optimum Checkpoint Selection Problem,” SIAM J. Computing, vol. 13, pp. 630-649, Aug. 1984.
[25] J.W. Young, “A First Order Approximation to the Optimum Checkpoint Interval,” Comm. ACM, vol. 17, pp. 530-531, Sept. 1974.

Index Terms:
Aperiodic checkpointing, periodic checkpointing, system failure rate.
Yibei Ling, Jie Mi, Xiaola Lin, "A Variational Calculus Approach to Optimal Checkpoint Placement," IEEE Transactions on Computers, vol. 50, no. 7, pp. 699-708, July 2001, doi:10.1109/12.936236