An On-Line Algorithm for Checkpoint Placement
September 1997 (vol. 46 no. 9)
pp. 976-985

Abstract: Checkpointing reduces the time needed to recover from a fault by saving intermediate states of the program in reliable storage. The length of the intervals between checkpoints affects the execution time of programs: long intervals lead to long reprocessing times after a fault, while overly frequent checkpointing leads to high checkpointing overhead. In this paper, we present an on-line algorithm for the placement of checkpoints. The algorithm uses knowledge of the current cost of a checkpoint when it decides whether or not to place one. The total execution-time overhead when the proposed algorithm is used is smaller than the overhead when fixed intervals are used. Although the proposed algorithm uses only on-line knowledge about the cost of checkpointing, its behavior is close to that of the off-line optimal algorithm, which uses complete knowledge of the checkpointing cost.
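
The trade-off between long and short checkpoint intervals can be illustrated with a minimal sketch. This is not the on-line algorithm of the paper, whose details the abstract does not give. It assumes a textbook first-order model with a constant checkpoint cost, a constant fault rate, and a fault that on average destroys half an interval of work. The function and parameter names (`overhead_rate`, `best_fixed_interval`, `checkpoint_cost`, `failure_rate`) are hypothetical.

```python
import math


def overhead_rate(interval, checkpoint_cost, failure_rate):
    """First-order expected overhead per unit of useful work (assumed model).

    checkpoint_cost / interval   : time spent saving checkpoints
    failure_rate * interval / 2  : expected reprocessing, assuming a fault
                                   lands on average halfway through an interval
    """
    return checkpoint_cost / interval + failure_rate * interval / 2.0


def best_fixed_interval(checkpoint_cost, failure_rate):
    """Interval that minimizes overhead_rate under the assumed model.

    Setting the derivative -C/T^2 + rate/2 to zero gives T = sqrt(2C / rate).
    """
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


if __name__ == "__main__":
    cost, rate = 2.0, 0.01  # illustrative values, not taken from the paper
    best = best_fixed_interval(cost, rate)
    # Compare an interval that is too short, the minimizer, and one too long.
    for interval in (best / 4, best, best * 4):
        print(interval, overhead_rate(interval, cost, rate))
```

In this simplified model both a too-short and a too-long interval cost more than the minimizing one. A fixed interval is tuned to a single assumed checkpoint cost, which is the limitation the abstract's on-line approach addresses by looking at the current cost of a checkpoint.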


Index Terms:
Fault-tolerant computing, checkpointing, on-line algorithm, performance optimization.
Avi Ziv, Jehoshua Bruck, "An On-Line Algorithm for Checkpoint Placement," IEEE Transactions on Computers, vol. 46, no. 9, pp. 976-985, Sept. 1997, doi:10.1109/12.620479