Analysis of Checkpointing Schemes with Task Duplication
February 1998 (vol. 47 no. 2)
pp. 222-227

Abstract—This paper suggests a technique for analyzing the performance of checkpointing schemes with task duplication. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. The analysis results are used to study and compare the performance of four existing checkpointing schemes. Our comparison results show that, in general, the number of processors used, not the complexity of the scheme, has the most effect on the scheme performance.
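As a rough illustration of the kind of quantity the abstract refers to, the sketch below estimates the average execution time of a duplicated task under a deliberately simple model. The model is an assumption made here for illustration and is not the paper's analysis: the task runs on two processors, is split into equal intervals, each interval ends with a checkpoint where the two states are compared, faults arrive independently on each processor as a Poisson process, and any fault during an interval forces that interval to be re-executed. All names and parameters (`expected_time`, `simulate`, `fault_rate`, `checkpoint_cost`) are hypothetical.

```python
import math
import random


def expected_time(task_len, n_intervals, fault_rate, checkpoint_cost):
    """Closed-form mean execution time under the assumed rollback model."""
    tau = task_len / n_intervals + checkpoint_cost  # length of one interval attempt
    p_ok = math.exp(-2.0 * fault_rate * tau)        # neither of the two copies faults
    return n_intervals * tau / p_ok                 # geometric number of attempts per interval


def attempt_fails(rng, fault_rate, tau):
    """True if either processor sees a fault before the interval ends (assumed model)."""
    if fault_rate <= 0:
        return False
    return any(rng.expovariate(fault_rate) < tau for _ in range(2))


def simulate(task_len, n_intervals, fault_rate, checkpoint_cost, runs=5000, seed=0):
    """Monte Carlo estimate of the same mean, as a sanity check on the formula."""
    rng = random.Random(seed)
    tau = task_len / n_intervals + checkpoint_cost
    total = 0.0
    for _ in range(runs):
        for _ in range(n_intervals):
            while True:
                total += tau                        # pay for this attempt
                if not attempt_fails(rng, fault_rate, tau):
                    break                           # states match, move to next interval
    return total / runs


if __name__ == "__main__":
    print(expected_time(100.0, 10, 0.01, 1.0))
    print(simulate(100.0, 10, 0.01, 1.0))
```

The schemes compared in the paper differ in how many processors they use and in how they recover after a mismatch, so this two-processor rollback model should be read only as a baseline for intuition.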

Index Terms:
Parallel computing, fault tolerance, checkpointing, task duplication, Markov Reward Model.
Avi Ziv, Jehoshua Bruck, "Analysis of Checkpointing Schemes with Task Duplication," IEEE Transactions on Computers, vol. 47, no. 2, pp. 222-227, Feb. 1998, doi:10.1109/12.663769