Performance Optimization of Checkpointing Schemes with Task Duplication
December 1997 (vol. 46 no. 12)
pp. 1381-1386

Abstract: In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults by comparing the processors' states at checkpoints, and reducing fault recovery time by supplying a safe point to roll back to. In this paper, we show that tuning the checkpointing scheme to a given architecture can significantly reduce execution time. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, the comparison and storage operations can each be used where they are most efficient, which improves the performance of the checkpointing scheme. Our results show that, in some cases, using compare and store checkpoints can reduce the overhead of DMR (duplex modular redundancy) checkpointing schemes by as much as 30 percent.
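
The trade-off between the two checkpoint types can be illustrated with a small deterministic model. The sketch below is not the analysis from the paper; it is a toy simulation under several simplifying assumptions: the task is split into equal work intervals, a fault corrupts one replica during the first execution of a listed interval, every comparison detects an existing corruption, and on a mismatch the stored but unverified states are compared newest first to find the rollback point. The function name `execution_time`, its parameters, and the cost values are hypothetical.

```python
def execution_time(kinds, faults, work=1.0, t_store=0.25, t_compare=0.25):
    """Total time for a duplicated task under a toy checkpointing model.

    kinds[i]  : checkpoint placed after interval i: "store", "compare" or "both"
    faults    : interval indices hit by a fault on their first execution (assumed)
    """
    if kinds[-1] not in ("compare", "both"):
        raise ValueError("the final checkpoint must compare states")
    pending = set(faults)
    elapsed = 0.0
    pos = 0
    verified = 0         # restart interval known to follow a fault-free state
    unverified = []      # intervals whose end state was stored but not compared
    corrupt_from = None  # first interval with a fault that is still undetected

    while pos < len(kinds):
        elapsed += work
        if pos in pending:
            pending.discard(pos)
            if corrupt_from is None:
                corrupt_from = pos
        kind = kinds[pos]
        if kind in ("store", "both"):
            elapsed += t_store
        if kind == "store":
            # State saved without comparison: cheap, but not yet trusted.
            unverified.append(pos)
            pos += 1
            continue

        elapsed += t_compare
        if corrupt_from is None:
            # States match, so every earlier stored state is fault-free too.
            if kind == "both":
                verified = pos + 1
            elif unverified:
                verified = unverified[-1] + 1
            unverified = []
            pos += 1
        else:
            # Mismatch: compare stored states, newest first, until one matches.
            restart = verified
            for s in reversed(unverified):
                elapsed += t_compare
                if s < corrupt_from:
                    restart = s + 1
                    break
            verified = restart
            unverified = []
            corrupt_from = None
            pos = restart  # roll back and re-execute from the safe point
    return elapsed
```

For example, with an expensive comparison (`t_compare=1.0`, `t_store=0.25`), replacing three of four combined checkpoints by store-checkpoints shortens the fault-free run in this model, while a cheap comparison and expensive store favors inserting compare-checkpoints between combined ones. Which schedule wins depends on the cost ratio and the fault pattern, which is the kind of architecture-specific tuning the abstract refers to.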

Index Terms:
Fault-tolerant computing, checkpointing, task duplication, parallel computing, performance optimization.
Citation:
Avi Ziv, Jehoshua Bruck, "Performance Optimization of Checkpointing Schemes with Task Duplication," IEEE Transactions on Computers, vol. 46, no. 12, pp. 1381-1386, Dec. 1997, doi:10.1109/12.641939