Damage Assessment for Optimal Rollback Recovery
May 1998 (vol. 47 no. 5)
pp. 603-613

Abstract—Conventional schemes of rollback recovery with checkpointing for concurrent processes have overlooked an important problem: contamination of checkpoints as a result of error propagation among the cooperating processes. Error propagation is unavoidable due to imperfect detection mechanisms and random interprocess communications, and it could give rise to contaminated checkpoints which, in turn, result in unsuccessful rollbacks. To counter the problem of error propagation, a damage assessment model is developed to estimate the correctness of saved checkpoints under various circumstances. Using the result of damage assessment, determination of the "optimal" checkpoints for rollback recovery—which minimize the average total recovery overhead—is formulated and solved as a nonlinear integer programming problem. Integration of damage assessment into existing recovery schemes is also discussed.
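
As a rough illustration of the trade-off the abstract describes, the sketch below picks the checkpoint at which to begin rollback, given an estimated probability that each checkpoint is uncontaminated. It is a toy single-process model, not the formulation used in the paper: the names `expected_overhead`, `best_start`, `p_correct`, and `cost` are hypothetical, contamination is assumed monotone (if a checkpoint is correct, every older one is too), the oldest checkpoint is assumed correct, and the search is plain enumeration rather than nonlinear integer programming.

```python
def expected_overhead(start, p_correct, cost):
    """Expected total overhead when rollback begins at checkpoint `start`
    and falls back to the next older checkpoint after each failed attempt.

    Checkpoints are indexed from newest (0) to oldest (n - 1).
    p_correct[i] is the assumed probability that checkpoint i is uncontaminated;
    cost[i] is the assumed overhead of one rollback attempt to checkpoint i.
    """
    total = cost[start]
    for j in range(start + 1, len(cost)):
        # Under the monotone-contamination assumption, attempt j is needed
        # exactly when checkpoint j - 1 turned out to be contaminated.
        total += (1.0 - p_correct[j - 1]) * cost[j]
    return total


def best_start(p_correct, cost):
    """Enumerate every starting checkpoint; return (index, expected overhead)."""
    candidates = [
        (expected_overhead(k, p_correct, cost), k) for k in range(len(cost))
    ]
    overhead, k = min(candidates)
    return k, overhead


if __name__ == "__main__":
    # Made-up numbers: the newest checkpoint is cheap but very likely contaminated.
    p_correct = [0.1, 0.9, 1.0]
    cost = [2.0, 3.0, 10.0]
    print(best_start(p_correct, cost))
```

With these made-up numbers the search skips the newest checkpoint and starts at checkpoint 1, because the likely failure of checkpoint 0 makes trying it first more expensive on average than going straight to an older, more trustworthy one.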

Index Terms:
Damage assessment, error propagation, rollback recovery, checkpointing, nonlinear integer programming.
Tein-Hsiang Lin, Kang G. Shin, "Damage Assessment for Optimal Rollback Recovery," IEEE Transactions on Computers, vol. 47, no. 5, pp. 603-613, May 1998, doi:10.1109/12.677255