Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System
June 1993 (vol. 19 no. 6)
pp. 571-583

This paper proposes an approach to checkpointing and rollback recovery in a distributed computing system that uses a common time base. The time base is established with a hardware clock synchronization algorithm and is coupled with the idea of pseudo-recovery points to develop a checkpointing algorithm with three advantages: a shorter wait for commitment when establishing recovery lines, fewer messages exchanged, and lower memory requirements. These advantages are assessed quantitatively with a probabilistic model.
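
The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea of clock-driven checkpointing, not the authors' method. It assumes that each process reads a clock kept close to the others by some synchronization mechanism, that checkpoints are due whenever the local clock crosses a multiple of a fixed interval, and that messages carry the sender's checkpoint index. The names `Process`, `interval`, `epoch`, and `recovery_line` are hypothetical, and pseudo-recovery points and in-transit message logging are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    interval: float                  # checkpoint period (hypothetical parameter)
    epoch: int = 0                   # index of the latest checkpoint taken
    state: int = 0                   # toy application state
    checkpoints: dict = field(default_factory=dict)

    def __post_init__(self):
        # Checkpoint 0 is the initial state.
        self.checkpoints[0] = self.state

    def _checkpoint(self) -> None:
        self.epoch += 1
        self.checkpoints[self.epoch] = self.state

    def tick(self, local_clock: float) -> None:
        """Take every checkpoint that is due according to the local clock."""
        due = int(local_clock // self.interval)
        while self.epoch < due:
            self._checkpoint()

    def send(self, local_clock: float, payload: int) -> tuple:
        self.tick(local_clock)
        # Tag the message with the sender's checkpoint index.
        return (self.epoch, payload)

    def receive(self, local_clock: float, message: tuple) -> None:
        self.tick(local_clock)
        sender_epoch, payload = message
        # Clock skew: the sender may already be past a checkpoint that this
        # process has not yet taken. Checkpoint before delivery so that the
        # message cannot be recorded as received but not yet sent.
        while self.epoch < sender_epoch:
            self._checkpoint()
        self.state += payload


def recovery_line(processes):
    """Latest checkpoint index that every process has reached, with states."""
    k = min(p.epoch for p in processes)
    return k, {p.name: p.checkpoints[k] for p in processes}


p = Process("p", interval=10.0)
q = Process("q", interval=10.0)
msg = q.send(10.2, payload=7)   # q's clock has just passed the checkpoint time
p.receive(9.9, msg)             # p's clock lags slightly behind q's
line = recovery_line([p, q])
```

In this toy run, `p` is forced to take checkpoint 1 before delivering the message, so checkpoint 1 of both processes forms a line that no message crosses backwards, and no extra coordination messages were needed to establish it.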

Index Terms:
message exchange; distributed system; checkpointing; rollback recovery; common time base; hardware clock synchronization algorithm; pseudo-recovery points; recovery lines; memory requirement; probabilistic model; distributed processing; fault tolerant computing; system recovery
P. Ramanathan, K.G. Shin, "Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System," IEEE Transactions on Software Engineering, vol. 19, no. 6, pp. 571-583, June 1993, doi:10.1109/32.232022