Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems.
May 1995 (vol. 6 no. 5)
pp. 546-554

Abstract—Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Previous to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by $N(N+1)/2$ checkpoints where $N$ is the number of processes.
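To give a sense of scale for the bound quoted in the abstract, the short sketch below evaluates $N(N+1)/2$ for a few system sizes. The function name `max_retained_checkpoints` is hypothetical and does not come from the paper. The code only evaluates the formula and does not implement the reclamation algorithm itself.

```python
def max_retained_checkpoints(num_processes: int) -> int:
    """Evaluate the N(N+1)/2 bound stated in the abstract.

    Assumption: num_processes is N, the number of processes in the
    system. This helper is illustrative only and is not the paper's
    algorithm.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be a positive integer")
    # N(N+1) is always even, so integer division is exact.
    return num_processes * (num_processes + 1) // 2


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(f"N = {n:2d}: at most {max_retained_checkpoints(n)} checkpoints")
```

The bound grows quadratically in the number of processes and does not depend on how long the system has been running, which is what distinguishes it from the potentially unbounded retention described for the obsolete-checkpoint approach.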

Index Terms—Fault tolerance, message-passing systems, uncoordinated checkpointing, rollback recovery, garbage collection.

Yi-Min Wang, Pi-Yu Chung, In-Jen Lin, W. Kent Fuchs, "Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 5, pp. 546-554, May 1995, doi:10.1109/71.382324