Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
April 1997 (vol. 46 no. 4)
pp. 456-468

Abstract: In this paper, we consider the problem of constructing consistent global checkpoints that contain a given set of local checkpoints. We address three important issues related to this problem. First, we define the maximum and minimum consistent global checkpoints containing a set S, and give algorithms to construct them. These algorithms are based on reachability analysis on a rollback-dependency graph. Second, we introduce a concept called "rollback-dependency trackability" that enables this analysis to be performed efficiently for a certain class of checkpoint and communication models. We define the least stringent of these models ("FDAS") and relate it to other models defined in the literature. A significant result is that FDAS can be used to provide efficient rollback recovery for applications that do not satisfy perfect piecewise determinism. Finally, we describe several applications of the theorems and algorithms derived in this paper to demonstrate that our approach unifies, generalizes, and extends many previous works.
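
The reachability idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm as published; it is a minimal Python illustration under stated assumptions. The graph encoding, the interval numbering (interval x of a process runs from its checkpoint x to its checkpoint x+1), the requirement that every message is sent and received before each process's last checkpoint, and the names build_graph, reachable, max_consistent, and min_consistent are all hypothetical choices made for this example.

```python
from collections import defaultdict

def build_graph(num_ckpts, messages):
    """Build a rollback-dependency-style graph (encoding assumed for this sketch).

    num_ckpts[p] is the number of checkpoints of process p (indices 0..n-1).
    messages holds (sender, send_interval, receiver, recv_interval) tuples.
    """
    succ = defaultdict(set)
    for p, n in enumerate(num_ckpts):
        for x in range(n - 1):
            succ[(p, x)].add((p, x + 1))       # consecutive checkpoints
    for i, a, j, b in messages:
        # Undoing the send interval forces undoing the receive interval, so
        # link the checkpoints that END those two intervals.
        succ[(i, a + 1)].add((j, b + 1))
    return succ

def reachable(adj, roots):
    """Return every node reachable from roots (roots included)."""
    seen, stack = set(roots), list(roots)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def max_consistent(num_ckpts, messages, targets):
    """Latest consistent global checkpoint containing targets, or None.

    targets maps a process id to the index of its required checkpoint.
    """
    succ = build_graph(num_ckpts, messages)
    roots = [(p, x + 1) for p, x in targets.items() if x + 1 < num_ckpts[p]]
    marked = reachable(succ, roots)            # checkpoints that are too late
    if any((p, x) in marked for p, x in targets.items()):
        return None                            # targets conflict with each other
    return [max(x for x in range(n) if (p, x) not in marked)
            for p, n in enumerate(num_ckpts)]

def min_consistent(num_ckpts, messages, targets):
    """Earliest consistent global checkpoint containing targets, or None."""
    succ = build_graph(num_ckpts, messages)
    pred = defaultdict(set)
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)                     # reversed edges
    can_reach = reachable(pred, list(targets.items()))
    if any((p, x + 1) in can_reach for p, x in targets.items()):
        return None                            # targets conflict with each other
    return [max((x for x in range(n) if (p, x) in can_reach), default=0)
            for p, n in enumerate(num_ckpts)]

# Hypothetical run: three processes with three checkpoints each.
num_ckpts = [3, 3, 3]
messages = [
    (0, 0, 1, 0),  # P0 interval 0 -> P1 interval 0
    (1, 1, 2, 0),  # P1 interval 1 -> P2 interval 0
    (2, 1, 1, 1),  # P2 interval 1 -> P1 interval 1
]
print(max_consistent(num_ckpts, messages, {1: 1}))        # [2, 1, 0]
print(min_consistent(num_ckpts, messages, {1: 1}))        # [1, 1, 0]
print(max_consistent(num_ckpts, messages, {1: 1, 2: 1}))  # None
```

In this toy run the set S contains only checkpoint 1 of process P1. Marking forward from the checkpoint that follows it rules out checkpoints 1 and 2 of P2, and marking backward from it shows that P0 must have advanced at least to its checkpoint 1. The last call returns None because the second message is sent after P1's checkpoint 1 and received before P2's checkpoint 1, so those two checkpoints cannot belong to the same consistent global checkpoint.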

Index Terms:
Algorithms, distributed systems, consistent global states, distributed debugging, deadlock recovery, fault tolerance, checkpointing, rollback recovery, message logging, vector timestamps.
Citation:
Yi-Min Wang, "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints," IEEE Transactions on Computers, vol. 46, no. 4, pp. 456-468, April 1997, doi:10.1109/12.588059