Low-Latency, Concurrent Checkpointing for Parallel Programs
August 1994 (vol. 5 no. 8)
pp. 874-879

Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
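The abstract does not spell out the copy-on-write variation, so the following is only a minimal sketch of the general idea, not the authors' implementation. It simulates page protection in plain Python rather than using real virtual-memory hardware; the class `CowCheckpointer` and its methods `begin_checkpoint`, `write`, `copy_step`, and `image` are hypothetical names chosen for illustration. The point it tries to show is that the program is interrupted only while pages are marked protected, and the expensive copying then proceeds while the program keeps running.

```python
class CowCheckpointer:
    """Toy page-level copy-on-write checkpointer (a simulation, assumed design)."""

    def __init__(self, memory, page_size):
        self.memory = memory          # bytearray standing in for program memory
        self.page_size = page_size
        self.num_pages = (len(memory) + page_size - 1) // page_size
        self.protected = set()        # pages not yet saved in this checkpoint
        self.snapshot = {}            # page index -> bytes saved for the checkpoint

    def _save(self, page):
        # Copy the page's current contents into the checkpoint and unprotect it.
        start = page * self.page_size
        self.snapshot[page] = bytes(self.memory[start:start + self.page_size])
        self.protected.discard(page)

    def begin_checkpoint(self):
        # The only step that interrupts the program: protect every page.
        self.protected = set(range(self.num_pages))
        self.snapshot = {}

    def write(self, addr, value):
        # A program store. Writing to a protected page saves the old page first.
        page = addr // self.page_size
        if page in self.protected:
            self._save(page)
        self.memory[addr] = value

    def copy_step(self):
        # Background copier: save one protected page; False when none remain.
        if not self.protected:
            return False
        self._save(min(self.protected))
        return True

    def image(self):
        # The checkpoint image; valid only once every page has been saved.
        assert not self.protected
        return b"".join(self.snapshot[p] for p in range(self.num_pages))


mem = bytearray(b"abcdefgh")
ckpt = CowCheckpointer(mem, page_size=2)
ckpt.begin_checkpoint()
ckpt.copy_step()            # copier saves page 0
ckpt.write(0, ord("X"))     # page 0 already saved, so no extra copy
ckpt.write(6, ord("Y"))     # page 3 still protected, so it is saved before the write
while ckpt.copy_step():     # copier finishes the remaining pages
    pass
```

In this sketch the checkpoint image reflects memory as it was when `begin_checkpoint` was called, even though the program modified two pages while the copy was still in progress.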

Index Terms:
parallel programming; fault tolerant computing; software reliability; system recovery; program diagnostics; low latency concurrent checkpointing; parallel programs; program restarting; shared-memory multiprocessors; metrics; overall checkpointing time; overhead; interruption time; efficiency; copy-on-write; overlapping operations; fault tolerance; backward error recovery
K. Li, J.F. Naughton, J.S. Plank, "Low-Latency, Concurrent Checkpointing for Parallel Programs," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 8, pp. 874-879, Aug. 1994, doi:10.1109/71.298215