Issue No. 08 - August (1994 vol. 5)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/71.298215
<p>Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.</p>
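The copy-on-write idea mentioned in the abstract can be illustrated with a toy, single-threaded simulation. This is a minimal sketch under stated assumptions, not the algorithm measured in the paper: the class `CowCheckpoint`, its methods, and the page layout are all hypothetical names chosen for illustration, and a real checkpointer would rely on hardware page protection rather than an explicit `write` wrapper. The sketch shows only why the saved image stays consistent while the program keeps modifying memory during the copy.

```python
class CowCheckpoint:
    """Toy page-level copy-on-write checkpoint (all names are hypothetical)."""

    def __init__(self, memory):
        self.memory = memory   # list of bytearray "pages" owned by the program
        self.pending = set()   # pages still write-protected
        self.saved = {}        # page index -> bytes captured for the checkpoint

    def begin(self):
        # The only step that would pause the program: protect every page.
        self.pending = set(range(len(self.memory)))
        self.saved = {}

    def write(self, page, offset, value):
        # Program-side store. A store to a protected page "faults": the old
        # contents are copied out first, then the page is unprotected.
        if page in self.pending:
            self._save(page)
        self.memory[page][offset] = value

    def copy_one(self):
        # Checkpointer-side step, interleaved with the program.
        # Returns False once nothing is left to copy.
        if not self.pending:
            return False
        self._save(min(self.pending))
        return True

    def _save(self, page):
        self.saved[page] = bytes(self.memory[page])
        self.pending.discard(page)

    def image(self):
        # Checkpoint image in page order; valid once copy_one() returned False.
        return [self.saved[i] for i in range(len(self.memory))]


memory = [bytearray([i] * 4) for i in range(3)]
snapshot = [bytes(p) for p in memory]   # reference copy taken at checkpoint time

ckpt = CowCheckpoint(memory)
ckpt.begin()
ckpt.write(2, 0, 99)   # program modifies a page the checkpointer has not reached
ckpt.copy_one()        # checkpointer copies page 0
ckpt.write(0, 1, 77)   # page 0 is already saved, so this store is not intercepted
while ckpt.copy_one():
    pass               # checkpointer finishes the remaining pages
```

In this sketch the program is interrupted only by `begin()` and by the first store to each still-protected page; the bulk of the copying happens in `copy_one()`, which in a real system would run concurrently with the program.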
Index Terms: parallel programming; fault tolerant computing; software reliability; system recovery; program diagnostics; low latency concurrent checkpointing; parallel programs; program restarting; shared-memory multiprocessors; metrics; overall checkpointing time; overhead; interruption time; efficiency; copy-on-write; overlapping operations; fault tolerance; backward error recovery
K. Li, J. Naughton and J. Plank, "Low-Latency, Concurrent Checkpointing for Parallel Programs," in IEEE Transactions on Parallel & Distributed Systems, vol. 5, no. 8, pp. 874-879, 1994.