<p>Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.</p>
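The copy-on-write idea mentioned in the abstract can be illustrated with a small, single-threaded simulation. This is only a sketch under stated assumptions, not the algorithm measured in the paper: the class `CowMemory` and its methods are hypothetical names, pages are plain byte arrays, and "protection" is a set of page indices rather than real virtual-memory protection. It shows the general principle that the program is interrupted only long enough to mark pages as protected, while the expensive copying is overlapped with continued execution.

```python
class CowMemory:
    """Toy paged memory with a copy-on-write checkpoint (illustrative only)."""

    def __init__(self, num_pages, page_size):
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.protected = set()  # pages not yet saved in the current checkpoint
        self.snapshot = {}      # page index -> contents as of checkpoint start

    def begin_checkpoint(self):
        # The only step that interrupts the program: mark every page protected.
        self.snapshot = {}
        self.protected = set(range(len(self.pages)))

    def _save(self, idx):
        # Copy the page into the checkpoint image and lift its protection.
        self.snapshot[idx] = bytes(self.pages[idx])
        self.protected.discard(idx)

    def write(self, idx, offset, value):
        # A program write to a still-protected page saves the old contents first.
        if idx in self.protected:
            self._save(idx)
        self.pages[idx][offset] = value

    def copier_step(self):
        # Background copier: save one still-protected page per call.
        # Returns False once no protected pages remain.
        if not self.protected:
            return False
        self._save(min(self.protected))
        return True

    def finish_checkpoint(self):
        # Drain the remaining pages and return the image in page order.
        while self.copier_step():
            pass
        return [self.snapshot[i] for i in range(len(self.pages))]
```

In this sketch, writes that happen after `begin_checkpoint()` never leak into the returned image, because each page is copied either by the copier or by the first write that touches it, whichever comes first. A real implementation on a shared-memory multiprocessor would presumably rely on hardware page protection and a concurrent copier thread, which this simulation does not attempt to model.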
Index Terms: parallel programming; fault tolerant computing; software reliability; system recovery; program diagnostics; low latency concurrent checkpointing; parallel programs; program restarting; shared-memory multiprocessors; metrics; overall checkpointing time; overhead; interruption time; efficiency; copy-on-write; overlapping operations; fault tolerance; backward error recovery
K. Li, J.F. Naughton, J.S. Plank, "Low-Latency, Concurrent Checkpointing for Parallel Programs", IEEE Transactions on Parallel & Distributed Systems, vol. 5, no. , pp. 874-879, August 1994, doi:10.1109/71.298215