Reliable Distributed Systems, IEEE Symposium on (1995)
Bad Neuenahr, Germany
Sept. 13, 1995 to Sept. 15, 1995
G. Cabillic, IRISA, Rennes, France
G. Muller, IRISA, Rennes, France
I. Puaut, IRISA, Rennes, France
This paper presents the design and implementation of a consistent checkpointing scheme for distributed shared memory (DSM) systems. Our approach relies on the integration of checkpoints within synchronization barriers already existing in applications; this avoids the need to introduce an additional synchronization mechanism. The main advantage of our checkpointing mechanism is that performance degradation arises only when a checkpoint is being taken; hence, the programmer can adjust the trade-off between the cost of checkpointing and the cost of longer rollbacks by adjusting the time between two successive checkpoints. The paper compares several implementations of the proposed consistent checkpointing mechanism (incremental, non-blocking, and pre-flushing) on the Intel Paragon multicomputer for several parallel scientific applications. Performance measurements show that a careful optimization of the checkpointing protocol can reduce the time overhead of checkpointing from 8% to 0.04% of the application duration for a 6-minute checkpointing interval.
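The idea of taking checkpoints inside barriers that the application already uses can be illustrated with a small single-process sketch. It is not the authors' Paragon implementation: threads stand in for DSM nodes, an in-memory list stands in for stable storage, and the names `CheckpointingBarrier`, `interval_s`, and `run_demo` are hypothetical. The incremental, non-blocking, and pre-flushing variants compared in the paper are not modeled here.

```python
import copy
import threading
import time


class CheckpointingBarrier:
    """Barrier that may save a copy of shared state once all workers arrive."""

    def __init__(self, n_workers, shared_state, interval_s):
        self.shared_state = shared_state
        self.interval_s = interval_s   # minimum time between two checkpoints
        self.checkpoints = []          # assumption: stand-in for stable storage
        self._last = time.monotonic()
        # The action runs in exactly one thread, after every worker has
        # arrived and before any is released, so the saved state is consistent.
        self._barrier = threading.Barrier(n_workers, action=self._maybe_checkpoint)

    def _maybe_checkpoint(self):
        now = time.monotonic()
        if now - self._last >= self.interval_s:
            self.checkpoints.append(copy.deepcopy(self.shared_state))
            self._last = now

    def wait(self):
        self._barrier.wait()

    def rollback(self):
        """Return a copy of the latest checkpoint, or None if none was taken."""
        return copy.deepcopy(self.checkpoints[-1]) if self.checkpoints else None


def run_demo(n_workers=4, n_steps=3, interval_s=0.0):
    state = {"counters": [0] * n_workers}
    barrier = CheckpointingBarrier(n_workers, state, interval_s)

    def worker(rank):
        for _ in range(n_steps):
            state["counters"][rank] += 1  # each worker writes only its own slot
            barrier.wait()                # the application's existing barrier

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier
```

In this sketch, `interval_s` plays the role of the tunable checkpointing interval: a larger value means fewer checkpoints (lower overhead) at the price of losing more work on rollback, while barriers that do not trigger a checkpoint add only a timestamp comparison.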
synchronisation; message passing; shared memory systems; distributed memory systems; program debugging; software performance evaluation; performance; consistent checkpointing; distributed shared memory systems; synchronization barriers; performance degradation; rollbacks; Intel Paragon multicomputer; parallel scientific applications
G. Muller, G. Cabillic and I. Puaut, "The performance of consistent checkpointing in distributed shared memory systems," Reliable Distributed Systems, IEEE Symposium on (SRDS), Bad Neuenahr, Germany, 1995, pp. 96.