First IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01)
A Two-Level Checkpoint Algorithm in a Highly-Available Parallel Single Level Store System
Brisbane, Australia
May 15-18, 2001
ISBN: 0-7695-1010-8
A Parallel Single Level Store (PSLS) system integrates a shared virtual memory and a parallel file system. By managing data globally, such systems give programmers of scientific applications the attractive shared-memory programming model combined with a large and efficient file system in a cluster. In this paper, we present an inexpensive and efficient two-level checkpointing approach that enables a PSLS to tolerate failures. The first-level checkpointing algorithm is very efficient and saves data in memory, but it requires a large amount of memory space. When memory is saturated, an alternative algorithm that saves the checkpoint on disk is used instead. Performance results show the impact of different variants of the checkpointing algorithms.
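The fallback between the two levels can be illustrated with a minimal single-process sketch. It is not the paper's algorithm: the class name TwoLevelCheckpointer, the memory_budget_bytes threshold, the page identifiers, and the use of pickle files are all assumptions made for illustration, and the sketch ignores distribution, replication across nodes, and recovery coordination.

```python
import os
import pickle
import tempfile


class TwoLevelCheckpointer:
    """Illustrative only: memory-first checkpointing with a disk fallback."""

    def __init__(self, memory_budget_bytes, disk_dir):
        # Hypothetical saturation threshold for level-1 (in-memory) copies.
        self.memory_budget_bytes = memory_budget_bytes
        self.disk_dir = disk_dir
        self.in_memory = {}  # page_id -> serialized copy (level 1)
        self.on_disk = {}    # page_id -> checkpoint file path (level 2)
        self.memory_used = 0

    def save(self, page_id, data):
        """Checkpoint one page; return which level stored it."""
        blob = pickle.dumps(data)
        self._drop(page_id)  # discard any older checkpoint of this page
        if self.memory_used + len(blob) <= self.memory_budget_bytes:
            self.in_memory[page_id] = blob
            self.memory_used += len(blob)
            return "memory"
        # Memory is saturated: fall back to the disk-based level.
        path = os.path.join(self.disk_dir, f"page_{page_id}.ckpt")
        with open(path, "wb") as f:
            f.write(blob)
        self.on_disk[page_id] = path
        return "disk"

    def restore(self, page_id):
        """Return the checkpointed value of a page from whichever level holds it."""
        if page_id in self.in_memory:
            return pickle.loads(self.in_memory[page_id])
        with open(self.on_disk[page_id], "rb") as f:
            return pickle.loads(f.read())

    def _drop(self, page_id):
        old = self.in_memory.pop(page_id, None)
        if old is not None:
            self.memory_used -= len(old)
        path = self.on_disk.pop(page_id, None)
        if path is not None:
            os.remove(path)


def demo():
    """Save two pages under a budget that only fits one of them in memory."""
    with tempfile.TemporaryDirectory() as d:
        ckpt = TwoLevelCheckpointer(memory_budget_bytes=150, disk_dir=d)
        levels = [ckpt.save(1, b"x" * 100), ckpt.save(2, b"y" * 100)]
        restored = [ckpt.restore(1), ckpt.restore(2)]
        return levels, restored
```

In this sketch the choice of level is made per page at save time; a real PSLS would make that decision in terms of node memory and would have to keep checkpoints recoverable after a node failure, which a local dictionary does not provide.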
Citation:
Christine Morin, Renaud Lottiaux, Anne-Marie Kermarrec, "A Two-Level Checkpoint Algorithm in a Highly-Available Parallel Single Level Store System," First IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01), p. 514, 2001