Issue No.04 - July/August (2011 vol.8)
Ge-Ming Chiu , National Taiwan University of Science and Technology, Taipei
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TDSC.2010.76
Diskless checkpointing is an important technique for providing fault tolerance in distributed and parallel computing systems. This study proposes a new approach that enhances neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called checkpoint storage nodes. In return, each processor uses simple XOR operations to store a collection of checkpoints for the processors for which it serves as a checkpoint storage node. This study defines a safe recovery criterion, which requires that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. This study further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these conditions. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
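The XOR-based storage and single-step recovery described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names (`build_parity`, `recover`), the four-processor layout, and the assumption that each surviving processor still holds its own checkpoint in memory are hypothetical choices made for illustration. The storage node sets here are picked ad hoc and are not claimed to satisfy the safe recovery criterion.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def build_parity(checkpoints, storage_sets):
    """Build the XOR block held by each storage node.

    checkpoints:  processor id -> checkpoint bytes (all the same length)
    storage_sets: processor id -> set of peer ids that store its checkpoint
    Returns (members, parity): members[n] is the set of processors whose
    checkpoints are XORed into the block parity[n] held at node n.
    """
    size = len(next(iter(checkpoints.values())))
    members = {n: set() for n in checkpoints}
    for proc, nodes in storage_sets.items():
        for n in nodes:
            members[n].add(proc)
    parity = {}
    for n, procs in members.items():
        acc = bytes(size)  # all-zero block is the XOR identity
        for p in procs:
            acc = xor_bytes(acc, checkpoints[p])
        parity[n] = acc
    return members, parity


def recover(failed_proc, failed, checkpoints, members, parity):
    """Try to rebuild failed_proc's checkpoint in a single step.

    A storage node is usable only if it survived and every other processor
    folded into its XOR block also survived (assumed to still hold its own
    checkpoint locally). Returns the recovered bytes, or None if no
    surviving node allows single-step recovery.
    """
    for node, procs in members.items():
        if node in failed or failed_proc not in procs:
            continue
        others = procs - {failed_proc}
        if others & failed:
            continue  # block mixes in another lost checkpoint
        acc = parity[node]
        for p in others:
            acc = xor_bytes(acc, checkpoints[p])  # cancel survivors' data
        return acc
    return None


# Hypothetical layout: processor p stores its checkpoint at p+1 and p+2 (mod 4).
checkpoints = {p: bytes([p + 1] * 4) for p in range(4)}
storage_sets = {p: {(p + 1) % 4, (p + 2) % 4} for p in range(4)}
members, parity = build_parity(checkpoints, storage_sets)
restored = recover(0, {0}, checkpoints, members, parity)
```

With this ad hoc layout a single failure is recoverable, but if processors 0 and 1 fail together, processor 0 cannot be rebuilt in one step, because every block containing its checkpoint is either lost or mixed with processor 1's lost checkpoint. Ruling out such cases is the role of the storage node set design that the abstract refers to.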
Diskless checkpointing, multiple failures, rollback recovery, XOR.
Ge-Ming Chiu, "A New Diskless Checkpointing Approach for Multiple Processor Failures", IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 4, pp. 481-493, July/August 2011, doi:10.1109/TDSC.2010.76