Issue No. 04 - July/August (2011 vol. 8)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TDSC.2010.76
Ge-Ming Chiu, National Taiwan University of Science and Technology, Taipei
Jane-Ferng Chiu, Tungnan University, Taipei
Diskless checkpointing is an important technique for providing fault tolerance in distributed or parallel computing systems. This study proposes a new approach that enhances neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called checkpoint storage nodes. In return, each processor uses simple XOR operations to store a collection of checkpoints for the processors for which it is a checkpoint storage node. This study defines a safe recovery criterion, which specifies the requirement for ensuring that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. This study further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these requirements. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
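The XOR idea behind the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it shows only the general principle that a storage node holding the XOR of several equal-length checkpoints can rebuild one missing checkpoint from the others. The function names, the `members` list, and the sample byte strings are all hypothetical, and the design of checkpoint storage node sets that tolerate multiple failures is not modeled here.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def build_parity(checkpoints: dict[int, bytes], members: list[int]) -> bytes:
    """XOR together the checkpoints of the processors a storage node serves."""
    return reduce(xor_bytes, (checkpoints[p] for p in members))


def recover(parity: bytes, checkpoints: dict[int, bytes],
            members: list[int], failed: int) -> bytes:
    """Rebuild the failed processor's checkpoint from the stored parity
    and the checkpoints of the surviving members."""
    survivors = (checkpoints[p] for p in members if p != failed)
    return reduce(xor_bytes, survivors, parity)


# Hypothetical data: three processors with 4-byte checkpoints.
checkpoints = {
    0: b"\x01\x02\x03\x04",
    1: b"\x10\x20\x30\x40",
    2: b"\xaa\xbb\xcc\xdd",
}
members = [0, 1, 2]  # assumed set of processors served by one storage node

parity = build_parity(checkpoints, members)
recovered = recover(parity, checkpoints, members, failed=1)
```

Because XOR is its own inverse, XORing the parity with every surviving checkpoint cancels those terms and leaves the missing one. A single parity block of this kind covers only one failure among its members.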
Diskless checkpointing, multiple failures, rollback recovery, XOR.
J. Chiu and G. Chiu, "A New Diskless Checkpointing Approach for Multiple Processor Failures," in IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 4, pp. 481-493, 2011.