Fault-Tolerant Computing, International Symposium on (1999)
June 15, 1999 to June 18, 1999
Lorenzo Alvisi, University of Texas at Austin
Sriram Rao, University of Texas at Austin
Syed Amir Husain, University of Texas at Austin
Asanka de Mel, University of Texas at Austin
Elmootazbellah Elnozahy, IBM
Communication-induced checkpointing (CIC) allows processes in a distributed computation to take independent checkpoints while avoiding the domino effect. This paper presents an analysis of CIC protocols based on a prototype implementation and validated simulations. Our results indicate that there is sufficient evidence to suspect that much of the conventional wisdom about these protocols is questionable.
Checkpointing, Rollback Recovery, Performance Evaluation, MPI, Consistent Global States
L. Alvisi, A. de Mel, S. A. Husain, E. Elnozahy and S. Rao, "An Analysis of Communication-Induced Checkpointing," Fault-Tolerant Computing, International Symposium on (FTCS), Madison, Wisconsin, 1999, pp. 242.