Second IEEE International Conference on Engineering of Complex Computer Systems (ICECCS'96)
On Real-Time Quasi-Durable Checkpointing
Montreal, Canada
October 21-October 25
ISBN: 0-8186-7614-0
Checkpointing is a commonly used technique for fault-tolerant computing. However, most existing approaches focus on improving checkpointing reliability and performance. This study investigates real-time checkpointing techniques in the context of distributed process control applications, where checkpointing and recovery operations must meet timing constraints such as process deadlines and plant state validity. We introduce the notion of quasi-durability, which allows one to trade off storage device reliability against the timing constraints of process control and recovery. Based on this notion, we define recoverability as the expected validity of the internal plant state used for recovery in case of process failure. We study three protocols for real-time quasi-durable checkpointing and recovery. For each protocol, we analyze its recoverability and give the necessary and sufficient conditions for a set of devices to be feasible for checkpointing and recovery.
Index Terms:
Real-time fault-tolerance, real-time recovery, reliability, checkpointing, distributed process control
Citation:
J. Huang, P.J. Wan, V. Thomas, "On Real-Time Quasi-Durable Checkpointing," Second IEEE International Conference on Engineering of Complex Computer Systems (ICECCS'96), p. 331, 1996