14th International Conference on Distributed Computing Systems (1994)
June 21, 1994 to June 24, 1994
Cheng-Ru Young, Dept. of Electr. Eng. & Technol., Nat. Taiwan Inst. of Technol., Taipei, Taiwan
Ge-Ming Chiu, Dept. of Electr. Eng. & Technol., Nat. Taiwan Inst. of Technol., Taipei, Taiwan
In this paper we propose a new mechanism for implementing checkpoint/rollback-recovery in a distributed computing system. A logical-ring structure is introduced for the maintenance of recovery-related information. Message processing order of a process is maintained by all other processes on its associated ring. It requires no time-consuming operations of writing order information into stable storage. As a result, fail-free overhead is small. When failures occur, only failed processes have to roll back to their latest checkpoints. Surviving processes continue execution without being blocked. Output commit is fast as it needs no synchronization before a message is sent to the outside world.
message passing, system recovery, distributed processing, fault tolerant computing, software reliability
Cheng-Ru Young and Ge-Ming Chiu, "A crash recovery technique in distributed computing systems," 14th International Conference on Distributed Computing Systems (ICDCS), Poznan, Poland, 1994, pp. 235-242.