7th IEEE International Conference and Workshop on the Engineering of Computer Based Systems
An Algorithm for Tolerating Crash Failures in Distributed Systems
Edinburgh, Scotland
April 03-April 07
ISBN: 0-7695-0604-6
Vincenzo de Florio, Katholieke Universiteit at Leuven
Geert Deconinck, Katholieke Universiteit at Leuven
Rudy Lauwereins, Katholieke Universiteit at Leuven
In the framework of the ESPRIT project 28620 “TIRAN” (tailorable fault tolerance frameworks for embedded applications), a toolset of error detection, isolation, and recovery components is being designed to serve as a basic means for orchestrating application-level fault tolerance. These tools will be used either as stand-alone components or as the peripheral components of a distributed application that we call “the backbone”.

The backbone is to run in the background of the user application. Its objectives include (1) gathering and maintaining error detection information produced by TIRAN components such as watchdog timers and trap handlers, or by external detection services working at kernel or driver level, and (2) using this information at error recovery time. In particular, the TIRAN tools related to error detection and fault masking will forward their deductions to the backbone, which in turn will use this information to orchestrate error recovery by requesting recovery and reconfiguration actions from the tools related to error isolation and recovery.

Clearly, a key point in this approach is guaranteeing that the backbone itself tolerates internal and external faults. In this article we describe one of the means used within the TIRAN backbone to fulfill this goal: a distributed algorithm for tolerating crash failures triggered by faults affecting at most all but one of the components of the backbone, or at most all but one of the nodes of the system. We call this the algorithm of mutual suspicion.
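
The abstract names the algorithm of mutual suspicion but does not give its steps. As a purely illustrative sketch, and not the authors' algorithm, the Python below shows a generic timeout-based scheme in which a component records "I am alive" messages from its peers and suspects any peer that has stayed silent for too long. The class `SuspicionTable`, its method names, the peer labels, and the timeout value are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SuspicionTable:
    """Hypothetical helper: remembers when each peer was last heard from."""

    timeout: float  # seconds of silence after which a peer is suspected
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record an "I am alive" message received from `peer` at time `now`.
        self.last_seen[peer] = now

    def suspected(self, now: float) -> list:
        # A peer is suspected of having crashed when its last message
        # is older than the timeout. Suspicion is not proof of a crash.
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Usage: two peers report at t=0; only "node-a" reports again at t=2.5.
table = SuspicionTable(timeout=2.0)
table.heartbeat("node-a", now=0.0)
table.heartbeat("node-b", now=0.0)
table.heartbeat("node-a", now=2.5)
suspects = table.suspected(now=3.0)
```

Time is passed in explicitly rather than read from a clock so that the sketch stays deterministic. A real deployment would also need a policy for what to do once a peer is suspected, which the abstract attributes to the backbone's recovery logic.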
Index Terms:
Fault Tolerance, Software-Implemented Fault Tolerance, Distributed Systems, Distributed Algorithms
Citation:
Vincenzo de Florio, Geert Deconinck, Rudy Lauwereins, "An Algorithm for Tolerating Crash Failures in Distributed Systems," ecbs, pp.9, 7th IEEE International Conference and Workshop on the Engineering of Computer Based Systems, 2000