19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance
Denver, Colorado
April 04-April 08
ISBN: 0-7695-2312-9
José Carlos Sancho, Los Alamos National Laboratory, NM
Fabrizio Petrini, Los Alamos National Laboratory, NM
Kei Davis, Los Alamos National Laboratory, NM
Roberto Gioiosa, Los Alamos National Laboratory, NM
Song Jiang, Los Alamos National Laboratory, NM
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, in current practice, implementations can be at user level or system level. User-level implementations are relatively easy to implement and portable, but suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault-tolerant, and ultimately autonomic, large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level checkpoint/restart mechanisms for Linux is still in its infancy, with all extant implementations exhibiting serious deficiencies for achieving transparent fault tolerance. This paper provides a survey of extant implementations in a natural taxonomy, highlighting their strengths and inherent weaknesses.
Index Terms:
Fault tolerance, checkpoint/restart, autonomic computing, Linux
Citation:
José Carlos Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa, Song Jiang, "Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance," ipdps, vol. 19, pp. 300b, 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18, 2005