2005 International Conference on Dependable Systems and Networks (DSN'05)
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
Yokohama, Japan
June 28-July 1, 2005
ISBN: 0-7695-2282-3
G. (John) Janakiraman, Hewlett-Packard Laboratories
Jose Renato Santos, Hewlett-Packard Laboratories
Dinesh Subhraveti, Hewlett-Packard Laboratories
Yoshio Turner, Hewlett-Packard Laboratories
We present a new distributed checkpoint-restart mechanism, Cruz, that works without requiring application, library, or base kernel modifications. This mechanism provides comprehensive support for checkpointing and restoring application state, both at user level and within the OS. Our implementation builds on Zap, a process migration mechanism, implemented as a Linux kernel module, which operates by interposing a thin layer between applications and the OS. In particular, we enable support for networked applications by adding migratable IP and MAC addresses, and checkpoint-restart of socket buffer state, socket options, and TCP state. We leverage this capability to devise a novel method for coordinated checkpoint-restart that is simpler than prior approaches. For instance, it eliminates the need to flush communication channels by exploiting the packet re-transmission behavior of TCP and existing OS support for packet filtering. Our experiments show that the overhead of coordinating checkpoint-restart is negligible, demonstrating the scalability of this approach.
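The claim that channels need not be flushed can be illustrated with a toy model. The sketch below is not Cruz's implementation and does not use real sockets or kernel packet filters: `Receiver`, `Sender`, and `demo` are hypothetical names, the "filter" is a boolean flag, and the reliable channel is a greatly simplified stand-in for TCP. It assumes only what the abstract states: traffic is dropped by a filter during the checkpoint, and the sender's retransmission later recovers whatever was dropped.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Receiver:
    """Toy endpoint with an on/off packet filter (hypothetical model)."""
    filtering: bool = False
    received: List[str] = field(default_factory=list)
    snapshot: Optional[List[str]] = None

    def deliver(self, seq: int, payload: str) -> bool:
        """Return True if the packet is acknowledged, False if it is lost."""
        if self.filtering:
            return False                     # the filter drops the packet silently
        if seq == len(self.received):
            self.received.append(payload)    # next expected packet: accept it
            return True
        return seq < len(self.received)      # duplicates are re-acknowledged


class Sender:
    """Toy reliable sender that keeps unacknowledged packets for retransmission."""

    def __init__(self, peer: Receiver) -> None:
        self.peer = peer
        self.next_seq = 0
        self.unacked: List[Tuple[int, str]] = []
        self.snapshot: Optional[Tuple[int, List[Tuple[int, str]]]] = None

    def send(self, payload: str) -> None:
        self.unacked.append((self.next_seq, payload))
        self.next_seq += 1
        self.retransmit()

    def retransmit(self) -> None:
        """Resend unacknowledged packets in order; keep those still unacked."""
        self.unacked = [(s, p) for s, p in self.unacked
                        if not self.peer.deliver(s, p)]


def demo() -> Tuple[List[str], List[str], List[str]]:
    rx = Receiver()
    tx = Sender(rx)
    tx.send("m0")                            # delivered before the checkpoint

    rx.filtering = True                      # step 1: block traffic, no flush
    tx.send("m1")                            # sent mid-checkpoint: dropped
    rx.snapshot = list(rx.received)          # step 2: save local state,
    tx.snapshot = (tx.next_seq, list(tx.unacked))  # including unacked data
    rx.filtering = False                     # step 3: resume communication

    tx.retransmit()                          # retransmission recovers "m1"
    after_resume = list(rx.received)

    # Restart from the saved state: the same retransmission repairs the channel.
    rx.received = list(rx.snapshot)
    tx.next_seq, tx.unacked = tx.snapshot[0], list(tx.snapshot[1])
    tx.retransmit()
    return rx.snapshot, after_resume, list(rx.received)


if __name__ == "__main__":
    print(demo())
```

In this model the saved state is consistent even though "m1" was in flight when the snapshot was taken: the message lives in the sender's saved unacknowledged buffer, so it is delivered exactly once both after a normal resume and after a restart from the snapshot.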
Citation:
G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti, and Yoshio Turner, "Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems," in 2005 International Conference on Dependable Systems and Networks (DSN'05), pp. 260-269, 2005.