22nd International Symposium on Reliable Distributed Systems (SRDS'03)
Raptor: Integrating Checkpoints and Thread Migration for Cluster Management
Florence, Italy
October 6-8, 2003
ISBN: 0-7695-1955-5
Hazim Shafi, IBM Research
Evan Speight, Cornell University
John K. Bennett, University of Colorado
Software distributed shared memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. However, problems such as cluster component reliability and cluster management, which are not directly related to performance, need to be addressed before SDSM solutions can be widely adopted. This paper presents Raptor, an SDSM cluster management system based on checkpoint/recovery and thread migration. Raptor checkpoints decouple the runtime system and application data from application threads, allowing efficient load balancing, resource allocation, and rollback recovery. The system has two important features. First, it reduces checkpoint overhead by saving only application-specific data that cannot be recreated at recovery time. Second, by integrating thread migration capability both at runtime and at recovery time, it allows computing resources to be added to or removed from a running application while placing little or no additional burden on the SDSM application programmer.
Citation:
Hazim Shafi, Evan Speight, John K. Bennett, "Raptor: Integrating Checkpoints and Thread Migration for Cluster Management," srds, pp. 141, 22nd International Symposium on Reliable Distributed Systems (SRDS'03), 2003