2005 IEEE International Conference on Cluster Computing
Job-Site Level Fault Tolerance for Cluster and Grid environments
Burlington, MA
September 27-September 30
ISBN: 0-7803-9485-2
K. Limaye, Louisiana Tech Univ., Ruston, LA
B. Leangsuksun, Louisiana Tech Univ., Ruston, LA
Z. Greenwood, Louisiana Tech Univ., Ruston, LA
To adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. Fault tolerance in distributed systems is commonly achieved with checkpoint-recovery or with job replication on alternative resources in case of a system outage. The first approach depends on the system's mean time to repair (MTTR), while the latter depends on the availability of alternative sites to run replicas. There is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques abandon a failed system and seek alternative resources. Results from an implementation of our method in Globus-enabled HA-OSCAR suggest a sizable aggregate performance improvement. The technique, called "smart failover", provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically and on critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state.
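The state-saving behaviour described in the abstract can be illustrated with a minimal sketch. It is not the authors' HA-OSCAR implementation: the class names (`JobState`, `BackupServer`, `PrimaryJobManager`), the tick-based `sync_interval`, the integer `progress` field, and the choice to treat job submission as a "critical event" are all assumptions made for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class JobState:
    job_id: str
    status: str        # e.g. "queued", "running", "done"
    progress: int = 0  # hypothetical marker of the last saved unit of work


@dataclass
class BackupServer:
    # Most recent copy of the primary's job queue.
    snapshot: dict = field(default_factory=dict)

    def receive(self, states):
        # Store an independent copy so later changes on the primary
        # do not leak into the saved snapshot.
        self.snapshot = copy.deepcopy(states)

    def take_over(self):
        # On failover, restart every unfinished job from its last saved state.
        return [
            JobState(s.job_id, "restarted", s.progress)
            for s in self.snapshot.values()
            if s.status != "done"
        ]


class PrimaryJobManager:
    def __init__(self, backup, sync_interval=3):
        self.queue = {}                     # local job-manager queue
        self.backup = backup
        self.sync_interval = sync_interval  # assumed: sync every N ticks
        self._ticks = 0

    def submit(self, job_id):
        self.queue[job_id] = JobState(job_id, "queued")
        self.sync()  # assumed critical event: a new job must not be lost

    def update(self, job_id, status, progress, critical=False):
        job = self.queue[job_id]
        job.status, job.progress = status, progress
        if critical:
            self.sync()

    def tick(self):
        # Periodic transfer of job states to the backup server.
        self._ticks += 1
        if self._ticks % self.sync_interval == 0:
            self.sync()

    def sync(self):
        self.backup.receive(self.queue)


if __name__ == "__main__":
    backup = BackupServer()
    primary = PrimaryJobManager(backup, sync_interval=3)
    primary.submit("job-1")
    primary.update("job-1", "running", 5)
    for _ in range(3):
        primary.tick()
    primary.update("job-1", "running", 7)  # not yet synced when the primary fails
    print(backup.take_over())
```

In this sketch, work done after the last transfer is lost on failover, which is why the abstract pairs periodic transfers with transfers on critical events: the shorter the gap between syncs, the less work a restarted job has to repeat.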
Index Terms:
smart failover, fault tolerance, cluster environment, grid computing, distributed system, checkpoint-recovery, job replication, job-site recovery, Beowulf cluster, Globus, HA-OSCAR
Citation:
K. Limaye, B. Leangsuksun, Z. Greenwood, S.L. Scott, C. Engelmann, R. Libby, K. Chanchio, "Job-Site Level Fault Tolerance for Cluster and Grid environments," cluster, pp.1-9, 2005 IEEE International Conference on Cluster Computing, 2005