2008 IEEE Fourth International Conference on eScience (2008)
Indianapolis, IN
Dec. 7, 2008 to Dec. 12, 2008
Large scientific computing clusters require a distributed dependability subsystem that provides fault isolation and recovery and can learn from and predict failures, in order to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for the large computing clusters used in the Lattice Quantum Chromodynamics project at Fermilab.
Cluster Computing, Reliability, Workflows, Software fault-tolerance
