2011 IEEE 13th International Symposium on High-Assurance Systems Engineering (2011)
Boca Raton, Florida USA
Nov. 10, 2011 to Nov. 12, 2011
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/HASE.2011.58
Scientific Workflow Management Systems (S-WFMS), such as Kepler, have proven to be important tools in scientific problem solving. Interestingly, fault tolerance and failure recovery in S-WFMS are still open topics. Current approaches often involve classic fault-tolerance mechanisms, such as alternative versions and rollback with re-runs, together with reliance on the fault-tolerance capabilities provided by subcomponents and lower layers, such as schedulers, Grid and cloud resources, or the underlying operating systems. When failures occur at the underlying layers, a workflow system sees them as failed steps in the process, but frequently without additional detail. This limits the ability of an S-WFMS to recover from failures. We describe a lightweight end-to-end S-WFMS fault-tolerance framework, developed to handle failure patterns that occur in some real-life scientific workflows. Capabilities and limitations of the framework are discussed and assessed using simulations. The results show that the solution considerably increases workflow reliability and execution time stability.
Scientific workflows, fault-tolerance, end-to-end framework, Kepler
P. A. Mouallem and M. A. Vouk, "On High-Assurance Scientific Workflows," 2011 IEEE 13th International Symposium on High-Assurance Systems Engineering (HASE), Boca Raton, Florida USA, 2011, pp. 73-82.