Cluster Computing and the Grid, IEEE International Symposium on (2009)
May 18, 2009 to May 21, 2009
Complex scientific workflows are now increasingly executed on computational grids. In addition to the challenges of managing and scheduling these workflows, reliability challenges arise because of the unreliable nature of large-scale grid infrastructure. Fault tolerance mechanisms such as over-provisioning and checkpoint-recovery are used in current grid application management systems to address these reliability challenges. In this work, we propose new approaches that combine these fault tolerance techniques with existing workflow scheduling algorithms. We present a study of the effectiveness of the combined approaches, analyzing their impact on the reliability of workflow execution, workflow performance, and resource usage under different reliability models, failure prediction accuracies, and workflow application types.
Grid Computing, Fault Tolerant, Workflow Scheduling, Performance Modeling
C. Koelbel, K. Cooper, A. Mandal and Y. Zhang, "Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids," Cluster Computing and the Grid, IEEE International Symposium on (CCGRID), Shanghai, China, 2009, pp. 244-251.