Issue No.05 - May (2012 vol.23)
pp: 890-901
Kassian Plankensteiner, University of Innsbruck, Innsbruck
We propose a new heuristic called Resubmission Impact to support fault-tolerant execution of scientific workflows in heterogeneous parallel and distributed computing environments. In contrast to related approaches, our method can be used effectively in new or unfamiliar environments, even in the absence of historical executions or failure trace models. On top of this method, we propose a dynamic enactment and rescheduling heuristic that executes workflows with a high degree of fault tolerance while taking soft deadlines into account. Simulated experiments with three real-world workflows in the Austrian Grid demonstrate that our method significantly reduces resource waste compared to conservative task replication and resubmission techniques, with a comparable makespan and only a slight decrease in success probability. The dynamic enactment method, in turn, meets soft deadlines in faulty environments without historical failure trace information or models.
Scientific workflows, fault tolerance, scheduling, cloud computing, grid computing.
Kassian Plankensteiner, "Meeting Soft Deadlines in Scientific Workflows Using Resubmission Impact", IEEE Transactions on Parallel & Distributed Systems, vol. 23, no. 5, pp. 890-901, May 2012, doi:10.1109/TPDS.2011.221