2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Washington, DC, USA
May 1, 2018 to May 4, 2018
Cloud computing continues to gain popularity as key features such as scalability, pay-per-use and availability continue to evolve. It is also becoming a competitive platform for running high performance computing (HPC) and parallel applications, owing to the increasing performance of virtualized, highly available instances. However, migrating HPC applications to the cloud still requires native fault-tolerant solutions to fully leverage cloud features and maximize resource utilization at the best cost, particularly for long-running parallel applications, where faults can cause invalid states or data loss. Such faults require re-executing the application, which increases completion time and cost. We propose Resilience as a Service (RaaS), a fault-tolerant framework for HPC applications running in the cloud. In this paper, the RADIC architecture (Redundant Array of Distributed Independent Fault Tolerance Controllers) is used to provide clouds with a highly available, distributed and scalable fault-tolerant service. The paper explores how traditional HPC protection and recovery mechanisms must be redesigned to natively leverage cloud properties and the multiple alternatives the cloud offers for implementing rollback recovery protocols using virtual machines, containers, object and block storage, or database services. Results show that RaaS restores and completes the application execution using available resources while reducing overhead up to 8% for different fault-tolerant configuration alternatives.
cloud computing, fault tolerance, fault tolerant computing, parallel processing, protocols, virtual machines
J. Villamayor, D. Rexachs, E. Luque and D. Lugones, "RaaS: Resilience as a Service," 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, DC, USA, 2018, pp. 356-359.