2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) (2018)
Washington, DC, USA
May 1, 2018 to May 4, 2018
The mean time between failures of upcoming exascale systems is expected to be one hour or less. To be able to successfully complete execution of applications in such scenarios, several improved checkpoint/restart mechanisms such as multi-level checkpointing are being developed. Today, resource management systems handle job interruptions due to node failures by restarting the affected job from a checkpoint on a fresh set of nodes. This method, however, will add non-negligible overhead and will not allow taking full advantage of multi-level checkpointing in future systems. Alternatively, some spare nodes can be allocated for each job so that only processes that die on the failed nodes need to be restarted on spare nodes. However, given the magnitude of the expected failure rates, the number of spare nodes to be allocated for each job would be high, causing significant resource wastage. This work proposes a dynamic way handling node failures by enabling on-the-fly replacement of failed nodes with healthy ones. We propose a dynamic node replacement algorithm that finds replacement nodes by utilizing the flexibility of moldable and malleable jobs. Our evaluation with a simulator shows that this approach can maintain high throughput even when a system is experiencing frequent node failures, thereby making it a perfect technique to complement multi-level checkpointing.
checkpointing, resource allocation, software fault tolerance
S. Prabhakaran, M. Neumann and F. Wolf, "Efficient Fault Tolerance Through Dynamic Node Replacement," 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, DC, USA, 2018, pp. 163-172.