Issue No. 06 - June (2013 vol. 62)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2012.57
Mohamed-Slim Bouguerra, INRIA, Grenoble
Denis Trystram, Grenoble Institute of Technology, Grenoble
Frédéric Wagner, Grenoble Institute of Technology, Grenoble
The parallel computing platforms available today are increasingly large and thus increasingly subject to failures. Consequently, it is necessary to develop efficient strategies that provide safe and reliable completion for parallel HPC applications. Checkpointing is one of the most popular and efficient techniques for developing fault-tolerant applications in this context. However, checkpoint operations are costly in terms of time, computation, and network communication, which affects the global performance of the application. In this work, we propose a performance model that formally expresses the checkpoint scheduling problem. This model exhibits the tradeoff between the overhead of checkpoint operations and the computation lost due to failures. Based on this model, we study the computational complexity of scheduling checkpoints with variable costs for general failure distributions. More precisely, we provide a new computational complexity analysis that makes explicit the relations between the probabilistic failure model, the checkpoint cost, and the computational model. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of a uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem.
Program processors, Computational modeling, Checkpointing, Processor scheduling, History, Maintenance engineering, Optimization, failure detection, Fault tolerance, checkpoint scheduling
D. Trystram, M. Bouguerra and F. Wagner, "Complexity Analysis of Checkpoint Scheduling with Variable Costs," in IEEE Transactions on Computers, vol. 62, no. 6, pp. 1269-1275, 2013.