Issue No.06 - June (2013 vol.62)

pp: 1269-1275

Mohamed-Slim Bouguerra, INRIA, Grenoble

Denis Trystram, Grenoble Institute of Technology, Grenoble

Frédéric Wagner, Grenoble Institute of Technology, Grenoble

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2012.57

ABSTRACT

The parallel computing platforms available today are increasingly large and thus increasingly subject to failures. Consequently, it is necessary to develop efficient strategies that provide safe and reliable completion for HPC parallel applications. Checkpointing is one of the most popular and efficient techniques for developing fault-tolerant applications in such a context. However, checkpoint operations are costly in terms of time, computation, and network communication, which affects the global performance of the application. In this work, we propose a performance model that formally expresses the checkpoint scheduling problem. This model exhibits the tradeoff between the overhead of checkpoint operations and the computation lost due to failures. Based on this model, we study the computational complexity of the problem of scheduling checkpoints with variable costs for general failure distributions. More precisely, we provide a new computational complexity analysis that makes explicit the relations between the probabilistic failure model, the checkpoint cost, and the computational model. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of a uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem.
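To make the tradeoff in the abstract concrete, the following is a minimal sketch of a dynamic program for placing checkpoints with variable costs. It is not the scheme from the paper: it assumes a much simpler setting, namely a linear chain of tasks, exponentially distributed failures with rate `rate`, no downtime or recovery cost, and a mandatory checkpoint after the last task. Under those assumptions the expected time to complete an uninterrupted segment of length x is (e^(rate·x) − 1)/rate, and the memoryless property lets segments be optimized independently. All names (`expected_segment_time`, `schedule_checkpoints`, `work`, `ckpt_cost`) are hypothetical.

```python
import math


def expected_segment_time(length, rate):
    """Expected time to complete `length` units of work plus checkpoint.

    Assumes exponential failures with the given rate, a restart of the
    whole segment after each failure, and no downtime or recovery cost.
    """
    if rate == 0:
        return length
    return math.expm1(rate * length) / rate


def schedule_checkpoints(work, ckpt_cost, rate):
    """Return (expected makespan, checkpoint positions) for a task chain.

    work[k]      : duration of task k
    ckpt_cost[k] : cost of checkpointing right after task k (variable costs)
    Positions are 1-based: position j means "checkpoint after task j".
    A checkpoint after the last task is assumed to be mandatory.
    """
    n = len(work)
    prefix = [0.0]
    for w in work:
        prefix.append(prefix[-1] + w)

    # best[j] = minimal expected time to finish tasks 1..j with a
    # checkpoint taken right after task j.
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            # Last segment runs tasks i+1..j, then checkpoints after task j.
            segment = prefix[j] - prefix[i] + ckpt_cost[j - 1]
            candidate = best[i] + expected_segment_time(segment, rate)
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk the predecessor links back to recover the checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

With a failure rate of zero the sketch places only the final checkpoint, since every extra checkpoint is pure overhead. As the rate grows, splitting the chain becomes worthwhile because the expected cost of a segment grows exponentially with its length. The paper's results concern general failure distributions, where this independence between segments does not hold in general.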

INDEX TERMS

Program processors, Computational modeling, Checkpointing, Processor scheduling, History, Maintenance engineering, Optimization, failure detection, Fault tolerance, checkpoint scheduling

CITATION

Mohamed-Slim Bouguerra, Denis Trystram, Frédéric Wagner, "Complexity Analysis of Checkpoint Scheduling with Variable Costs", *IEEE Transactions on Computers*, vol. 62, no. 6, pp. 1269-1275, June 2013, doi:10.1109/TC.2012.57