Issue No.06 - June (2013 vol.62)
pp: 1269-1275
Mohamed-Slim Bouguerra, INRIA, Grenoble
Denis Trystram, Grenoble Institute of Technology, Grenoble
Frédéric Wagner, Grenoble Institute of Technology, Grenoble
Today's parallel computing platforms are increasingly large and therefore increasingly subject to failures. Consequently, it is necessary to develop efficient strategies that provide safe and reliable completion of HPC parallel applications. Checkpointing is one of the most popular and efficient techniques for developing fault-tolerant applications in such a context. However, checkpoint operations are costly in terms of time, computation, and network communication, which affects the global performance of the application. In this work, we propose a performance model that formally expresses the checkpoint scheduling problem. This model exhibits the tradeoff between the overhead of checkpoint operations and the computation lost due to failures. Based on this model, we study the computational complexity of scheduling checkpoints with variable costs for general failure distributions. More precisely, we provide a new computational complexity analysis that makes explicit the relations between the probabilistic failure model, the checkpoint cost, and the computational model. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of a uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem.
Program processors, Computational modeling, Checkpointing, Processor scheduling, History, Maintenance engineering, Optimization, Failure detection, Fault tolerance, Checkpoint scheduling
Mohamed-Slim Bouguerra, Denis Trystram, Frédéric Wagner, "Complexity Analysis of Checkpoint Scheduling with Variable Costs", IEEE Transactions on Computers, vol.62, no. 6, pp. 1269-1275, June 2013, doi:10.1109/TC.2012.57
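The abstract mentions a dynamic programming scheme for choosing checkpoint times when checkpoint costs vary. The sketch below is not the authors' algorithm. It illustrates the general idea in a deliberately simplified setting that is assumed here for illustration only: a chain of tasks, checkpoints allowed only at task boundaries, exponentially distributed failures with rate `rate`, instantaneous restart from the last checkpoint, and a mandatory checkpoint after the last task. Under these assumptions the expected time to complete a segment of length L (work plus checkpoint) without interruption is (e^(rate·L) − 1)/rate, and segments are independent because the exponential distribution is memoryless. All function and parameter names are hypothetical.

```python
import math


def expected_segment_time(work, ckpt_cost, rate):
    """Expected time to finish `work` plus one checkpoint of cost `ckpt_cost`.

    Assumes exponential failures with the given rate, restart from the
    beginning of the segment, and zero restart cost (illustrative model).
    """
    return math.expm1(rate * (work + ckpt_cost)) / rate


def plan_checkpoints(durations, ckpt_costs, rate):
    """Return (expected makespan, 1-based task indices followed by a checkpoint).

    durations[k] is the length of task k+1; ckpt_costs[k] is the cost of a
    checkpoint taken right after task k+1 (costs may differ per task).
    """
    n = len(durations)
    prefix = [0.0]
    for d in durations:
        prefix.append(prefix[-1] + d)

    # best[j]: minimal expected time to finish tasks 1..j with a checkpoint after j
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # i = position of the previous checkpoint (0 = start)
            work = prefix[j] - prefix[i]
            cand = best[i] + expected_segment_time(work, ckpt_costs[j - 1], rate)
            if cand < best[j]:
                best[j] = cand
                prev[j] = i

    # Walk back through the stored predecessors to recover the checkpoint positions
    points = []
    j = n
    while j > 0:
        points.append(j)
        j = prev[j]
    points.reverse()
    return best[n], points
```

For example, `plan_checkpoints([10, 10], [0.0, 0.0], 1.0)` places a checkpoint after each task, because a single 20-unit segment is very unlikely to finish without a failure at that rate. The double loop costs O(n²) evaluations, which is only possible because the model above is so restricted; it says nothing about the complexity of the general problem studied in the paper.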