Subscribe

Issue No.02 - February (2009 vol.20)

pp: 180-190

Maria Chtepen , Ghent University - IBBT, Gent

Filip H.A. Claeys , CTO MOSTforWATER N.V., Belgium

Bart Dhoedt , Ghent University - IBBT, Gent

Filip De Turck , Ghent University - IBBT, Gent

Piet Demeester , Ghent University - IBBT, Gent

Peter A. Vanrolleghem , Université Laval, Québec

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2008.93

ABSTRACT

A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the abovementioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

INDEX TERMS

Distributed systems, performance of systems, fault tolerance, availability.

CITATION

Maria Chtepen, Filip H.A. Claeys, Bart Dhoedt, Filip De Turck, Piet Demeester, Peter A. Vanrolleghem, "Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids",

*IEEE Transactions on Parallel & Distributed Systems*, vol.20, no. 2, pp. 180-190, February 2009, doi:10.1109/TPDS.2008.93REFERENCES

- [1] M. Chtepen, F. Claeys, B. Dhoedt, F. De Turck, P. Vanrolleghem, and P. Demeester, “Dynamic Scheduling of Computationally Intensive Applications on Unreliable Infrastructures,”
Proc. Second European Modeling and Simulation Symp. (EMSS '06), Oct. 2006.- [2] D. Feitelson,
Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallelworkload /, 2008.- [3] B. Schroeder and G. Gibson, “A Large-Scale Study of Failures in High-Performance-Computing Systems,”
Proc. Int'l Conf. Dependable Systems and Networks (DSN '06), June 2006.- [4] S. Hwang and C. Kesselman, “A Flexible Framework for Fault Tolerance in the Grid,”
J. Grid Computing, vol. 1, no. 3, pp. 251-272, Sept. 2003.- [5] A. Subbiah and D. Blough, “Distributed Diagnosis in Dynamic Fault Environments,”
Parallel and Distributed Systems, vol. 15, no. 5, pp. 453-467, 2004.- [6] Y. Derbal, “A New Fault-Tolerance Framework for Grid Computing,”
Multiagent and Grid Systems, vol. 2, no. 2, pp. 115-133, 2006.- [7] Y. Zhang, M. Squillante, A. Sivasubramaniam, and R. Sahoo, “Performance Implications of Failures in Large-Scale Cluster Scheduling,”
Proc. 10th Workshop Job Scheduling Strategies for Parallel Processing (JSSPP '04), pp. 233-252, 2004.- [8] A. Oliner and J. Stearley, “What Supercomputers Say: A Study of Five System Logs,”
Proc. 37th Ann. IEEE/IFIP Int'l Conf. Dependable Systems and Networks (DSN '07), pp. 575-584, June 2007.- [9] A. Dogan and F. Osgunger, “Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing,”
Parallel and Distributed Systems, vol. 13, no. 3, pp. 308-323, 2002.- [10] D. Silva, W. Cirne, and F. Brasileiro, “Trading Cycles for Information: Using Replication to Schedule Bag-of-Tasks Applications on Computational Grids,”
Proc. Int'l Conf. Parallel and Distributed Computing (Euro-Par '03), pp. 169-180, Aug. 2003.- [11] R. De Camargo, A. Goldchleger, F. Kon, and A. Goldman, “Checkpointing-Based Rollback Recovery for Parallel Applications on the InteGrade Grid Middleware,”
Proc. Second Workshop Middleware for Grid Computing (MGC '04), pp. 35-40, 2004.- [12] A. Oliner, R. Sahoo, J. Moreira, and M. Gupta, “Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems,”
Proc. 19th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '05), Apr. 2005.- [13] Y. Li and M. Mascagni, “Improving Performance via Computational Replication on a Large-Scale Computational Grid,”
Proc. Third Int'l Symp. Cluster Computing and the Grid (CCGrid '03), May 2003.- [14] C. Bossie and P. Fiorini, “On Checkpointing and Heavy-Tails in Unreliable Computing Environments,”
SIGMETRICS Performance Evaluation Rev., vol. 34, no. 2, pp. 13-15, 2006.- [15] S. Agarwal, R. Garg, M. Gupta, and J. Moreira, “Adaptive Incremental Checkpointing for Massively Parallel Systems,”
Proc. 18th Ann. Int'l Conf. Supercomputing (SC '04), Nov. 2004.- [16] S. Chakravorty and L. Kale, “A Fault Tolerance Protocol with Fast Fault Recovery,”
Proc. IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '07), Mar. 2007.- [17] J. Young, “A First Order Approximation to the Optimum Checkpoint Interval,”
Comm. ACM, vol. 17, no. 9, pp. 530-531, Sept. 1974.- [18] E. Gelenbe, “On the Optimum Checkpoint Interval,”
J. ACM, vol. 26, no. 2, pp. 259-270, Apr. 1979.- [19] A. Tantawi and M. Ruschitzka, “Performance Analysis of Checkpointing Strategies,”
ACM Trans. Computer Systems, vol. 2, no. 2, pp. 123-144, May 1984.- [20] T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, “Min-Max Checkpoint Placement under Incomplete Failure Information,”
Proc. Int'l Conf. Dependable Systems and Networks (DSN '04), June-July 2004.- [21] A. Oliner, L. Rudolph, and R. Sahoo, “Cooperative Checkpointing: A Robust Approach to Large-Scale Systems Reliability,”
Proc. 20th Ann. Int'l Conf. Supercomputing (SC '06), June-July 2006.- [22] A. Oliner and R. Sahoo, “Evaluating Cooperative Checkpointing for Supercomputing Systems,”
Proc. 20th Int'l Parallel and Distributed Processing Symp. (IPDPS '06), Apr. 2006.- [23] Y. Xiang, Z. Li, and H. Chen, “Optimizing Adaptive Checkpointing Schemes for Grid Workflow Systems,”
Proc. Fifth Int'l Conf. Grid and Cooperative Computing (GCC '06), Oct. 2006.- [24] P. Katsaros, L. Angelis, and C. Lazos, “Performance and Effectiveness Trade-Off for Checkpointing in Fault-Tolerant Distributed Systems,”
Concurrency and Computation: Practice and Experience, vol. 19, no. 1, pp. 37-63, 2007.- [25] Y. Li and Z. Lan, “Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid,”
Proc. TeraGrid Conf., June 2007.- [26] C. Hou and K. Shin, “Replication and Allocation of Task Modules in Distributed Real-Timesystems,”
Proc. 24th Int'l Symp. Fault-Tolerant Computing (FTCS '94), June 1994.- [27] S. Choi, M. Baik, J. Gil, C. Park, S. Jung, and C. Hwang, “Group-Based Dynamic Computational Replication Mechanism in Peer-to-Peer Grid Computing,”
Proc. Sixth IEEE Int'l Symp. Cluster Computing and the Grid (CCGRID '06), May 2006.- [28] A. Ziv and J. Bruck, “Performance Optimization of Checkpointing Schemes with Task Duplication,”
IEEE Trans. Computers, vol. 46, no. 12, pp. 1381-1386, Dec. 1997.- [29] D. Pradhan and N. Vaidya, “Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture,”
IEEE Trans. Computers, vol. 43, no. 10, pp. 1163-1174, Oct. 1994.- [30] M. Hajdukovic, Z. Suvajdzin, Z. Zivanov, and E. Hodzic, “A Problem of Program Execution Time Measurement,”
Novi Sad J.Math., vol. 33, no. 1, pp. 67-73, 2003.- [31] R. Buyya and M. Murshed, “GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing,”
J. Concurrency and Computation: Practice and Experience, vol. 14, nos. 13-15,Wiley, Nov.-Dec. 2002.- [32] A. Legrand, L. Marchal, and H. Casanova, “Scheduling Distributed Applications: The SimGrid Simulation Framework,”
Proc. Third Int'l Symp. Cluster Computing and the Grid (CCGrid '03), May 2003.- [33] P. Thysebaert, B. Volckaert, F. De Turck, B. Dhoedt, and P. Demeester, “Evaluation of Grid Scheduling Strategies through NSGrid: A Network-Aware Grid Simulator,”
J. Neural, Parallel and Scientific Computations, special issue on grid computing, vol. 12, no. 3, pp. 353-378, 2004.- [34] F. Kelly, “Charging and Rate Control for Elastic Traffic,”
European Trans. Telecomm., vol. 8, pp. 33-37, 1997.- [35] H. Casanova and L. Marchal, “A Network Model for Simulation of Grid Application,” technical report, École Normale Supérieure de Lyon, Laboratoire de l'Informatique du Parallélisme, 2002.
- [36] U. Lublin and D. Feitelson, “The Workload on Parallel Supercomputers: Modeling the Characteristics of Rigid Jobs,”
Parallel and Distributed Computing, vol. 63, no. 11, pp. 1105-1122, Nov. 1992, 2003. |