This Article 
 Bibliographic References 
 Add to: 
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
January-March 2009 (vol. 6 no. 1)
pp. 32-44
Samir Jafar, University of Damascus, Damascus
Axel Krings, University of Idaho, Moscow
Thierry Gautier, INRIA, (Projet MOASIS), LIG, St. Martin
Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The source of the problems are node failures and the need for dynamic configuration over extensive run-time. This paper presents two fault-tolerance mechanisms called Theft Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both, benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multi-threaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small and the maximum work lost by a crashed process is small and bounded.

[1] L. Alvisi and K. Marzullo, “Message Logging: Pessimistic, Optimistic, Causal and Optimal,” IEEE Trans. Software Eng., vol. 24, no. 2, pp. 149-159, Feb. 1998.
[2] K. Anstreicher, N. Brixius, J.-P. Goux, and J. Linderoth, “Solving Large Quadratic Assignment Problems on Computational Grids,” Math. Programming, vol. 91, no. 3, 2002.
[3] R. Baldoni, “A Communication-Induced Checkpointing Protocol That Ensures Rollback-Dependency Trackability,” Proc. 27th Int'l Symp. Fault-Tolerant Computing (FTCS '97), p. 68, 1997.
[4] F. Baude, D. Caromel, C. Delb, and L. Henrio, “A Hybrid Message Logging-CIC Protocol for Constrained Checkpointability,” Proc. European Conf. Parallel Processing (EuroPar '05), pp. 644-653, 2005.
[5] G. Bosilca et al., “MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes,” Proc. ACM/IEEE Conf. Supercomputing (SC '02), Nov. 2002.
[6] A. Bouteiller et al., “MPICH-V2: A Fault Tolerant MPI for Volatile Nodes Based on the Pessimistic Sender Based Message Logging,” Proc. ACM/IEEE Conf. Supercomputing (SC '03), pp. 1-17, 2003.
[7] A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, “Coordinated Checkpoint versus Message Log for Fault Tolerant MPI,” Proc. Fifth IEEE Int'l Conf. Cluster Computing (Cluster '03), p.242, 2003.
[8] S. Chakravorty and L.V. Kale, “A Fault Tolerant Protocol for Massively Parallel Machines,” Proc. 18th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '04), p. 212a, 2004.
[9] K.M. Chandy and L. Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Trans. Computer Systems, vol. 3, no. 1, pp. 63-75, 1985.
[10] E.N. Elnozahy, L. Alvisi, Y.-M. Wang, and D.B. Johnson, “A Survey of Rollback-Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, Sept. 2002.
[11] M. Frigo, C.E. Leiserson, and K.H. Randall, “The Implementation of the Cilk-5 Multithreaded Language,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '98), pp.212-223, 1998.
[12] F. Galilée, J.-L. Roch, G. Cavalheiro, and M. Doreille, “Athapascan-1: On-Line Building Data Flow Graph in a Parallel Language,” Proc. Seventh Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '98), pp. 88-95, 1998.
[13] A Large Scale Nation-Wide Infrastructure for Grid Research, Grid5000, https:/, 2006.
[14] S. Jafar, A. Krings, T. Gautier, and J.-L. Roch, “Theft-Induced Checkpointing for Reconfigurable Dataflow Applications,” Proc. IEEE Electro/Information Technology Conf. (EIT '05), May 2005.
[15] S. Jafar, T. Gautier, A. Krings, and J.-L. Roch, “A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing,” Proc. European Conf. Parallel Processing (EuroPar '05), pp. 675-684, Aug.-Sept. 2005.
[16] A.W. Krings, J.-L. Roch, S. Jafar, and S. Varrette, “A Probabilistic Approach for Task and Result Certification of Large-Scale Distributed Applications in Hostile Environments,” Proc. European Grid Conf. (EGC '05), P. Sloot et al., eds., Feb. 2005.
[17] A.W. Krings, J.-L. Roch, and S. Jafar, “Certification of Large Distributed Computations with Task Dependencies in Hostile Environments,” Proc. IEEE Electro/Information Technology Conf. (EIT '05), May 2005.
[18] L. Lamport, M. Pease, and R. Shostak, “The Byzantine Generals Problem,” ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, July 1982.
[19] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, “Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System,” Technical Report CS-TR-97-1346, Univ. of Wisconsin, Madison, 1997.
[20] A. Nguyen-Tuong, A. Grimshaw, and M. Hyett, “Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System,” Proc. 15th Symp. Reliable Distributed Systems (SRDS '96), pp. 2-11, 1996.
[21] D.A. Patterson, G. Gibson, and R.H. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” Proc. ACM SIGMOD '88, pp. 109-116, 1988.
[22] D.K. Pradhan, Fault-Tolerant Computer System Design. Prentice Hall, 1996.
[23] B. Randell, “System Structure for Software Fault Tolerance,” Proc. Int'l Conf. Reliable Software, pp. 437-449, 1975.
[24] L. Sarmenta, “Sabotage-Tolerance Mechanisms for Volunteer Computing Systems,” Future Generation Computer Systems, vol. 18, no. 4, 2002.
[25] J. Silc, B. Robic, and T. Ungerer, “Asynchrony in Parallel Computing: from Dataflow to Multithreading,” Progress in Computer Research, pp. 1-33, 2001.
[26] G. Stellner, “CoCheck: Checkpointing and Process Migration for MPI,” Proc. 10th Int'l Parallel Processing Symp. (IPPS '96), pp. 526-531, Apr. 1996.
[27] R. Strom and S. Yemini, “Optimistic Recovery in Distributed Systems,” ACM Trans. Computer Systems, vol. 3, no. 3, pp. 204-226, 1985.
[28] V. Strumpen, “Portable and Fault-Tolerant Software Systems,” IEEE Micro, vol. 18, no. 5, pp. 22-32, Sept./Oct. 1998.
[29] P. Thambidurai and Y.-K. Park, “Interactive Consistency with Multiple Failure Modes,” Proc. Seventh Symp. Reliable Distributed Systems (SRDS '88), pp. 93-100, Oct. 1988.
[30] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. John Wiley & Sons, 2001.
[31] G. Wrzesinska, R. van Nieuwpoort, J. Maassen, and H.E. Bal, “Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid,” Proc. 19th Int'l Parallel and Distributed Processing Symp. (IPDPS '05), p. 13a, Apr. 2005.
[32] J.J. Wylie et al., “Selecting the Right Data Distribution Scheme for a Survivable Storage System,” Technical Report CMU-CS-01-120, Carnegie Mellon Univ., May 2001.
[33] G. Zheng, L. Shi, and L.V. Kalé, “FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI,” Proc. Sixth IEEE Int'l Conf. Cluster Computing (Cluster '04), pp. 93-103, Sept. 2004.

Index Terms:
Distributed architectures, Fault tolerance, Dataflow
Samir Jafar, Axel Krings, Thierry Gautier, "Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing," IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 1, pp. 32-44, Jan.-March 2009, doi:10.1109/TDSC.2008.17
Usage of this product signifies your acceptance of the Terms of Use.