Process Recovery in Heterogeneous Systems
February 2003 (vol. 52 no. 2)
pp. 126-138

Abstract—Heterogeneous computing environments, where computers may have different instruction set architectures, data representations, and operating systems, complicate checkpointing and recovery of processes. This paper describes an approach to recovery and an implementation, PREACHES, that provides portable checkpointing of single-process applications in heterogeneous systems using checkpoint propagation. The checkpoint propagation mechanism creates machine-dependent checkpoints for different architectures in the heterogeneous environment. A process is restored on a specific machine with the checkpoint that is appropriate for the architecture. An implementation of PREACHES has been evaluated on a heterogeneous network of workstations, including Sun, HP, and Pentium machines. The experimental results show that PREACHES achieves efficient checkpointing and rapid recovery.
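
The checkpoint propagation idea in the abstract can be illustrated with a minimal sketch. It is not the PREACHES implementation. It assumes the process state is just a list of 32-bit integers and that the only difference between machines is byte order. The architecture labels, the byte order assigned to each label, and the function names `make_checkpoint`, `propagate`, and `restore` are all hypothetical.

```python
import struct

# Hypothetical architecture labels mapped to struct byte-order prefixes.
# The byte orders are assumptions for illustration only.
ARCH_BYTE_ORDER = {
    "sun": ">",      # assumed big-endian
    "hp": ">",       # assumed big-endian
    "pentium": "<",  # assumed little-endian
}

def make_checkpoint(state, arch):
    """Encode a list of 32-bit signed integers in the byte order of `arch`."""
    fmt = f"{ARCH_BYTE_ORDER[arch]}{len(state)}i"
    return struct.pack(fmt, *state)

def propagate(state):
    """Create one machine-dependent checkpoint per known architecture."""
    return {arch: make_checkpoint(state, arch) for arch in ARCH_BYTE_ORDER}

def restore(checkpoints, arch):
    """Rebuild the state from the checkpoint that matches `arch`."""
    data = checkpoints[arch]
    count = len(data) // 4  # each value occupies 4 bytes
    return list(struct.unpack(f"{ARCH_BYTE_ORDER[arch]}{count}i", data))

state = [1, 2, 3]
checkpoints = propagate(state)
recovered = restore(checkpoints, "pentium")
```

Because every checkpoint is already in the target machine's native layout, the restore step in this sketch needs no format conversion; the cost of handling heterogeneity is paid when the checkpoints are created.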

Index Terms:
Heterogeneous systems, portable checkpointing, rollback recovery, process migration.
Citation:
Kuo-Feng Ssu, W. Kent Fuchs, Hewijin C. Jiau, "Process Recovery in Heterogeneous Systems," IEEE Transactions on Computers, vol. 52, no. 2, pp. 126-138, Feb. 2003, doi:10.1109/TC.2003.1176981