Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
October 1996 (vol. 7 no. 10)
pp. 1035-1048

Abstract: A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during recovery from node failures, periodic collection of a consistent snapshot of the system (checkpoint) is required. Locating mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot, or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. If nodes take their local checkpoints independently, in an uncoordinated manner, each node may have to store multiple local checkpoints in stable storage. This is not suitable for mobile nodes because they have limited memory. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots of only those nodes that have directly or transitively affected the initiator since their last snapshots need to be taken. We prove that the global snapshot collection terminates within a finite time of its invocation and that the collected global snapshot is consistent. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of one or more nodes. Both algorithms have low communication and storage overheads and meet the low energy consumption and low bandwidth constraints of mobile computing systems.
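
The selection rule in the abstract (snapshot only the nodes that have directly or transitively affected the initiator since their last snapshots) can be illustrated with a small sketch. This is not the paper's algorithm, which is distributed and message-based; it is a centralized illustration under simplifying assumptions. The function name `nodes_to_checkpoint` and the `received_from` map (for each node, the set of nodes whose messages it has received since its own last local snapshot) are hypothetical names introduced here.

```python
from collections import deque


def nodes_to_checkpoint(initiator, received_from):
    """Return the set of nodes that would need a new local snapshot.

    received_from[n] is assumed (hypothetically) to hold the nodes whose
    messages n has received since n's last local snapshot.
    """
    must_checkpoint = {initiator}
    pending = deque([initiator])
    while pending:
        node = pending.popleft()
        # Every sender that affected this node is itself a participant,
        # and so are the senders that affected that sender, and so on.
        for sender in received_from.get(node, set()):
            if sender not in must_checkpoint:
                must_checkpoint.add(sender)
                pending.append(sender)
    return must_checkpoint


# Illustrative dependency map (made-up data): A heard from B, B heard from C,
# D heard from A, and E has exchanged no messages.
deps = {
    "A": {"B"},
    "B": {"C"},
    "C": set(),
    "D": {"A"},
    "E": set(),
}

participants = nodes_to_checkpoint("A", deps)
print(sorted(participants))
```

In this example, D is left out when A initiates: D depends on A, but D has not affected A. E is left out because it is unrelated to the initiator, which is the saving the abstract claims over schemes that force every node to take a snapshot.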

Index Terms:
Checkpointing, causal dependency, global snapshot, mobile computing systems, portable computers, recovery.
Ravi Prakash, Mukesh Singhal, "Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 10, pp. 1035-1048, Oct. 1996, doi:10.1109/71.539735