Recovering from Multiple Process Failures in the Time Warp Mechanism
December 1992 (vol. 41 no. 12)
pp. 1504-1514

A recovery protocol is described for distributed systems that use the time warp control mechanism. The proposed protocol tolerates multiple process failures. Time warp is an optimistic execution technique in which synchronization is achieved through rollback. The recovery protocol exploits the redundancy that time warp already maintains to implement process rollback. As a result, the protocol adds little bookkeeping overhead, in contrast to many other recovery protocols.
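
To make the rollback idea concrete, the following is a minimal sketch of a single time warp process, not the recovery protocol of the paper. It assumes a toy, order-sensitive event handler and saves one state snapshot per processed event; the class name `TimeWarpProcess` and all of its fields are hypothetical. Anti-messages, global virtual time, and failure recovery are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class TimeWarpProcess:
    """Toy time warp process (illustrative names, not from the paper)."""
    state: int = 0
    lvt: int = 0                                    # local virtual time
    processed: list = field(default_factory=list)   # (timestamp, value), in execution order
    snapshots: list = field(default_factory=list)   # (lvt, state) saved before each event
    rollbacks: int = 0

    def _execute(self, ts, value):
        # Save the state first so this event can be undone later.
        self.snapshots.append((self.lvt, self.state))
        # Assumed handler: order-sensitive, so a wrong order gives a wrong state.
        self.state = self.state * 2 + value
        self.lvt = ts
        self.processed.append((ts, value))

    def receive(self, ts, value):
        redo = []
        if ts < self.lvt:
            # Straggler: undo every event with a later timestamp.
            self.rollbacks += 1
            while self.processed and self.processed[-1][0] > ts:
                redo.append(self.processed.pop())
                self.lvt, self.state = self.snapshots.pop()
        self._execute(ts, value)
        # Re-execute the undone events in timestamp order.
        for event in sorted(redo):
            self._execute(*event)


if __name__ == "__main__":
    p = TimeWarpProcess()
    for ts, value in [(10, 1), (30, 3), (20, 2)]:   # timestamp 20 arrives late
        p.receive(ts, value)
    print(p.state, p.rollbacks)  # prints "11 1", same state as in-order execution
```

The saved snapshots and the list of processed events are the kind of redundancy the abstract refers to: the process keeps them for synchronization anyway, which is why a recovery scheme built on them can get by with little extra bookkeeping.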

Index Terms:
multiple process failures; recovery protocol; distributed systems; time warp control mechanism; fault tolerant; optimistic execution; synchronization; redundancy; process rollback; distributed algorithms; distributed processing; fault tolerant computing; protocols.
Citation:
D. Agrawal, J.R. Agre, "Recovering from Multiple Process Failures in the Time Warp Mechanism," IEEE Transactions on Computers, vol. 41, no. 12, pp. 1504-1514, Dec. 1992, doi:10.1109/12.214658