Virtual Checkpoints: Architecture and Performance
May 1992 (vol. 41 no. 5)
pp. 516-525

Checkpoint and rollback recovery is a technique that allows a system to tolerate a failure by periodically saving its entire state and, if an error is detected, rolling back to the most recent checkpoint. A technique that embeds support for checkpoint and rollback recovery directly into the virtual memory translation hardware is presented. The scheme is general enough to be applied to various scopes of data, such as a portion of an address space, a single address space, or multiple address spaces, and it can provide a high-performance implementation of checkpoint and rollback recovery. The performance of the scheme is analyzed using a trace-driven simulation. The overhead is a function of the interval between checkpoints and becomes very small for intervals greater than 10^6 references. However, the scheme is shown to be feasible for intervals as small as 1000 references under certain conditions.
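
The general idea of checkpoint and rollback recovery can be illustrated with a small software sketch. This is not the hardware mechanism of the paper, which places the support in the virtual memory translation hardware; it is a toy model under stated assumptions. The class name CheckpointedMemory, the page size, and all method names are hypothetical and chosen only for illustration. The sketch keeps a copy of each page the first time it is written after a checkpoint, so that a rollback can restore the state as of that checkpoint.

```python
class CheckpointedMemory:
    """Toy paged memory with checkpoint and rollback (illustrative only)."""

    def __init__(self, num_pages, page_size=4):
        self.page_size = page_size
        self.pages = [[0] * page_size for _ in range(num_pages)]
        # Pre-images of pages modified since the last checkpoint.
        self.saved = {}

    def checkpoint(self):
        # Commit the current state by discarding the saved pre-images.
        self.saved.clear()

    def write(self, addr, value):
        page, offset = divmod(addr, self.page_size)
        if page not in self.saved:
            # First write to this page in the current interval: keep a copy.
            self.saved[page] = list(self.pages[page])
        self.pages[page][offset] = value

    def read(self, addr):
        page, offset = divmod(addr, self.page_size)
        return self.pages[page][offset]

    def rollback(self):
        # Restore every page modified since the last checkpoint.
        for page, old_contents in self.saved.items():
            self.pages[page] = old_contents
        self.saved.clear()


mem = CheckpointedMemory(num_pages=2)
mem.write(1, 10)
mem.checkpoint()   # state with address 1 == 10 is now the recovery point
mem.write(1, 99)   # modifications made after the checkpoint
mem.write(5, 7)
mem.rollback()     # simulated error: discard the modifications
```

In this toy model the cost of a checkpoint interval grows with the number of distinct pages written during the interval, which gives some intuition for why overhead depends on the interval between checkpoints.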


Index Terms:
virtual checkpoints; failure tolerance; performance analysis; rollback recovery; virtual memory translation hardware; address space; trace-driven simulation; fault tolerant computing; performance evaluation.
N.S. Bowen, D.K. Pradhan, "Virtual Checkpoints: Architecture and Performance," IEEE Transactions on Computers, vol. 41, no. 5, pp. 516-525, May 1992, doi:10.1109/12.142677