Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture
October 1994 (vol. 43 no. 10)
pp. 1163-1174

We propose a novel architecture for a fault-tolerant multiprocessor environment. The multiprocessor organization is assumed to consist of a pool of active processing modules and either a small number of spare modules or active modules with some spare processing capacity. A checkpoint-based fault-tolerance scheme is developed for duplex systems. Unlike traditional checkpointing schemes, our scheme requires no rollbacks to recover from single faults. The objective is to achieve the performance of a triple modular redundant system using only duplex system redundancy.
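
The roll-forward idea summarized above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the abstract does not give the recovery protocol, so the sketch assumes that a checkpoint mismatch triggers a spare module to re-execute the disputed interval while the duplex pair keeps running, and that the spare's result is then used to copy the fault-free module's state forward. All names (`step`, `run_interval`, `roll_forward_run`) and the bit-flip fault model are hypothetical.

```python
from typing import Optional, Tuple


def step(state: int) -> int:
    """Deterministic work for one checkpoint interval (hypothetical workload)."""
    return state * 31 + 7


def run_interval(state: int, faulty: bool = False) -> int:
    """Run one interval; a fault is modeled (assumption) as a single flipped bit."""
    result = step(state)
    return result ^ 1 if faulty else result


def roll_forward_run(
    initial: int,
    intervals: int,
    fault_at: Optional[int] = None,
    faulty: str = "B",
) -> Tuple[int, int]:
    """Simulate a duplex pair (A, B) plus a spare under a single-fault assumption.

    Returns (final_state, spare_retries). The duplex pair never re-executes
    an interval; only the spare does.
    """
    a = b = initial
    pending = None  # (checkpoint, A's result, B's result) awaiting the spare
    spare_retries = 0

    for i in range(intervals):
        # Both active modules always move forward, even after a mismatch.
        a_next = run_interval(a, faulty=(i == fault_at and faulty == "A"))
        b_next = run_interval(b, faulty=(i == fault_at and faulty == "B"))

        if pending is not None:
            # The spare re-executed the disputed interval in the meantime.
            checkpoint, a_prev, _b_prev = pending
            spare_retries += 1
            if run_interval(checkpoint) == a_prev:
                b_next = a_next  # A was fault-free: copy its state forward to B
            else:
                a_next = b_next  # B was fault-free: copy its state forward to A
            pending = None
        elif a_next != b_next:
            # Mismatch at this checkpoint: remember the last agreed state.
            pending = (a, a_next, b_next)

        a, b = a_next, b_next

    if pending is not None:
        # Mismatch detected at the final checkpoint: resolve it with the spare.
        checkpoint, a_prev, _b_prev = pending
        spare_retries += 1
        a = b = a if run_interval(checkpoint) == a_prev else b

    return a, spare_retries


if __name__ == "__main__":
    print(roll_forward_run(1, 5))                           # fault-free run
    print(roll_forward_run(1, 5, fault_at=2, faulty="B"))   # single fault in B
```

In this sketch a single fault costs one spare re-execution and a state copy, while the active pair loses no completed intervals. A conventional rollback scheme would instead have both modules repeat the disputed interval.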

Index Terms:
multiprocessing systems; fault tolerant computing; redundancy; roll-forward checkpointing scheme; fault-tolerant architecture; multiprocessor environment; active processing modules; triple modular redundant system; duplex system redundancy.
D.K. Pradhan, N.H. Vaidya, "Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture," IEEE Transactions on Computers, vol. 43, no. 10, pp. 1163-1174, Oct. 1994, doi:10.1109/12.324542