The Performance of Cache-Based Error Recovery in Multiprocessors
October 1994 (vol. 5 no. 10)
pp. 1033-1043

Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have recently been developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. This paper evaluates three schemes that differ in how they avoid rollback propagation. Using simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes into the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with high and uncontrollable variability in the checkpoint interval.
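
The general idea described in the abstract can be illustrated with a minimal Python sketch. It is not one of the three schemes evaluated in the paper: it assumes a single processor, a tiny fully associative write-back cache, and main memory acting as the checkpoint state. The class name `CheckpointingCache` and all of its methods are hypothetical names chosen for illustration, and register state and the multiprocessor coherence protocol (where rollback propagation arises) are not modeled.

```python
class CheckpointingCache:
    """Toy write-back cache whose backing memory always holds the last checkpoint.

    Illustrative assumption: dirty lines are never written back individually,
    so memory changes only when a checkpoint is taken.
    """

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory      # checkpoint state: address -> value
        self.lines = {}           # cached lines: address -> (value, dirty)
        self.checkpoints = 0

    def checkpoint(self):
        # Flush every dirty line so that memory becomes the new checkpoint.
        for addr, (value, dirty) in list(self.lines.items()):
            if dirty:
                self.memory[addr] = value
                self.lines[addr] = (value, False)
        self.checkpoints += 1

    def rollback(self):
        # Discard all state written since the last checkpoint.
        # Clean lines still match memory, so they can stay in the cache.
        self.lines = {a: v for a, v in self.lines.items() if not v[1]}

    def _make_room(self):
        if len(self.lines) < self.capacity:
            return
        # Modified replacement policy: evict a clean line if one exists.
        victim = next((a for a, (_, d) in self.lines.items() if not d), None)
        if victim is None:
            # Only dirty lines remain, so the replacement forces a checkpoint.
            self.checkpoint()
            victim = next(iter(self.lines))
        del self.lines[victim]

    def read(self, addr):
        if addr not in self.lines:
            self._make_room()
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        if addr not in self.lines:
            self._make_room()
        self.lines[addr] = (value, True)
```

In this sketch a checkpoint is taken only when a dirty line would otherwise have to be replaced, so the time between checkpoints depends on the program's reference pattern rather than on a timer. That is one plausible reading of why the abstract reports variability in the checkpoint interval that cannot be controlled.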

Index Terms:
buffer storage; shared memory systems; virtual machines; redundancy; system recovery; performance evaluation; cache-based error recovery performance; multiprocessors; cache-based checkpointing; rollback error recovery; transient errors; shared-memory multiprocessors; cache replacement policy; inherent redundancy; memory hierarchy; computation state; rollback propagation; address traces; parallel applications; Encore Multimax; recovery schemes; cache coherence protocol; cache-based schemes; low performance overhead; checkpoint interval
B. Janssens, W.K. Fuchs, "The Performance of Cache-Based Error Recovery in Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 10, pp. 1033-1043, Oct. 1994, doi:10.1109/71.313120