Error Recovery in Shared Memory Multiprocessors Using Private Caches
April 1990 (vol. 1 no. 2)
pp. 231-240

The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
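The general idea of cache-based checkpointing can be illustrated with a toy model. The sketch below is not the protocol from the paper: it assumes a single write-back private cache, treats a checkpoint as a flush of all lines modified since the previous checkpoint, and omits the checkpoint identifiers and recovery stacks named in the abstract. The class name `CheckpointingCache`, its methods, and the `registers` dictionary are hypothetical names introduced for illustration only.

```python
class CheckpointingCache:
    """Toy private write-back cache with checkpoint and rollback (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory        # shared memory: dict mapping address -> value
        self.lines = {}             # address -> value held in the private cache
        self.dirty = set()          # addresses modified since the last checkpoint
        self.registers = {}         # live processor state (hypothetical stand-in)
        self.saved_registers = {}   # register snapshot taken at the last checkpoint

    def read(self, addr):
        # On a miss, fetch the line from shared memory.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Writes stay in the private cache until the next checkpoint.
        self.lines[addr] = value
        self.dirty.add(addr)

    def checkpoint(self):
        # Commit tentative updates to shared memory and snapshot the registers.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()
        self.saved_registers = dict(self.registers)

    def export(self, addr):
        # Called when a modified line would leave the cache (replacement or a
        # remote request). Checkpointing first means other processors only ever
        # see committed data, so a later rollback stays local.
        if addr in self.dirty:
            self.checkpoint()
        return self.memory[addr]

    def rollback(self):
        # Discard every line modified since the last checkpoint and restore
        # the saved registers; shared memory was never touched by those writes.
        for addr in self.dirty:
            del self.lines[addr]
        self.dirty.clear()
        self.registers = dict(self.saved_registers)
```

In this model, rollback never propagates because uncommitted data is never visible outside the cache, and recovery is fast because it only invalidates local lines. The cost is the flush at each checkpoint, which is the kind of normal-execution overhead the abstract says checkpoint identifiers and recovery stacks are meant to reduce.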

Index Terms:
fault tolerance; shared memory multiprocessors; private caches; processor transient faults; user-transparent checkpointing; checkpointed computation state; recovery stacks; performance degradation; processor utilization; rollback propagation; rapid recovery; cache coherence protocols; error latency; buffer storage; fault tolerant computing; multiprocessing systems; multiprocessor interconnection networks; system recovery
K.L. Wu, W.K. Fuchs, J.H. Patel, "Error Recovery in Shared Memory Multiprocessors Using Private Caches," IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 2, pp. 231-240, April 1990, doi:10.1109/71.80134