An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors
October 1996 (vol. 45 no. 10)
pp. 1101-1115

Abstract—This paper focuses on the problem of fault tolerance in shared memory multiprocessors, and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware supported backward error recovery mechanism which minimizes the propagation of recovery when a processor fails. The RSM permits a shared memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to be made to applications software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault tolerant shared memory multiprocessors. The performance study has been conducted by simulation using address traces collected from real parallel applications.

Index Terms:
Shared memory multiprocessor, fault tolerance, stable storage, backward error recovery, simulation, performance.
Michel Banâtre, Alain Gefflaut, Philippe Joubert, Christine Morin, Peter A. Lee, "An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors," IEEE Transactions on Computers, vol. 45, no. 10, pp. 1101-1115, Oct. 1996, doi:10.1109/12.543705