An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures
May 2000 (vol. 49 no. 5)
pp. 414-430

Abstract: Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and consists of an extension to the existing coherence protocol that manages both the data used by processors for computation and the recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both simulation results and measurements show that our solution is efficient and scalable.

Index Terms:
Distributed shared memory, fault tolerance, coherence protocol, backward error recovery, scalability, performance, COMA, SVM.
Christine Morin, Anne-Marie Kermarrec, Michel Banâtre, Alain Gefflaut, "An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures," IEEE Transactions on Computers, vol. 49, no. 5, pp. 414-430, May 2000, doi:10.1109/12.859537