Issue No.05 - May (2000 vol.49)
pp: 414-430
ABSTRACT
Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and extends the existing coherence protocol to manage both the data used by processors for computation and the recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both simulation results and measurements show that our solution is efficient and scalable.
INDEX TERMS
Distributed shared memory, fault tolerance, coherence protocol, backward error recovery, scalability, performance, COMA, SVM.
CITATION
Christine Morin, Anne-Marie Kermarrec, Michel Banâtre, Alain Gefflaut, "An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures", IEEE Transactions on Computers, vol.49, no. 5, pp. 414-430, May 2000, doi:10.1109/12.859537