<p><b>Abstract</b>—Distributed Shared Virtual Memory (<scp>DSVM</scp>) systems provide a shared-memory abstraction on distributed-memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a <scp>DSVM</scp> increases with the number of sites. Thus, fault-tolerance mechanisms must be implemented to allow processes to continue their execution in the event of a failure. This paper gives an overview of <it>recoverable</it> <scp>DSVM</scp>s (<scp>RDSVM</scp>s) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure.</p>
Distributed systems, distributed shared virtual memory, availability, backward error recovery, consistent states.

