14th International Conference on Distributed Computing Systems (1994)
June 21, 1994 to June 24, 1994
L. Brown, Dept. of Comput. Sci. & Eng., Florida Atlantic Univ., Boca Raton, FL, USA
Jie Wu, Dept. of Comput. Sci. & Eng., Florida Atlantic Univ., Boca Raton, FL, USA
Distributed shared memory (DSM) allows multicomputer systems with no physically shared memory to be programmed using a shared memory paradigm. However, as the number of nodes in a system increases, the probability of a failure that can corrupt the DSM also increases. This paper presents a fault-tolerant DSM (FTDSM) algorithm that can tolerate single node failures. Each page in the DSM is assigned a snooper that keeps a backup copy of the page and can take over if the page owner fails. The snooper is dynamic because the responsibility for snooping a page can migrate from node to node. The FTDSM presented is an improvement over other FTDSMs because it is scalable, is based on the efficient dynamic distributed manager (DDM) DSM algorithm, does not require the repair of a failed processor to access the DSM, and does not query all nodes to rebuild the state of the DSM. It is shown that any single node failure can be tolerated because either the owner or the snooper of a page can always be found.
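The owner/snooper idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's protocol, whose details the abstract does not give; it only shows why a page remains recoverable after one node fails when its owner and snooper sit on different nodes. The names `Page`, `pick_other`, and `handle_failure`, and the lowest-numbered-node selection policy, are assumptions made for illustration, and the model assumes at least three nodes.

```python
from dataclasses import dataclass


@dataclass
class Page:
    owner: int      # node holding the authoritative copy
    snooper: int    # node holding the backup copy
    data: bytes


def pick_other(nodes, exclude):
    """Return the lowest-numbered live node not in `exclude` (hypothetical policy)."""
    for n in sorted(nodes):
        if n not in exclude:
            return n
    raise RuntimeError("no spare node available")


def handle_failure(pages, live_nodes, failed):
    """Restore an owner and a snooper for every page after node `failed` crashes."""
    live = set(live_nodes) - {failed}
    for page in pages.values():
        if page.owner == failed:
            # The snooper's backup copy becomes the authoritative copy.
            page.owner = page.snooper
            page.snooper = pick_other(live, {page.owner})
        elif page.snooper == failed:
            # The owner survives; re-establish a backup on another node.
            page.snooper = pick_other(live, {page.owner})
    return live
```

Because a single failure can remove at most one of the two roles for any page, the surviving role holder always has a copy from which the other role can be re-established, which mirrors the abstract's argument that either the owner or the snooper can always be found.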
shared memory systems, distributed memory systems, fault tolerant computing, reliability, software reliability, multiprocessing programs, distributed algorithms, transaction processing
L. Brown and Jie Wu, "Dynamic snooping in a fault-tolerant distributed shared memory," 14th International Conference on Distributed Computing Systems (ICDCS), Poznan, Poland, 1994, pp. 218-226.