14th International Conference on Distributed Computing Systems (1994)
June 21, 1994 to June 24, 1994
M. Ahuja, Dept. of Comput. Sci. & Eng., California Univ., San Diego, La Jolla, CA, USA
S. Mishra, Dept. of Comput. Sci. & Eng., California Univ., San Diego, La Jolla, CA, USA
We develop a framework that aids in understanding fault-tolerant distributed systems and thereby in designing them. We define a unit of computation in such systems, referred to as a molecule, that has a well-defined interface with other molecules, i.e., has minimal dependence on other molecules. The smallest such unit, an indivisible molecule, is termed an atom. We show that any execution of a fault-tolerant distributed computation can be seen as an execution of molecules or atoms in a partial order. This view provides insight into the computation, particularly for a fault-tolerant system, where it is important to guarantee that a unit of computation is either executed completely or not at all, and where system designers need to reason about the states reached after such units execute. We prove several properties satisfied by molecules and atoms, and present algorithms to detect atoms in an ongoing computation and to force the completion of a molecule. We illustrate uses of this work in application areas such as debugging, checkpointing, and reasoning about stable properties.
fault tolerant computing, reliability, distributed algorithms, distributed processing, program debugging
M. Ahuja and S. Mishra, "Units of computation in fault-tolerant distributed systems," 14th International Conference on Distributed Computing Systems (ICDCS), Poznan, Poland, 1994, pp. 626-633.