Issue No. 02 - March/April (2000 vol. 12)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/69.842263
<p><b>Abstract</b>—This paper proposes a hierarchical error detection framework for a Software Implemented Fault Tolerance (SIFT) layer of a distributed system. A four-level error detection hierarchy is proposed in the context of Chameleon, a software environment for providing adaptive fault-tolerance in an environment of commercial off-the-shelf (COTS) system components and software. The design and implementation of a software-based distributed signature monitoring scheme, which is central to the proposed four-level hierarchy, is described. Both intralevel and interlevel optimizations that minimize the overhead of detection and are capable of adapting to runtime requirements are proposed. The paper presents results from a prototype implementation of two levels of the error detection hierarchy and results of a detailed simulation of the overall environment. The results indicate a substantial increase in availability due to the detection framework and help in understanding the trade-offs between overhead and coverage for different combinations of techniques.</p>
Software implemented fault tolerance, hierarchical error detection, distributed systems, data and control signatures, speculative execution.
Z. Kalbarczyk, S. Bagchi, R. K. Iyer, B. Srinivasan and K. Whisnant, "Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment," in IEEE Transactions on Knowledge & Data Engineering, vol. 12, no. 2, pp. 203-224, 2000.