Design of Algorithm-Based Fault-Tolerant Multiprocessor Systems for Concurrent Error Detection and Fault Diagnosis
Issue No. 10 - October (1994 vol. 5)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/71.313125
<p>Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. We present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.</p>
Index Terms: fault tolerant computing; reliability; multiprocessing systems; fault location; parallel architectures; system recovery; fault-tolerant multiprocessor systems; algorithm-based multiprocessor systems; concurrent error detection; fault diagnosis; algorithm-based fault tolerance; low-overhead system-level error detection; fault location scheme; ABFT systems; design procedure; data element sharing; ABFT system design
N. Jha and V. Vinnakota, "Design of Algorithm-Based Fault-Tolerant Multiprocessor Systems for Concurrent Error Detection and Fault Diagnosis," in IEEE Transactions on Parallel & Distributed Systems, vol. 5, no. 10, pp. 1099-1106, 1994.