Issue No. 05 - May (1994 vol. 5)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/71.282563
<p>Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.</p>
Index Terms: error detection; multiprocessing systems; fault tolerant computing; reliability; failure analysis; almost certain fault diagnosis; algorithm-based fault tolerance; multiprocessor systems; incorrect computations; concurrent error detection mechanisms; erroneous data elements; concurrent error detection; probabilistic analysis
D. Blough and A. Pelc, "Almost Certain Fault Diagnosis Through Algorithm-Based Fault Tolerance," in IEEE Transactions on Parallel & Distributed Systems, vol. 5, no. 5, pp. 532-539, 1994.