
ASCII Text
D.M. Blough, A. Pelc, "Almost Certain Fault Diagnosis Through Algorithm-Based Fault Tolerance," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 5, pp. 532-539, May 1994.
BibTeX
@article{10.1109/71.282563,
  author = {D.M. Blough and A. Pelc},
  title = {Almost Certain Fault Diagnosis Through Algorithm-Based Fault Tolerance},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  volume = {5},
  number = {5},
  issn = {1045-9219},
  year = {1994},
  pages = {532-539},
  doi = {http://doi.ieeecomputersociety.org/10.1109/71.282563},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
RefWorks Procite/RefMan/Endnote
TY  - JOUR
JO  - IEEE Transactions on Parallel and Distributed Systems
TI  - Almost Certain Fault Diagnosis Through Algorithm-Based Fault Tolerance
IS  - 5
SN  - 1045-9219
SP  - 532
EP  - 539
EPD - 532-539
A1  - D.M. Blough
A1  - A. Pelc
PY  - 1994
KW  - Index Terms: error detection; multiprocessing systems; fault tolerant computing; reliability; failure analysis; almost certain fault diagnosis; algorithm-based fault tolerance; multiprocessor systems; incorrect computations; concurrent error detection mechanisms; erroneous data elements; concurrent error detection; probabilistic analysis
VL  - 5
JA  - IEEE Transactions on Parallel and Distributed Systems
ER  -
Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.
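The error-location step mentioned in the abstract can be illustrated with a small Python sketch. It is not the model or the check assignments from the paper, which the abstract does not spell out. It assumes, purely for illustration, that each check covers a fixed set of data elements, that a check fails exactly when it covers at least one erroneous element, and that each element is erroneous independently with probability p. The names CHECKS, failed_checks, locate_errors, and estimate_location_probability are hypothetical.

```python
import random

# Hypothetical check assignment: 6 data elements laid out as a 2x3 grid,
# with one check per row and one per column (an illustrative choice only).
CHECKS = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]


def failed_checks(checks, erroneous):
    """Indices of checks that fail.

    Assumption: a check fails exactly when it covers an erroneous element,
    i.e. checks never miss an error and never raise a false alarm.
    """
    return {i for i, covered in enumerate(checks) if covered & erroneous}


def locate_errors(n, checks, failed):
    """Locate errors by elimination.

    Every element covered by a passing check is declared correct;
    whatever remains is reported as suspected erroneous.
    """
    cleared = set()
    for i, covered in enumerate(checks):
        if i not in failed:
            cleared |= covered
    return set(range(n)) - cleared


def estimate_location_probability(n, checks, p, trials, seed=0):
    """Monte Carlo estimate of the probability that elimination
    recovers exactly the set of erroneous elements.

    Assumption: each element is erroneous independently with probability p.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        erroneous = {e for e in range(n) if rng.random() < p}
        located = locate_errors(n, checks, failed_checks(checks, erroneous))
        hits += located == erroneous
    return hits / trials


# A single error at element 4 is located exactly.
single = locate_errors(6, CHECKS, failed_checks(CHECKS, {4}))

# Two errors at elements 0 and 4 leave extra suspects (1 and 3),
# so location is ambiguous under this small check assignment.
double = locate_errors(6, CHECKS, failed_checks(CHECKS, {0, 4}))
```

Under these assumptions, a single error is pinned down exactly, while two errors in different rows and columns cannot be separated from two correct elements. Calling estimate_location_probability with different check assignments gives a rough feel for how the number and layout of checks affect the chance of correct location, which is the kind of question the abstract says the paper analyzes.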