Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults
March 2000 (vol. 49 no. 3)
pp. 230-245

Abstract—This paper presents a class of count-and-threshold mechanisms, collectively named $\alpha$-count, that are able to discriminate between transient faults and intermittent faults in computing systems. For many years, commercial systems have used threshold-based techniques to discriminate transient faults. We aim to contribute to the utility of count-and-threshold schemes by exploring their effects on the system. We adopt a mathematically defined structure that is simple enough to analyze with standard tools. $\alpha$-count is equipped with internal parameters that can be tuned to suit environmental variables (such as the transient fault rate and intermittent fault occurrence patterns). We carried out an extensive behavior analysis of two versions of the count-and-threshold scheme, assuming first exponentially distributed fault occurrences and then more realistic fault patterns.
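The abstract does not spell out the update rule. The following is a minimal sketch that assumes the multiplicative-decay formulation commonly associated with $\alpha$-count: the score $\alpha$ starts at 0, grows by 1 on every round in which a component signals an error, is multiplied by a decay factor $K \in [0, 1]$ on every error-free round, and the component is flagged as intermittently (or permanently) faulty once $\alpha \ge \alpha_T$. The class and function names (`AlphaCount`, `first_alarm`) and the parameter labels `k` and `alpha_t` are hypothetical choices for illustration. This is a sketch of one single-threshold variant under those assumptions, not the authors' implementation.

```python
"""Sketch of a single-threshold alpha-count discriminator (hypothetical names)."""


class AlphaCount:
    """Score a component from a stream of per-round test outcomes.

    Assumed update rule:
      alpha <- alpha + 1   if the round signals an error
      alpha <- alpha * K   otherwise, with 0 <= K <= 1
    The component is flagged as intermittently/permanently faulty
    once alpha >= alpha_T.
    """

    def __init__(self, k: float, alpha_t: float) -> None:
        if not 0.0 <= k <= 1.0:
            raise ValueError("decay factor K must lie in [0, 1]")
        if alpha_t <= 0:
            raise ValueError("threshold alpha_T must be positive")
        self.k = k
        self.alpha_t = alpha_t
        self.alpha = 0.0

    def update(self, error: bool) -> bool:
        """Apply one round; return True if the threshold is reached."""
        if error:
            self.alpha += 1.0      # an error signal raises suspicion
        else:
            self.alpha *= self.k   # a clean round lets suspicion decay
        return self.alpha >= self.alpha_t


def first_alarm(outcomes, k, alpha_t):
    """Return the 1-based round at which the alarm fires, or None."""
    detector = AlphaCount(k, alpha_t)
    for round_no, error in enumerate(outcomes, start=1):
        if detector.update(error):
            return round_no
    return None


if __name__ == "__main__":
    # Isolated errors separated by clean rounds: the score decays, no alarm.
    sparse = [True, False, False, False, True, False, False, False, True]
    # Clustered errors: the score accumulates faster than it decays.
    bursty = [True, False, True, True, False, True]
    print(first_alarm(sparse, k=0.5, alpha_t=2.0))  # prints None
    print(first_alarm(bursty, k=0.5, alpha_t=2.0))  # prints 4
```

Under these assumptions, sparse errors (typical of transient faults) let the score fall back toward zero, while a burst of errors (typical of an intermittent fault) pushes it past the threshold. Setting `k=0.0` forgets all history after one clean round, and `k=1.0` reduces the scheme to plain error counting. `K` and $\alpha_T$ plausibly correspond to the tunable internal parameters mentioned above.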

Index Terms:
Fault discrimination, threshold-based identification, transient and intermittent faults, modeling and evaluation, fault diagnosis.
Citation:
Andrea Bondavalli, Silvano Chiaradonna, Felicita Di Giandomenico, Fabrizio Grandoni, "Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults," IEEE Transactions on Computers, vol. 49, no. 3, pp. 230-245, March 2000, doi:10.1109/12.841127