Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
April 1990 (vol. 39 no. 4)
pp. 525-537

A methodology is proposed for recognizing the symptoms of persistent problems in large systems. The system error rate is used to identify the error states among which relationships may exist. Statistical techniques are used to validate and quantify the strength of the relationships among these error states. As input, the approach takes the raw error logs, which contain a single entry for each error, each detected as an isolated event. As output, it produces a list of symptoms that characterize persistent errors. Thus, given a failure, it is determined whether the failure is an intermittent manifestation of a common fault or an isolated (transient) incident. The technique is shown to work on two CYBER systems and on an IBM 3081 multiprocessor system. Comparisons to real failure/repair information obtained from field engineers show that, in about 85% of the cases, the error symptoms recognized by this approach correspond to real problems. The remaining 15% of the cases, although not directly supported by field data, are confirmed as being valid problems.
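
The general idea of separating recurring symptoms from isolated incidents can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the burst threshold `max_gap`, the cutoff `min_bursts`, the function names, and the sample log are all hypothetical, and a simple count stands in for the statistical validation the abstract mentions.

```python
from collections import Counter


def group_into_bursts(entries, max_gap):
    """Split (timestamp, error_type) entries into bursts of closely spaced errors.

    A new burst starts whenever the gap to the previous entry exceeds
    max_gap, a hypothetical stand-in for a "high error rate" criterion.
    """
    bursts = []
    last_time = None
    for timestamp, error_type in sorted(entries):
        if last_time is None or timestamp - last_time > max_gap:
            bursts.append([])
        bursts[-1].append(error_type)
        last_time = timestamp
    return bursts


def classify_error_types(entries, max_gap, min_bursts):
    """Label each error type as 'intermittent' or 'transient'.

    Assumption: a type seen in at least min_bursts separate bursts is
    treated as a recurring symptom; everything else is an isolated incident.
    """
    burst_counts = Counter()
    for burst in group_into_bursts(entries, max_gap):
        burst_counts.update(set(burst))  # count each type once per burst
    return {
        error_type: "intermittent" if count >= min_bursts else "transient"
        for error_type, count in burst_counts.items()
    }


# Hypothetical log: (timestamp in seconds, error type).
log = [
    (0, "disk"), (2, "channel"),
    (100, "disk"), (101, "channel"),
    (300, "memory"),
    (500, "disk"),
]
labels = classify_error_types(log, max_gap=10, min_bursts=2)
```

With this sample log, `disk` and `channel` recur across bursts and are labeled intermittent, while the single `memory` error is labeled transient.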

Index Terms:
automatic recognition; statistical techniques; intermittent failures; error rate; raw error logs; CYBER systems; IBM 3081 multiprocessor system; software reliability.
R.K. Iyer, L.T. Young, P.V.K. Iyer, "Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data," IEEE Transactions on Computers, vol. 39, no. 4, pp. 525-537, April 1990, doi:10.1109/12.54845