Software Dependability in the Tandem GUARDIAN System
May 1995 (vol. 21 no. 5)
pp. 455-467
Based on extensive field failure data for Tandem’s GUARDIAN operating system, this paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.
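As a rough illustration of the last point, the percentages quoted above can be turned into back-of-envelope arithmetic. The sketch below is not the model proposed in the paper. The function names, the assumption that every failure counts equally, and the example count of 100 failures are hypothetical and used only for illustration.

```python
# Back-of-envelope arithmetic using the percentages quoted in the abstract.
# This is NOT the paper's reliability model; all names here are hypothetical.

TOLERATED_FRACTION = 0.75   # reported software faults tolerated by process pairs
RECURRENCE_FRACTION = 0.72  # measured software failures that are recurrences


def untolerated_fraction(tolerated=TOLERATED_FRACTION):
    """Fraction of reported software faults that process pairs do not tolerate."""
    return 1.0 - tolerated


def remaining_failures(total_failures, recurrences_avoided,
                       recurrence_fraction=RECURRENCE_FRACTION):
    """Failures left if a given share (0..1) of recurrences is avoided.

    Assumes, for illustration only, that all failures count equally and that
    avoiding a recurrence removes exactly one failure.
    """
    if not 0.0 <= recurrences_avoided <= 1.0:
        raise ValueError("recurrences_avoided must be between 0 and 1")
    return total_failures * (1.0 - recurrence_fraction * recurrences_avoided)


if __name__ == "__main__":
    print("untolerated fraction:", untolerated_fraction())
    # Hypothetical: 100 software failures, half of the recurrences avoided.
    print("remaining failures:", remaining_failures(100, 0.5))
```

Under these simplifying assumptions, avoiding half of the recurrences would remove roughly a third of all measured software failures, which is consistent with the abstract's remark that lowering the recurrence rate is a lever in addition to lowering the number of faults.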

Index Terms:
Measurement, fault categorization, software fault tolerance, recurrence, software reliability, operational phase, Tandem GUARDIAN System.
Inhwan Lee, Ravishankar K. Iyer, "Software Dependability in the Tandem GUARDIAN System," IEEE Transactions on Software Engineering, vol. 21, no. 5, pp. 455-467, May 1995, doi:10.1109/32.387474