Evaluating Testing Methods by Delivered Reliability
August 1998 (vol. 24 no. 8)
pp. 586-601

Abstract—There are two main goals in testing software: 1) to achieve adequate quality (debug testing), where the objective is to probe the software for defects so that these can be removed, and 2) to assess existing quality (operational testing), where the objective is to gain confidence that the software is reliable. The names are arbitrary, and most testing techniques address both goals to some degree. However, debug methods tend to ignore random selection of test data from an operational profile, while for operational methods this selection is all-important. Debug methods are thought, without any real proof, to be good at uncovering defects so that these can be repaired, but having done so they do not provide a technically defensible assessment of the resulting reliability. Operational methods, on the other hand, provide accurate assessment, but may be less effective at achieving reliability. This paper examines the relationship between the two testing goals, using a probabilistic analysis. We define simple models of programs and their testing, and try to answer theoretically the question of how to attain program reliability: Is it better to test by probing for defects, as in debug testing, or to assess reliability directly, as in operational testing, uncovering defects by accident, so to speak? There is no simple answer, of course. Testing methods are compared in a model in which program failures are detected and the software is changed to eliminate them. The "better" method delivers higher reliability after all test failures have been eliminated. This comparison extends previous work, where the measure was the probability of detecting a failure. Revealing special cases are exhibited in which each kind of testing is superior. A preliminary analysis of the distribution of the delivered reliability indicates that even simple models have unusual statistical properties, suggesting caution in interpreting theoretical comparisons.
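The comparison the abstract sets up can be made concrete with a toy simulation. The sketch below is illustrative only: it assumes an invented three-subdomain program with made-up profile probabilities and failure rates, and a perfect-repair rule, none of which are the model analyzed in the paper. It contrasts sampling test inputs from the operational profile against a crude "debug" strategy that samples subdomains uniformly, and reports the unreliability delivered after testing.

```python
import random

# Toy illustration of the comparison (NOT the paper's exact model): a program
# is a set of input subdomains, each with an operational-profile probability
# p_i and a per-execution failure rate theta_i; a detected failure leads to a
# perfect repair that eliminates the subdomain's fault.
random.seed(1)

PROFILE = [0.70, 0.25, 0.05]     # operational profile: p_i for subdomain i
FAIL_RATE = [0.001, 0.01, 0.20]  # theta_i: chance an input in i fails

def delivered_unreliability(removed):
    """Probability of failure on a random operational input after repairs."""
    return sum(p * th for i, (p, th) in enumerate(zip(PROFILE, FAIL_RATE))
               if i not in removed)

def operational_choice():
    # Operational testing: draw subdomains with profile probabilities.
    return random.choices(range(len(PROFILE)), weights=PROFILE)[0]

def debug_choice():
    # Crude "debug" strategy: sample subdomains uniformly, ignoring the profile.
    return random.randrange(len(PROFILE))

def run_tests(n_tests, choose):
    removed = set()
    for _ in range(n_tests):
        i = choose()
        if i not in removed and random.random() < FAIL_RATE[i]:
            removed.add(i)       # failure observed: fault repaired
    return delivered_unreliability(removed)

print("after operational testing:", run_tests(200, operational_choice))
print("after debug testing:      ", run_tests(200, debug_choice))
```

Under these invented numbers, the uniform strategy spends much of its budget on rarely executed subdomains; whether that effort improves delivered reliability depends on where the high failure rates sit, which is exactly the kind of case analysis the paper pursues.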


Index Terms:
Reliability, debugging, software testing, statistical testing theory.
Phyllis G. Frankl, Richard G. Hamlet, Bev Littlewood, Lorenzo Strigini, "Evaluating Testing Methods by Delivered Reliability," IEEE Transactions on Software Engineering, vol. 24, no. 8, pp. 586-601, Aug. 1998, doi:10.1109/32.707695