Predicting the Location and Number of Faults in Large Software Systems
April 2005 (vol. 31 no. 4)
pp. 340-355
Advance knowledge of which files in the next release of a large software system are most likely to contain the largest numbers of faults can be a very valuable asset. To accomplish this, a negative binomial regression model has been developed and used to predict the expected number of faults in each file of the next release of a system. The predictions are based on the code of the file in the current release, and fault and modification history of the file from previous releases. The model has been applied to two large industrial systems, one with a history of 17 consecutive quarterly releases over 4 years, and the other with nine releases over 2 years. The predictions were quite accurate: For each release of the two systems, the 20 percent of the files with the highest predicted number of faults contained between 71 percent and 92 percent of the faults that were actually detected, with the overall average being 83 percent. The same model was also used to predict which files of the first system were likely to have the highest fault densities (faults per KLOC). In this case, the 20 percent of the files with the highest predicted fault densities contained an average of 62 percent of the system's detected faults. However, the identified files contained a much smaller percentage of the code mass than the files selected to maximize the numbers of faults. The model was also used to make predictions from a much smaller input set that only contained fault data from integration testing and later. The prediction was again very accurate, identifying files that contained from 71 percent to 93 percent of the faults, with the average being 84 percent. Finally, a highly simplified version of the predictor selected files containing, on average, 73 percent and 74 percent of the faults for the two systems.
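The accuracy figures above all take the same form: rank the files by predicted fault count, select the top 20 percent, and report the share of actually detected faults that fall in the selected files. The following is a minimal sketch of that calculation, not the authors' code. The function name, the sample numbers, and the choice to round the number of selected files are illustrative assumptions; the abstract does not say how ties or fractional file counts were handled.

```python
def top_fraction_fault_share(predicted, actual, fraction=0.20):
    """Return the share of actual faults found in the files ranked highest
    by predicted fault count.

    predicted -- predicted number of faults per file (one entry per file)
    actual    -- faults actually detected per file, in the same order
    fraction  -- fraction of files to select (0.20 mirrors the abstract)
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0

    # Assumption: round to the nearest whole file, selecting at least one.
    n_selected = max(1, round(fraction * len(predicted)))

    # Rank file indices from highest to lowest predicted fault count.
    # sorted() is stable, so ties keep their original file order.
    ranked = sorted(range(len(predicted)),
                    key=lambda i: predicted[i], reverse=True)
    selected = ranked[:n_selected]

    total_faults = sum(actual)
    if total_faults == 0:
        return 0.0
    return sum(actual[i] for i in selected) / total_faults


# Hypothetical data for ten files (not taken from the paper).
predicted = [5.2, 0.1, 0.3, 3.8, 0.0, 0.4, 0.2, 0.9, 0.1, 0.6]
actual = [6, 0, 1, 3, 0, 0, 0, 1, 0, 1]

share = top_fraction_fault_share(predicted, actual)
print(f"{share:.0%}")  # prints 75%
```

In this made-up example, the two files with the highest predictions hold 9 of the 12 detected faults. The fault-density variant described above would rank by predicted faults per KLOC instead, while still measuring the share of faults captured.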

Index Terms: Software faults, fault-prone, prediction, regression model, empirical study, software testing.
Thomas J. Ostrand, Elaine J. Weyuker, Robert M. Bell, "Predicting the Location and Number of Faults in Large Software Systems," IEEE Transactions on Software Engineering, vol. 31, no. 4, pp. 340-355, April 2005, doi:10.1109/TSE.2005.49