Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings
July/August 2008 (vol. 34 no. 4)
pp. 485-496
Stefan Lessmann, University of Hamburg, Hamburg
Bart Baesens, K.U.Leuven, Leuven
Christophe Mues, University of Southampton, Southampton
Swantje Pietsch, University of Hamburg, Hamburg
Software defect prediction strives to improve software quality and testing efficiency by constructing predictive classification models from code attributes to enable a timely identification of fault-prone modules. Several classification models have been evaluated for this task. However, due to inconsistent findings regarding the superiority of one classifier over another and the usefulness of metric-based classification in general, more research is needed to improve convergence across studies and further advance confidence in experimental results. We consider three potential sources for bias: comparing classifiers over one or a small number of proprietary datasets, relying on accuracy indicators that are conceptually inappropriate for software defect prediction and cross-study comparisons, and finally, limited use of statistical testing procedures to secure empirical findings. To remedy these problems, a framework for comparative software defect prediction experiments is proposed and applied in a large-scale empirical comparison of 22 classifiers over ten public domain datasets from the NASA Metrics Data repository. Our results indicate that the importance of the particular classification algorithm may have been overestimated in previous research since no significant performance differences could be detected among the top-17 classifiers.
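The evaluation protocol the abstract outlines (rank classifiers by AUC on each dataset, then test whether rank differences are significant across datasets) can be sketched as follows. This is a minimal illustration of the Friedman-test approach popularized by Demšar [12], not the authors' code; the AUC values are made-up placeholders, and ties in AUC are ignored for simplicity.

```python
# Hedged sketch of the paper's evaluation protocol: rank classifiers by AUC
# on each dataset, then compute the Friedman chi-square statistic (Demsar [12])
# to check whether observed rank differences are statistically significant.
# AUC values below are illustrative placeholders, not the paper's results.

def average_ranks(scores_per_dataset):
    """scores_per_dataset: list of per-dataset AUC lists (higher is better).
    Returns the average rank of each classifier (rank 1 = best).
    Ties are broken by position for simplicity."""
    k = len(scores_per_dataset[0])
    totals = [0.0] * k
    for scores in scores_per_dataset:
        order = sorted(range(k), key=lambda j: scores[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    n = len(scores_per_dataset)
    return [t / n for t in totals]

def friedman_statistic(scores_per_dataset):
    """Friedman chi-square statistic over N datasets and k classifiers:
    12N/(k(k+1)) * (sum of squared average ranks - k(k+1)^2/4)."""
    n = len(scores_per_dataset)
    k = len(scores_per_dataset[0])
    r = average_ranks(scores_per_dataset)
    return 12.0 * n / (k * (k + 1)) * (
        sum(rj ** 2 for rj in r) - k * (k + 1) ** 2 / 4.0)

# Illustrative AUC matrix: 3 classifiers (columns) over 4 datasets (rows).
aucs = [
    [0.78, 0.74, 0.71],
    [0.81, 0.79, 0.70],
    [0.77, 0.75, 0.72],
    [0.80, 0.73, 0.69],
]
print(average_ranks(aucs))      # -> [1.0, 2.0, 3.0]
print(friedman_statistic(aucs)) # -> 8.0
```

If the Friedman test rejects the null hypothesis of equal ranks, a post hoc procedure such as the Nemenyi test is then used for pairwise comparisons; in the paper's experiments, no significant differences were found among the top 17 classifiers.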

[1] C. Andersson, “A Replicated Empirical Study of a Selection Method for Software Reliability Growth Models,” Empirical Software Eng., vol. 12, no. 2, pp. 161-182, 2007.
[2] C. Andersson and P. Runeson, “A Replicated Quantitative Analysis of Fault Distributions in Complex Software Systems,” IEEE Trans. Software Eng., vol. 33, no. 5, pp. 273-286, May 2007.
[3] B. Baesens, T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, and J. Vanthienen, “Benchmarking State-of-the-Art Classification Algorithms for Credit Scoring,” J. Operational Research Soc., vol. 54, no. 6, pp. 627-635, 2003.
[4] V.R. Basili, L.C. Briand, and W.L. Melo, “A Validation of Object-Oriented Design Metrics as Quality Indicators,” IEEE Trans. Software Eng., vol. 22, no. 10, pp. 751-761, Oct. 1996.
[5] C.M. Bishop, Neural Networks for Pattern Recognition. Oxford Univ. Press, 1995.
[6] A.P. Bradley, “The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145-1159, 1997.
[7] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[8] L.C. Briand, V.R. Basili, and C.J. Hetmanski, “Developing Interpretable Models with Optimized Set Reduction for Identifying High-Risk Software Components,” IEEE Trans. Software Eng., vol. 19, no. 11, pp. 1028-1044, Nov. 1993.
[9] L.C. Briand, W.L. Melo, and J. Wüst, “Assessing the Applicability of Fault-Proneness Models Across Object-Oriented Software Projects,” IEEE Trans. Software Eng., vol. 28, no. 7, pp. 706-720, July 2002.
[10] M. Chapman, P. Callis, and W. Jackson, “Metrics Data Program,” NASA IV and V Facility, http:/, 2004.
[11] J.G. Cleary and L.E. Trigg, “K*: An Instance-Based Learner Using an Entropic Distance Measure,” Proc. 12th Int'l Conf. Machine Learning, 1995.
[12] J. Demšar, “Statistical Comparisons of Classifiers over Multiple Data Sets,” J. Machine Learning Research, vol. 7, pp. 1-30, 2006.
[13] J. Dougherty, R. Kohavi, and M. Sahami, “Supervised and Unsupervised Discretization of Continuous Features,” Proc. 12th Int'l Conf. Machine Learning, 1995.
[14] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. Wiley, 2001.
[15] K. El-Emam, S. Benlarbi, N. Goel, and S.N. Rai, “Comparing Case-Based Reasoning Classifiers for Predicting High-Risk Software Components,” J. Systems and Software, vol. 55, no. 3, pp. 301-320, 2001.
[16] K. El-Emam, W. Melo, and J.C. Machado, “The Prediction of Faulty Classes Using Object-Oriented Design Metrics,” J. Systems and Software, vol. 56, no. 1, pp. 63-75, 2001.
[17] T. Fawcett, “An Introduction to ROC Analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.
[18] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From Data Mining to Knowledge Discovery in Databases: An Overview,” AI Magazine, vol. 17, no. 3, pp. 37-54, 1996.
[19] N. Fenton and M. Neil, “A Critique of Software Defect Prediction Models,” IEEE Trans. Software Eng., vol. 25, no. 5, pp. 675-689, Sept./Oct. 1999.
[20] N.E. Fenton and N. Ohlsson, “Quantitative Analysis of Faults and Failures in a Complex Software System,” IEEE Trans. Software Eng., vol. 26, no. 8, pp. 797-814, Aug. 2000.
[21] Y. Freund and L. Mason, “The Alternating Decision Tree Learning Algorithm,” Proc. 16th Int'l Conf. Machine Learning, 1999.
[22] Y. Freund and R.E. Schapire, “Large Margin Classification Using the Perceptron Algorithm,” Machine Learning, vol. 37, no. 3, pp. 277-296, 1999.
[23] K. Ganesan, T.M. Khoshgoftaar, and E.B. Allen, “Case-Based Software Quality Prediction,” Int'l J. Software Eng. and Knowledge Eng., vol. 10, no. 2, pp. 139-152, 2000.
[24] L. Guo, Y. Ma, B. Cukic, and H. Singh, “Robust Prediction of Fault-Proneness by Random Forests,” Proc. 15th Int'l Symp. Software Reliability Eng., 2004.
[25] M.A. Hall and G. Holmes, “Benchmarking Attribute Selection Techniques for Discrete Class Data Mining,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 6, pp. 1437-1447, Nov./Dec. 2003.
[26] M.H. Halstead, Elements of Software Science. Elsevier, 1977.
[27] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2002.
[28] T.M. Khoshgoftaar and E.B. Allen, “Logistic Regression Modeling of Software Quality,” Int'l J. Reliability, Quality and Safety Eng., vol. 6, no. 4, pp. 303-317, 1999.
[29] T.M. Khoshgoftaar, E.B. Allen, J.P. Hudepohl, and S.J. Aud, “Application of Neural Networks to Software Quality Modeling of a Very Large Telecommunications System,” IEEE Trans. Neural Networks, vol. 8, no. 4, pp. 902-909, 1997.
[30] T.M. Khoshgoftaar, E.B. Allen, W.D. Jones, and J.P. Hudepohl, “Classification-Tree Models of Software-Quality over Multiple Releases,” IEEE Trans. Reliability, vol. 49, no. 1, pp. 4-11, 2000.
[31] T.M. Khoshgoftaar, A.S. Pandya, and D.L. Lanning, “Application of Neural Networks for Predicting Faults,” Annals of Software Eng., vol. 1, no. 1, pp. 141-154, 1995.
[32] T.M. Khoshgoftaar and N. Seliya, “Analogy-Based Practical Classification Rules for Software Quality Estimation,” Empirical Software Eng., vol. 8, no. 4, pp. 325-350, 2003.
[33] T.M. Khoshgoftaar and N. Seliya, “Comparative Assessment of Software Quality Classification Techniques: An Empirical Case Study,” Empirical Software Eng., vol. 9, no. 3, pp. 229-257, 2004.
[34] T.M. Khoshgoftaar, N. Seliya, and N. Sundaresh, “An Empirical Study of Predicting Software Faults with Case-Based Reasoning,” Software Quality J., vol. 14, no. 2, pp. 85-111, 2006.
[35] A.G. Koru and H. Liu, “An Investigation of the Effect of Module Size on Defect Prediction Using Static Measures,” Proc. Workshop Predictor Models in Software Eng., 2005.
[36] N. Landwehr, M. Hall, and E. Frank, “Logistic Model Trees,” Machine Learning, vol. 59, no. 1, pp. 161-205, 2005.
[37] F. Lanubile and G. Visaggio, “Evaluating Predictive Quality Models Derived from Software Measures: Lessons Learned,” J. Systems and Software, vol. 38, no. 3, pp. 225-234, 1997.
[38] J. Li, G. Ruhe, A. Al-Emran, and M. Richter, “A Flexible Method for Software Effort Estimation by Analogy,” Empirical Software Eng., vol. 12, no. 1, pp. 65-106, 2007.
[39] D.J.C. MacKay, “The Evidence Framework Applied to Classification Networks,” Neural Computation, vol. 4, no. 5, pp. 720-736, 1992.
[40] O.L. Mangasarian and D.R. Musicant, “Lagrangian Support Vector Machines,” J. Machine Learning Research, vol. 1, pp. 161-177, 2001.
[41] T.J. McCabe, “A Complexity Measure,” IEEE Trans. Software Eng., vol. 2, no. 4, pp. 308-320, 1976.
[42] T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald, “Problems with Precision: A Response to Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors',” IEEE Trans. Software Eng., vol. 33, no. 9, pp. 637-640, Sept. 2007.
[43] T. Menzies, J. DiStefano, A. Orrego, and R. Chapman, “Assessing Predictors of Software Defects,” Proc. Workshop Predictive Software Models, 2004.
[44] T. Menzies, J. Greenwald, and A. Frank, “Data Mining Static Code Attributes to Learn Defect Predictors,” IEEE Trans. Software Eng., vol. 33, no. 1, pp. 2-13, Jan. 2007.
[45] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid Prototyping for Complex Data Mining Tasks,” Proc. 12th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2006.
[46] J. Mingers, “An Empirical Comparison of Pruning Methods for Decision Tree Induction,” Machine Learning, vol. 4, no. 2, pp. 227-243, 1989.
[47] J.C. Munson and T.M. Khoshgoftaar, “The Detection of Fault-Prone Programs,” IEEE Trans. Software Eng., vol. 18, no. 5, pp. 423-433, May 1992.
[48] I. Myrtveit and E. Stensrud, “A Controlled Experiment to Assess the Benefits of Estimating with Analogy and Regression Models,” IEEE Trans. Software Eng., vol. 25, no. 4, pp. 510-525, July/Aug. 1999.
[49] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and Validity in Comparative Studies of Software Prediction Models,” IEEE Trans. Software Eng., vol. 31, no. 5, pp. 380-391, May 2005.
[50] M.C. Ohlsson and P. Runeson, “Experience from Replicating Empirical Studies on Prediction Models,” Proc. Eighth Int'l Software Metrics Symp., 2002.
[51] N. Ohlsson and H. Alberg, “Predicting Fault-Prone Software Modules in Telephone Switches,” IEEE Trans. Software Eng., vol. 22, no. 12, pp. 886-894, Dec. 1996.
[52] N. Ohlsson, A.C. Eriksson, and M. Helander, “Early Risk-Management by Identification of Fault Prone Modules,” Empirical Software Eng., vol. 2, no. 2, pp. 166-173, 1997.
[53] A.A. Porter and R.W. Selby, “Evaluating Techniques for Generating Metric-Based Classification Trees,” J. Systems and Software, vol. 12, no. 3, pp. 209-218, 1990.
[54] F. Provost and T. Fawcett, “Robust Classification for Imprecise Environments,” Machine Learning, vol. 42, no. 3, pp. 203-231, 2001.
[55] R.J. Boik, “A Priori Tests in Repeated Measures Designs: Effects of Nonsphericity,” Psychometrika, vol. 46, no. 3, pp. 241-255, 1981.
[56] J. Sayyad Shirabad and T.J. Menzies, “The PROMISE Repository of Software Engineering Databases,” School of Information Technology and Eng., Univ. of Ottawa, uottawa.ca/SERepository, 2005.
[57] N.F. Schneidewind, “Methodology for Validating Software Metrics,” IEEE Trans. Software Eng., vol. 18, no. 5, pp. 410-422, May 1992.
[58] R.W. Selby and A.A. Porter, “Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis,” IEEE Trans. Software Eng., vol. 14, no. 12, pp. 1743-1756, Dec. 1988.
[59] M. Shepperd and G. Kadoda, “Comparing Software Prediction Techniques Using Simulation,” IEEE Trans. Software Eng., vol. 27, no. 11, pp. 1014-1022, Nov. 2001.
[60] M. Shepperd and C. Schofield, “Estimating Software Project Effort Using Analogies,” IEEE Trans. Software Eng., vol. 23, no. 11, pp. 736-743, Nov. 1997.
[61] J.A.K. Suykens and J. Vandewalle, “Least Squares Support Vector Machine Classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293-300, 1999.
[62] M.E. Tipping, “The Relevance Vector Machine,” Advances in Neural Information Processing Systems 12, S.A. Solla, T.K. Leen, and K.-R. Müller, eds., pp. 652-658, MIT Press, 2000.
[63] T. Van Gestel, J.A.K. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle, “Benchmarking Least Squares Support Vector Machine Classifiers,” Machine Learning, vol. 54, no. 1, pp. 5-32, 2004.
[64] O. Vandecruys, D. Martens, B. Baesens, C. Mues, M.D. Backer, and R. Haesen, “Mining Software Repositories for Comprehensible Software Fault Prediction Models,” J. Systems and Software, vol. 81, no. 5, pp. 823-839, 2008.
[65] J.H. Zar, Biostatistical Analysis, fourth ed. Prentice Hall, 1999.
[66] H. Zhang and X. Zhang, “Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors',” IEEE Trans. Software Eng., vol. 33, no. 9, pp. 635-637, Sept. 2007.
[67] S. Zhong, T.M. Khoshgoftaar, and N. Seliya, “Analyzing Software Measurement Data with Clustering Techniques,” IEEE Intelligent Systems, vol. 19, no. 2, pp. 20-27, Mar./Apr. 2004.

Index Terms:
Complexity measures, Data mining, Formal methods, Statistical methods
Stefan Lessmann, Bart Baesens, Christophe Mues, Swantje Pietsch, "Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings," IEEE Transactions on Software Engineering, vol. 34, no. 4, pp. 485-496, July-Aug. 2008, doi:10.1109/TSE.2008.35