Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods
Issue No. 11 - November (2001 vol. 27)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/32.965340
<p><b>Abstract</b>—Missing data are often encountered in data sets used to construct effort prediction models. Thus far, the common practice has been to ignore observations with missing data. This may result in biased prediction models. In this paper, we evaluate four missing data techniques (MDTs) in the context of software cost modeling: listwise deletion (LD), mean imputation (MI), similar response pattern imputation (SRPI), and full information maximum likelihood (FIML). We apply the MDTs to an ERP data set, and thereafter construct regression-based prediction models using the resulting data sets. The evaluation suggests that only FIML is appropriate when the data are not missing completely at random (MCAR). Unlike FIML, prediction models constructed on LD, MI and SRPI data sets will be biased unless the data are MCAR. Furthermore, compared to LD, MI and SRPI seem appropriate only if the resulting LD data set is too small to enable the construction of a meaningful regression-based prediction model.</p>
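To make the two simplest techniques named in the abstract concrete, the following is a minimal sketch of listwise deletion (LD) and mean imputation (MI). It is an illustration under stated assumptions, not the authors' implementation: the records, the field names (`size`, `users`, `effort`), and the helper functions are hypothetical, and the values are invented rather than taken from the ERP data set. SRPI and FIML are not shown.

```python
from statistics import mean

# Hypothetical project records (invented for illustration, not the ERP data).
# None marks a missing value.
projects = [
    {"size": 100.0, "users": 20.0, "effort": 400.0},
    {"size": 250.0, "users": None, "effort": 900.0},
    {"size": None, "users": 35.0, "effort": 650.0},
    {"size": 400.0, "users": 50.0, "effort": 1500.0},
]


def listwise_deletion(rows):
    """LD: keep only the observations that have no missing values."""
    return [dict(r) for r in rows if all(v is not None for v in r.values())]


def mean_imputation(rows):
    """MI: replace each missing value with the mean of the observed
    values of the same variable."""
    columns = list(rows[0].keys())
    col_means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else col_means[c]) for c in columns}
        for r in rows
    ]


ld_rows = listwise_deletion(projects)
mi_rows = mean_imputation(projects)
print(len(ld_rows), "complete observations after LD")
print(len(mi_rows), "observations after MI")
```

In this toy example LD discards half of the observations, while MI keeps all of them at the cost of filling gaps with column means. That trade-off is the one the abstract points to when it says MI and SRPI seem appropriate only if the LD data set is too small for a meaningful regression model.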
Index Terms: Software effort prediction, cost estimation, missing data, imputation methods, listwise deletion, mean imputation, similar response pattern imputation, full information maximum likelihood, log-log regression, ERP.
Ingunn Myrtveit, Erik Stensrud, Ulf H. Olsson, "Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods", IEEE Transactions on Software Engineering, vol. 27, no. 11, pp. 999-1013, November 2001, doi:10.1109/32.965340