Issue No. 10 - October (2001 vol. 27)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/32.962560
<p><b>Abstract</b>—The construction of software cost estimation models remains an active topic of research. The basic premise of cost modeling is that a historical database of software project cost data can be used to develop a quantitative model to predict the cost of future projects. One of the difficulties faced by workers in this area is that many of these historical databases contain substantial amounts of missing data. Thus far, the common practice has been to ignore observations with missing data. In principle, such a practice can lead to gross biases and may be detrimental to the accuracy of cost estimation models. In this paper, we describe an extensive simulation where we evaluate different techniques for dealing with missing data in the context of software cost modeling. Three techniques are evaluated: listwise deletion, mean imputation, and eight different types of hot-deck imputation. Our results indicate that all the missing data techniques perform well with small biases and high precision. This suggests that the simplest technique, listwise deletion, is a reasonable choice. However, this will not necessarily provide the best performance. Consistent best performance (minimal bias and highest precision) can be obtained by using hot-deck imputation with Euclidean distance and a z-score standardization.</p>
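The three families of techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper evaluates eight hot-deck variants, and this shows only one plausible reading of "hot-deck imputation with Euclidean distance and a z-score standardization", in which each incomplete observation borrows its missing values from the nearest complete observation. The function names, the use of `None` for a missing value, and the toy data are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def listwise_deletion(rows):
    """Keep only observations that have no missing (None) values."""
    return [list(r) for r in rows if None not in r]


def mean_imputation(rows):
    """Replace each missing value with the mean of its column's observed values."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def hot_deck_imputation(rows):
    """Fill missing values from the nearest complete observation (the donor).

    Distance is Euclidean, computed on z-score standardized values over the
    columns the incomplete row actually has. Assumes at least one complete row.
    """
    n_cols = len(rows[0])
    # z-score parameters per column, estimated from observed values only
    mus, sigmas = [], []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        mus.append(mean(observed))
        sigmas.append(pstdev(observed) or 1.0)  # guard against zero variance

    def z(row, j):
        return (row[j] - mus[j]) / sigmas[j]

    donors = [r for r in rows if None not in r]
    result = []
    for r in rows:
        if None not in r:
            result.append(list(r))
            continue
        shared = [j for j in range(n_cols) if r[j] is not None]

        def dist(d):
            return math.sqrt(sum((z(r, j) - z(d, j)) ** 2 for j in shared))

        donor = min(donors, key=dist)
        result.append([donor[j] if v is None else v for j, v in enumerate(r)])
    return result


# Hypothetical project records: [size, effort, duration]; None marks missing data.
rows = [
    [10.0, 100.0, 5.0],
    [12.0, 110.0, 6.0],
    [50.0, 500.0, 30.0],
    [11.5, None, 5.9],
]
complete_only = listwise_deletion(rows)
mean_filled = mean_imputation(rows)
hot_deck_filled = hot_deck_imputation(rows)
```

In this toy data the incomplete record is closest to the second project, so the hot-deck sketch borrows that project's effort value, whereas mean imputation fills in the column mean, which the one large project pulls upward.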
Index Terms: Software cost estimation, missing data, imputation, data quality, cost modeling.
K. Strike, K. El Emam and N. Madhavji, "Software Cost Estimation with Incomplete Data," in IEEE Transactions on Software Engineering, vol. 27, no. 10, pp. 890-908, Oct. 2001.