Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods
November 2001 (vol. 27 no. 11)
pp. 999-1013

Abstract—Missing data are often encountered in data sets used to construct effort prediction models. Thus far, the common practice has been to ignore observations with missing data. This may result in biased prediction models. In this paper, we evaluate four missing data techniques (MDTs) in the context of software cost modeling: listwise deletion (LD), mean imputation (MI), similar response pattern imputation (SRPI), and full information maximum likelihood (FIML). We apply the MDTs to an ERP data set, and thereafter construct regression-based prediction models using the resulting data sets. The evaluation suggests that only FIML is appropriate when the data are not missing completely at random (MCAR). Unlike FIML, prediction models constructed on LD, MI and SRPI data sets will be biased unless the data are MCAR. Furthermore, compared to LD, MI and SRPI seem appropriate only if the resulting LD data set is too small to enable the construction of a meaningful regression-based prediction model.
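
As a rough illustration of the two simplest techniques named in the abstract, the sketch below applies listwise deletion (LD) and mean imputation (MI) to a small table. It is not the authors' implementation: the function names, the use of `None` as the missing-value marker, and the toy project data are all assumptions made for this example. SRPI and FIML are not shown.

```python
from statistics import mean


def listwise_deletion(rows):
    """Keep only the rows that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data, one row per project: (size, team experience, effort).
projects = [
    [10.0, 3.0, 100.0],
    [20.0, None, 220.0],
    [30.0, 5.0, 290.0],
    [None, 4.0, 150.0],
]

ld = listwise_deletion(projects)  # drops the two incomplete rows
mi = mean_imputation(projects)    # keeps all four rows, filling gaps with column means
```

Here LD halves the sample while MI retains every project, which mirrors the trade-off the abstract describes: imputation becomes attractive mainly when the LD data set would be too small to support a regression model.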

Index Terms:
Software effort prediction, cost estimation, missing data, imputation methods, listwise deletion, mean imputation, similar response pattern imputation, full information maximum likelihood, log-log regression, ERP.
Ingunn Myrtveit, Erik Stensrud, Ulf H. Olsson, "Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods," IEEE Transactions on Software Engineering, vol. 27, no. 11, pp. 999-1013, Nov. 2001, doi:10.1109/32.965340