
Ingunn Myrtveit, Erik Stensrud, Ulf H. Olsson, "Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods," IEEE Transactions on Software Engineering, vol. 27, no. 11, pp. 999-1013, November 2001. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 0098-5589. DOI: 10.1109/32.965340.
Keywords: software effort prediction, cost estimation, missing data, imputation methods, listwise deletion, mean imputation, similar response pattern imputation, full information maximum likelihood, log-log regression, ERP.
Abstract—Missing data are often encountered in data sets used to construct effort prediction models. Thus far, the common practice has been to ignore observations with missing data. This may result in biased prediction models. In this paper, we evaluate four missing data techniques (MDTs) in the context of software cost modeling: listwise deletion (LD), mean imputation (MI), similar response pattern imputation (SRPI), and full information maximum likelihood (FIML). We apply the MDTs to an ERP data set, and thereafter construct regression-based prediction models using the resulting data sets. The evaluation suggests that only FIML is appropriate when the data are not missing completely at random (MCAR). Unlike FIML, prediction models constructed on LD, MI, and SRPI data sets will be biased unless the data are MCAR. Furthermore, compared to LD, MI and SRPI seem appropriate only if the resulting LD data set is too small to enable the construction of a meaningful regression-based prediction model.
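To make the two simplest techniques named in the abstract concrete, the following is a minimal sketch of listwise deletion (LD) and mean imputation (MI) in plain Python. It is not the authors' implementation: the data values, the two predictor columns, and the function names are hypothetical, chosen only for illustration. SRPI and FIML are not shown.

```python
from statistics import mean

# Hypothetical project records with two predictor variables per project.
# None marks a missing value. These numbers are invented for illustration.
projects = [
    (10.0, 100.0),
    (20.0, None),
    (None, 300.0),
    (40.0, 400.0),
]

def listwise_deletion(rows):
    """LD: keep only the rows that have no missing values."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_imputation(rows):
    """MI: replace each missing value with the mean of the observed
    values in the same column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        tuple(col_means[j] if v is None else v for j, v in enumerate(row))
        for row in rows
    ]

ld_rows = listwise_deletion(projects)  # complete cases only
mi_rows = mean_imputation(projects)    # all rows kept, gaps filled with column means
```

In this toy data set LD discards half of the observations while MI keeps all of them. That is the trade-off the abstract points to when it says MI and SRPI seem appropriate only if the LD data set is too small to support a meaningful regression model.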