Software Cost Estimation with Incomplete Data
October 2001 (vol. 27 no. 10)
pp. 890-908

Abstract—The construction of software cost estimation models remains an active topic of research. The basic premise of cost modeling is that a historical database of software project cost data can be used to develop a quantitative model to predict the cost of future projects. One of the difficulties faced by workers in this area is that many of these historical databases contain substantial amounts of missing data. Thus far, the common practice has been to ignore observations with missing data. In principle, such a practice can lead to gross biases and may be detrimental to the accuracy of cost estimation models. In this paper, we describe an extensive simulation where we evaluate different techniques for dealing with missing data in the context of software cost modeling. Three techniques are evaluated: listwise deletion, mean imputation, and eight different types of hot-deck imputation. Our results indicate that all the missing data techniques perform well with small biases and high precision. This suggests that the simplest technique, listwise deletion, is a reasonable choice. However, this will not necessarily provide the best performance. Consistent best performance (minimal bias and highest precision) can be obtained by using hot-deck imputation with Euclidean distance and a z-score standardization.
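
As an illustration of the three families of techniques named in the abstract, the following is a minimal Python sketch and not the authors' implementation. It assumes each project is a row of numeric values with None marking a missing entry. The hot-deck variant shown standardizes each column with z-scores computed from observed values, then copies the missing entries from the nearest complete row (the donor) by Euclidean distance over the columns the incomplete row does have. The function names, the choice of a single nearest donor, and the use of the population standard deviation are all assumptions made for the example; the paper evaluates eight hot-deck variants that are not reproduced here.

```python
import math
from statistics import mean, pstdev


def listwise_deletion(rows):
    """Keep only observations that have no missing values."""
    return [r[:] for r in rows if None not in r]


def column_stats(rows):
    """Return (mean, std) per column, computed over observed values only.

    Assumes every column has at least one observed value.
    A zero standard deviation is replaced by 1.0 to avoid division by zero.
    """
    stats = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        sd = pstdev(seen)
        stats.append((mean(seen), sd if sd > 0 else 1.0))
    return stats


def mean_imputation(rows):
    """Replace each missing value with the mean of its column."""
    stats = column_stats(rows)
    return [[stats[j][0] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def hot_deck_imputation(rows):
    """Fill missing values from the nearest complete row (the donor).

    Distance is Euclidean on z-score standardized values, measured over
    the columns observed in the incomplete row. Assumes at least one
    complete row exists to act as a donor.
    """
    stats = column_stats(rows)
    donors = listwise_deletion(rows)

    def z(value, j):
        m, sd = stats[j]
        return (value - m) / sd

    def distance(recipient, donor, cols):
        return math.sqrt(sum((z(recipient[j], j) - z(donor[j], j)) ** 2
                             for j in cols))

    result = []
    for r in rows:
        if None not in r:
            result.append(r[:])
            continue
        observed = [j for j, v in enumerate(r) if v is not None]
        donor = min(donors, key=lambda d: distance(r, d, observed))
        result.append([donor[j] if v is None else v for j, v in enumerate(r)])
    return result


# Hypothetical project data: [size, team size, effort]; None marks a gap.
projects = [
    [100, 10, 5.0],
    [200, 20, 11.0],
    [150, None, 9.0],
    [400, 40, 20.0],
]
complete_only = listwise_deletion(projects)
mean_filled = mean_imputation(projects)
hot_deck_filled = hot_deck_imputation(projects)
```

On this toy data, listwise deletion discards the third project entirely, mean imputation fills its team size with the column average, and hot-deck imputation borrows the team size of the most similar complete project.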


Index Terms:
Software cost estimation, missing data, imputation, data quality, cost modeling.
Citation:
Kevin Strike, Khaled El Emam, Nazim Madhavji, "Software Cost Estimation with Incomplete Data," IEEE Transactions on Software Engineering, vol. 27, no. 10, pp. 890-908, Oct. 2001, doi:10.1109/32.962560