A Procedure for Analyzing Unbalanced Datasets
April 1998 (vol. 24 no. 4)
pp. 278-301

Abstract: This paper describes a procedure for analyzing unbalanced datasets that include many nominal- and ordinal-scale factors. Such datasets are common in company data collected for benchmarking and productivity assessment. Lack of balance causes two major problems: the impact of a factor can be concealed, and spurious impacts can be observed. Both effects are examined using two small artificial datasets. The paper proposes forward pass residual analysis as a method for analyzing such datasets. The procedure is demonstrated on the artificial datasets and then applied to the COCOMO dataset. The paper ends with a discussion of the advantages and limitations of the procedure.
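
The abstract does not spell out the steps of the procedure, so the following Python sketch is only one plausible reading of a forward pass over residuals, not the paper's own algorithm. It assumes that each pass picks the factor with the largest one-way ANOVA F ratio, subtracts that factor's group means, and repeats on the residuals. The dataset, the factor names `A` and `B`, the function names, and the cutoff of 4.0 are all hypothetical; the cutoff is an arbitrary illustrative value, not a critical value from the F distribution.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical unbalanced dataset: y depends only on factor A, but most
# a1 rows are also b1 and most a2 rows are also b2.
CELLS = {
    ("a1", "b1"): [9, 10, 11, 10],
    ("a1", "b2"): [10],
    ("a2", "b1"): [20],
    ("a2", "b2"): [19, 20, 21, 20],
}
ROWS = [{"A": a, "B": b, "y": float(y)}
        for (a, b), ys in CELLS.items() for y in ys]


def group_means(rows, factor, key):
    """Mean of rows[key] for each level of the given factor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[key])
    return {level: mean(values) for level, values in groups.items()}


def f_ratio(rows, factor, key):
    """One-way ANOVA F ratio of rows[key] grouped by factor."""
    means = group_means(rows, factor, key)
    grand = mean(row[key] for row in rows)
    counts = defaultdict(int)
    for row in rows:
        counts[row[factor]] += 1
    between = sum(counts[lvl] * (m - grand) ** 2 for lvl, m in means.items())
    within = sum((row[key] - means[row[factor]]) ** 2 for row in rows)
    df_between = len(means) - 1
    df_within = len(rows) - len(means)
    if df_between == 0 or df_within == 0:
        return 0.0
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / df_between) / (within / df_within)


def forward_pass(rows, factors, response, threshold):
    """Repeatedly remove the strongest remaining factor from the residuals.

    Assumed stopping rule: stop when no remaining factor reaches threshold.
    """
    work = [dict(row, resid=float(row[response])) for row in rows]
    remaining = list(factors)
    selected = []
    while remaining:
        scores = {f: f_ratio(work, f, "resid") for f in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            break
        means = group_means(work, best, "resid")
        for row in work:
            row["resid"] -= means[row[best]]  # keep what 'best' does not explain
        selected.append(best)
        remaining.remove(best)
    return selected, work


if __name__ == "__main__":
    print("raw means by B:", group_means(ROWS, "B", "y"))
    selected, work = forward_pass(ROWS, ["A", "B"], "y", threshold=4.0)
    print("selected factors:", selected)
    print("residual means by B:", group_means(work, "B", "resid"))
```

In this toy data the raw means of `y` differ between the two levels of `B` only because `B` is confounded with `A`. Once the group means of `A` are subtracted, the residual means for both levels of `B` are zero, which is the kind of spurious impact the abstract refers to.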

Index Terms:
Software metrics, statistical analysis, unbalanced datasets, benchmarking data, analysis of variance, residual analysis.
Barbara Kitchenham, "A Procedure for Analyzing Unbalanced Datasets," IEEE Transactions on Software Engineering, vol. 24, no. 4, pp. 278-301, April 1998, doi:10.1109/32.677185