Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis
December 1988 (vol. 14, no. 12), pp. 1743-1757

A general solution method for the automatic generation of decision (or classification) trees is investigated. The approach is to provide insight through in-depth empirical characterization and evaluation of decision trees for one problem domain, specifically, software resource data analysis. The purpose of the decision trees is to identify classes of objects (software modules) that had high development effort, i.e., effort in the uppermost quartile relative to past data. Sixteen software systems, ranging from 3000 to 112,000 source lines, were selected for analysis from a NASA production environment. The collection and analysis of 74 attributes (or metrics) for over 4700 objects captured a multitude of information about the objects: development effort, faults, changes, design style, and implementation style. A total of 9600 decision trees were automatically generated and evaluated. The analysis focuses on the characterization and evaluation of decision tree accuracy, complexity, and composition. On average across all 9600 trees, the decision trees correctly identified 79.3% of the software modules that had high development effort or faults; the trees generated from the best parameter combinations correctly identified 88.4% of the modules on average. Visualization of the results is emphasized, and sample decision trees are included.
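
To make the setup concrete, the following is a minimal Python sketch of the classification task the abstract describes, not the authors' generation tool: modules whose development effort falls in the uppermost quartile of past (training) data are labeled "high effort", a decision tree is induced from module metrics, and the fraction of high-effort modules the tree correctly identifies is reported. The synthetic metric data, the four-attribute feature set, and the use of scikit-learn's DecisionTreeClassifier are all illustrative assumptions (the paper collected 74 attributes for over 4700 modules and generated 9600 trees).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical per-module metrics; stand-ins for attributes such as
# size, complexity, and change counts (the paper used 74 of these).
n_modules = 400
metrics = rng.uniform(size=(n_modules, 4))
effort = metrics @ np.array([40.0, 25.0, 10.0, 5.0]) + rng.normal(0.0, 4.0, n_modules)

# "High effort" = uppermost quartile relative to past data, so the
# threshold comes from the training portion only.
train, test = np.arange(300), np.arange(300, n_modules)
threshold = np.quantile(effort[train], 0.75)
labels = (effort >= threshold).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(metrics[train], labels[train])

# Completeness: the share of truly high-effort test modules the tree flags,
# analogous to the paper's "correctly identified" percentages.
pred = tree.predict(metrics[test])
high = labels[test] == 1
print("identified %.1f%% of high-effort modules" % (100.0 * pred[high].mean()))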


Index Terms:
software engineering; machine learning; artificial intelligence; decision trees; software resource analysis; software modules; NASA; production environment; metrics; decision theory; trees (mathematics)
Citation:
R.W. Selby, A.A. Porter, "Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis," IEEE Transactions on Software Engineering, vol. 14, no. 12, pp. 1743-1757, Dec. 1988, doi:10.1109/32.9061