Complex high-assurance software systems depend heavily on the reliability of their underlying software applications. Early identification of high-risk modules can help direct quality enhancement efforts to the modules that are likely to contain a high number of faults. Regression tree models are simple and effective software quality prediction models, and timely predictions from such models can be used to achieve high software reliability.
This paper presents a case study from our comprehensive evaluation (with several large case studies) of currently available regression tree algorithms for software fault prediction: CART-LS (least squares), S-PLUS, and CART-LAD (least absolute deviation). The case study comprises software design metrics collected from a large network telecommunications system consisting of almost 13 million lines of code. Tree models using design metrics are built to predict the number of faults in modules. Besides being compared for fault prediction accuracy, the algorithms are also compared based on the structure and complexity of their tree models. Two performance metrics, average absolute error and average relative error, are used to evaluate fault prediction accuracy. The number of terminal nodes and the number of independent variables utilized by a tree model are used to gauge the complexity and structure of the regression trees. It was observed that S-PLUS trees had poor predictive accuracy and were larger and more complex. On the other hand, CART-LAD trees had good accuracy and were simple and easy to interpret.
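As an illustration of the two accuracy metrics named above, the following minimal sketch computes an average absolute error and an average relative error over a set of modules. The abstract does not give the formulas, so the definitions here are assumptions: the absolute error is taken as the mean of |actual - predicted|, and the relative error divides each absolute error by (actual + 1) so that fault-free modules do not cause division by zero. The function names and the sample data are hypothetical.

```python
from typing import Sequence


def average_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of |actual - predicted| over all modules (assumed definition)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_relative_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of |actual - predicted| / (actual + 1) over all modules.

    The "+ 1" in the denominator is an assumption made here so that
    modules with zero actual faults remain well defined.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) / (a + 1) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical fault counts for four modules (not data from the case study).
actual_faults = [0, 2, 5, 1]
predicted_faults = [1, 2, 3, 1]

aae = average_absolute_error(actual_faults, predicted_faults)
are = average_relative_error(actual_faults, predicted_faults)
print(f"AAE = {aae:.3f}, ARE = {are:.3f}")
```

Lower values of either metric indicate more accurate fault predictions. The relative variant weights an error of a given size more heavily when it occurs in a module with few actual faults.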