
Guest Editor's Introduction: The Promise of Public Software Engineering Data Repositories

Bojan Cukic, West Virginia University

Pages: pp. 20-22

The underpinnings of software engineering come from a unique combination of engineering discipline and scientific foundation. The scientific discovery related to software is based on a centuries-old paradigm common to all fields of science: setting up hypotheses and testing them through experiments. Repeatedly confirmed hypotheses become models that can describe and predict real-world phenomena. All software practitioners are aware of some of the models from their discipline. The best-known models in software engineering describe relationships between development processes, cost and schedule, defects, and numerous software "-ilities" such as reliability, maintainability, and availability.

But, compared to other disciplines, the science of software is relatively new. Unlike in the older fields of science (physics, for example), software models study man-made processes and objects. Predicting human behavior and modeling its effects is a difficult problem, so the process of hypothesizing software engineering models and testing their validity is complicated. It comes as no surprise that most existing software models have proponents and opponents among software engineers. Some practitioners are strongly convinced that predictive models increase productivity and quality, while others can barely hide their skepticism. In most cases, convincing arguments based on past experiences, positive and negative, can support individual opinions in both camps.

Software measures

To better understand why experiences and opinions differ so widely, we need to look into the desirable properties of software engineering measures. Lionel Briand, editor of the journal Empirical Software Engineering, says these measures must be based on explicit and justified assumptions and they must be objective, repeatable, precise, and unbiased. 1 Software engineering measures rarely satisfy all these requirements. Some of them, like cohesion and complexity, are difficult to quantify. Other measures depend on context; that is, they relate to specific development processes or applications.

Software engineering is a decision-intensive discipline. It still struggles with basic questions regarding the utility of models. Can researchers help software practitioners by building models that make explicit the knowledge hidden in various software resources? Can we readily use these models to make predictions regarding some aspect of software? Can we use models to make better decisions? Can we assess those models? Do different researchers working on the same data arrive at similar models? Challenges surrounding the development of predictive software models are the subject of active research. Many predictor models already exist, but information leading to their creation is proprietary. Therefore, model calibration in different development environments is difficult.

Data repositories

Software engineering is not the only discipline that has faced this type of challenge. About a decade ago, the machine-learning and knowledge-engineering communities confronted similar problems. Researchers were happily developing dozens of learning algorithms and defining hundreds of models explaining domains ranging from electrical circuits to medical treatments. But the researchers failed to validate many of these models properly; they couldn't translate research hype into tangible economic benefits regardless of the seemingly endless set of possibilities and opportunities.

As a result, a group of researchers started developing a repository of databases, domain theories, and data generators that the machine-learning community now uses for empirically analyzing machine-learning algorithms. This effort resulted in what's now known as the UCI Machine Learning Repository, 2 and so far, researchers have contributed almost a hundred data sets to it. Furthermore, as the machine-learning technology matured, it became obvious that the relatively small classification-oriented data sets in the UCI MLR could not support knowledge discovery in complex databases. So, researchers developed a new repository, the UCI Knowledge Discovery in Databases Archive, 3 to enable data-mining researchers to scale existing and future data analysis algorithms to very large and complex data sets.

The software engineering community realized at least a decade ago that it needed to develop a similar public repository to host software engineering data sets. Several initiatives were created, some of them backed by US government funding. To mention one, in 1998 the US National Institute of Standards and Technology initiated the Error, Fault, and Failure Data Collection and Analysis Project. 4 This project's goal was to "help industry assess software system quality by collecting, analyzing, and providing error, fault, and failure data and providing statistical methods and tools for analysis on software systems." Unfortunately, a discouragingly low rate of data set submission led to a premature termination of the project. Apparently, in the highly competitive software industry, the risk that publicly released data could link a company or a proprietary software development process with a negative image is a "kiss of death" to software engineering repository initiatives. It's not surprising that the lack of publicly available software engineering data sets results in poorly validated models. The current state of distrust regarding many existing software engineering models, as well as the proliferation of new ones, is a natural consequence.

The Promise Workshop

Driven by these challenges, Tim Menzies and Jelber Sayyad Shirabad initiated the International Workshop on Predictor Models in Software Engineering (PROMISE) to strengthen the community's faith in software engineering models. Workshop organizers have put a twist in the paper submission rules. Together with an article describing a software engineering model, authors must submit a related data set into the publicly available PROMISE repository. Submissions have been encouraging; the Web site currently lists 17 data sets. 5 They describe a wide range of software engineering processes, from defect prediction and cost estimation to software reuse. Some of the data sets come from the NASA Metrics Data Program. More recently, problem reports from several open source development projects found a home in the PROMISE repository. This is a good beginning, and as one of the PROMISE organizers, I hope that the submission trend will continue for the benefit of the entire software engineering community.

In this issue

This special issue of IEEE Software offers selected articles based on papers for the first PROMISE Workshop, held in conjunction with the 2005 International Conference on Software Engineering. The predictive models described in these articles are rather different. A. Günes Koru and Hongfang Liu describe the application of defect-prediction models on several NASA data sets and present the effect of grouping software modules by size. Jane Huffman Hayes, Alex Dekhtyar, and Senthil Karthikeyan Sundaram present their experiences in applying information retrieval methods to requirements tracing. Finally, Zhihao Chen, Tim Menzies, Daniel Port, and Barry Boehm discuss how they applied machine-learning methods to improve and automate the tuning of software cost prediction models.


I hope you find these articles interesting reading and that some of you will become active supporters of the PROMISE software engineering data repository, either by submitting relevant data sets or by validating your preferred predictive models.


About the Authors

Bojan Cukic is an associate professor in the Lane Department of Computer Science and Electrical Engineering at West Virginia University, where he also serves as codirector of the Center for Identification Technology Research. His research interests include software engineering for high-assurance systems, fault-tolerant computing, information assurance, and biometrics. He received a US National Science Foundation Career award and a Tycho Brahe Award for research excellence from the NASA Office of Safety and Mission Assurance. He is a member of the IEEE and the IEEE Computer Society. He received his PhD in computer science from the University of Houston. Contact him at the Lane Dept. of Computer Science and Electrical Eng., PO Box 6109, W. Virginia Univ., Morgantown, WV 26506-6109.