Pages: pp. 426-428
Software repositories, such as version control systems, defect tracking systems, and archived project communications, contain a wealth of valuable information for empirical studies in software engineering.
The data in these repositories is available for most large software projects and represents a detailed and rich record of the historical development of software systems. Until recently, these repositories were used primarily for their intended purposes, such as maintaining versions of the source code or tracking the status of a defect. Software practitioners and researchers are beginning to recognize the potential benefit of mining this information for other purposes. Research is now proceeding to uncover the ways in which mining these repositories can help support the maintenance of software systems, improve software design and reuse, and empirically validate novel ideas and techniques.
Unfortunately, software repositories are not designed to facilitate empirical understanding of a software project. Software tools (such as version control and defect tracking systems) tend to have various anomalies and issues in their recorded information. Tools may be used differently in different projects, and different tools are often used in different organizations. More importantly, the quantities of interest such as development effort are usually not directly captured. A key challenge is to build useful theories and models of software development that can be empirically validated using the information in software repositories. There has been progress in overcoming some of these challenges by constructing tools and developing methods to extract, clean, and validate information from software repositories.
A workshop on the topic of mining software repositories (MSR 2004: http://msr.uwaterloo.ca) was held on 25 May 2004 in Edinburgh, UK, in conjunction with the 26th IEEE International Conference on Software Engineering (ICSE). A follow-up workshop (MSR 2005) was held on 17 May 2005 in St. Louis, Missouri, in conjunction with the 27th ICSE. Each workshop had submissions from over 14 countries, and both had over 50 attendees, making the MSR workshop series the most attended ICSE workshop for the last two years. These workshops have established a community of researchers and practitioners who are working to recover and use the data stored in software repositories to further the understanding of software development practices.
The papers in this special issue present several novel and interesting approaches that transform software repositories from static record-keeping repositories into active repositories. These active repositories are then used by researchers to gain an empirically based understanding of software development, and by software practitioners to predict, plan, and understand various aspects of their projects. We hope that the results presented in these papers will highlight the benefits and potential of historical project information. We also hope that the issue will encourage other researchers, throughout the software engineering field, to explore enriching their current research methods and techniques with historical information. We now give an overview of the six papers in this special issue.
The first three papers present tools that utilize version control data to guide and inform new and existing developers when making software modifications or fixing defects. The remaining papers present techniques that investigate how defect density and effort are related to variations in development process and how they can help us understand the nature of software development.
Shopping for a book at Amazon.com, one comes across a section that reads "Customers who bought this book also bought..." That section lists other books that are typically included in the same purchase. Such information is gathered by data mining the historical purchase records of customers. The paper by Zimmermann et al. applies data mining techniques to source control change history to produce recommendations for developers: "Programmers who changed these functions also changed..." Much as the Amazon.com section helps customers browse related items, the ROSE tool developed by Zimmermann et al. guides programmers to likely changes. Historical information is used to suggest related changes, thereby preventing errors due to incomplete changes and detecting coupling that is undetectable by program analysis.
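The co-change mining idea behind such recommendations can be illustrated with a minimal sketch. The transaction data and function names below are invented for illustration, and this is a simplified one-to-one rule miner, not the actual ROSE algorithm:

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical change history: each transaction is the set of
# functions modified together in one commit (names are made up).
history = [
    {"open_conn", "close_conn", "log_event"},
    {"open_conn", "close_conn"},
    {"parse_cfg", "log_event"},
    {"open_conn", "close_conn", "retry"},
]

def cochange_rules(transactions, min_support=2, min_confidence=0.6):
    """Mine simple one-to-one rules of the form
    'programmers who changed X also changed Y'."""
    single = defaultdict(int)   # how often each function changed
    pair = defaultdict(int)     # how often an ordered pair co-changed
    for tx in transactions:
        for f in tx:
            single[f] += 1
        for a, b in permutations(tx, 2):
            pair[(a, b)] += 1
    rules = defaultdict(list)
    for (a, b), n in pair.items():
        confidence = n / single[a]
        if n >= min_support and confidence >= min_confidence:
            rules[a].append((b, confidence))
    return rules

rules = cochange_rules(history)
# Changing open_conn strongly suggests also changing close_conn.
print(rules["open_conn"])
```

In this toy history, `open_conn` and `close_conn` always change together, so editing one yields a high-confidence suggestion for the other, while incidental co-changes fall below the support threshold.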
Software developers who join an existing team must come up to speed on a large and varied amount of information before becoming productive. Sociological and technical difficulties, such as a lack of informal encounters, can make it harder for new members of noncollocated software development teams to learn from their more experienced colleagues. The paper by Čubranić et al. presents a tool, named Hipikat, to address this situation. Hipikat provides developers with convenient access to the group memory for a software development project that is implicitly formed by all of the artifacts, such as bug reports and mailing list postings, that are produced during the development. This project memory is built automatically with little change to existing work practices. The paper presents several studies showing the usefulness of Hipikat to newcomers engaged in software modification tasks.
Static analysis of source code has been successful in locating bugs. The simplest kind of static analysis systems are compilers that perform type checking. A step beyond compilers are tools like Lint or Metal that flag common programming errors by searching for patterns in the code. While static checkers are effective at finding bugs, they can produce a large number of false positive warnings. Therefore, the order in which the warnings produced by a static bug checker are presented may have a significant impact on its usefulness. Checkers that have their false positives scattered evenly throughout their warnings report are likely to frustrate users by making actual errors hard to find. Tools with few false positives at the top of their warnings report are usually perceived by users as more effective. While previous work on improving the ordering of warnings in reports has focused on using properties of the code that contains the flagged error, the paper by Williams and Hollingsworth uses historical change information to refine the order of warnings produced by static code checkers.
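The general idea of history-informed warning ordering can be sketched as follows. This is not Williams and Hollingsworth's actual technique; the warning categories, counts, and fix-rate heuristic below are all invented assumptions used only to illustrate ranking warnings by how often similar warnings were historically fixed:

```python
# Hypothetical history: for each warning category, how many past
# instances were later removed by a bug-fix change vs. left in place.
# Categories and counts are invented for illustration.
historical = {
    "null_deref":  {"fixed": 40, "ignored": 10},
    "unused_var":  {"fixed": 5,  "ignored": 95},
    "buffer_size": {"fixed": 30, "ignored": 20},
}

def rank_warnings(warnings, stats):
    """Order current warnings so that categories with a higher
    historical fix rate (a proxy for true positives) come first."""
    def fix_rate(warning):
        s = stats.get(warning["category"], {"fixed": 0, "ignored": 0})
        total = s["fixed"] + s["ignored"]
        return s["fixed"] / total if total else 0.0
    return sorted(warnings, key=fix_rate, reverse=True)

report = [
    {"category": "unused_var",  "file": "a.c", "line": 10},
    {"category": "null_deref",  "file": "b.c", "line": 22},
    {"category": "buffer_size", "file": "c.c", "line": 7},
]
ranked = rank_warnings(report, historical)
print([w["category"] for w in ranked])
```

Warnings in categories that developers historically acted on float to the top of the report, so a user scanning from the top encounters likely true positives first.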
Open source software development has been touted as having the capacity to compete successfully with, and perhaps in many cases to displace, traditional commercial development methods. The paper by Dinh-Trong and Bieman investigates seven hypotheses about developer participation, core team size, code ownership, productivity, and defect density in the case of the FreeBSD operating system. The paper uses a decade of source code change history and problem reports to revise the hypotheses posed in earlier studies of the Apache server and Mozilla browser. The study finds and explains a significant difference in the size of the core team from previous studies, while other hypotheses receive additional support. The study shows that software repositories can be used to replicate prior studies, to enhance their external validity, and to determine if their findings hold for other systems.
Software reuse can increase productivity by avoiding redevelopment, and can improve quality by incorporating components whose reliability has already been established. The paper by Selby quantifies the characteristics of software reuse in large-scale systems by mining software repositories from a NASA software development environment. His results show a relatively large amount of reuse in that environment. Among many other results, he finds that the modules reused without revision had the fewest faults and the lowest fault correction effort. By contrast, modules reused with major revision had the highest fault correction effort and the highest fault isolation effort.
Software systems are continuously growing in size and complexity. The paper by Purushothaman and Perry presents a study that uses historical change and defect data for a large commercial telephony system to examine the properties of small changes. Among the findings are: 1) the probability that small, one-line changes introduce a fault is very low, and 2) a significant number of changes made to fix faults result in further faults.
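The kind of measurement behind such findings can be sketched in a few lines. The change log below is entirely invented, and the bucketing is a simplifying assumption; the sketch only illustrates how one might estimate a fault-introduction rate per change size from repository data:

```python
from collections import defaultdict

# Hypothetical change log: (lines_changed, introduced_fault) pairs.
# The data is invented purely to illustrate the computation.
changes = [
    (1, False), (1, False), (1, True), (1, False),
    (5, True), (5, False),
    (40, True), (40, True), (40, False),
]

def fault_rate_by_size(log, buckets=((1, 1), (2, 10), (11, 10**6))):
    """Estimate P(fault | change-size bucket) from historical data."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [faulty, total]
    for size, faulty in log:
        for lo, hi in buckets:
            if lo <= size <= hi:
                counts[(lo, hi)][0] += int(faulty)
                counts[(lo, hi)][1] += 1
                break
    return {b: faulty / total for b, (faulty, total) in counts.items()}

rates = fault_rate_by_size(changes)
# One-line changes show the lowest fault-introduction rate here.
print(rates)
```

Linking each change to the faults it later caused is the hard part in practice (it requires tracing fix changes back to the changes that introduced the fault); the arithmetic above starts only after that linkage is established.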
Based on MSR 2004 review reports and discussions during and following the workshop, authors of four papers were invited to submit revised and expanded versions of their workshop paper for this special issue. As a result, three papers were revised and submitted. In addition, an open call for papers was distributed over mailing lists and during ICSE 2004. The call for papers was posted on the IEEE Transactions on Software Engineering's Web site and at http://msr.uwaterloo.ca. In total, 45 submissions, including the three invited submissions, underwent two rounds of full reviewing, resulting in the six papers presented in this special issue.
This issue is the result of a great deal of effort by many people who helped make it a success. We would like to thank them all. The large number of submissions to this issue, representing over 15 percent of submissions to the IEEE Transactions on Software Engineering in 2004, was timely and professionally reviewed by a community of hardworking and devoted reviewers. We would like to thank these reviewers for their well-written reports and their constructive criticism; their effort improved the quality of the papers, as noted by the authors. We thank the authors for keeping up with our tight schedule. We are delighted that the turnaround of this issue is one of the shortest in the Transactions on Software Engineering's history: only eight months from paper submission to publication! We are grateful to the editorial board of the IEEE Transactions on Software Engineering and the editor-in-chief (John Knight) for supporting our proposal for this issue. We also appreciate the support and cooperation provided by the IEEE Computer Society staff (Selina Norman and Suzanne Werner) during the review process.