
Guest Editors' Introduction: Special Issue on Mining Software Repositories

Ahmed E. Hassan, IEEE
Audris Mockus, IEEE
Richard C. Holt, IEEE
Philip M. Johnson, IEEE

Pages: pp. 426-428


Software repositories contain a wealth of valuable information for empirical studies in software engineering. For example:

  • source control systems store changes to the source code as development progresses,
  • defect tracking systems follow the resolution of software defects, and
  • archived communications between project personnel record rationale for decisions throughout the life of a project.

The data in these repositories is available for most large software projects and represents a detailed and rich record of the historical development of software systems. Until recently these repositories were used primarily for their intended activities such as maintaining versions of the source code or tracking the status of a defect. Software practitioners and researchers are beginning to recognize the potential benefit of mining this information for other purposes. Research is now proceeding to uncover the ways in which mining these repositories can help support the maintenance of software systems, improve software design/reuse, and empirically validate novel ideas and techniques.

Unfortunately, software repositories are not designed to facilitate empirical understanding of a software project. Software tools (such as version control and defect tracking systems) tend to have various anomalies and issues in their recorded information. Tools may be used differently in different projects, and different tools are often used in different organizations. More importantly, the quantities of interest such as development effort are usually not directly captured. A key challenge is to build useful theories and models of software development that can be empirically validated using the information in software repositories. There has been progress in overcoming some of these challenges by constructing tools and developing methods to extract, clean, and validate information from software repositories.

A workshop on the topic of mining software repositories (MSR 2004) was held on 25 May 2004 in Edinburgh, UK, in conjunction with the 26th IEEE International Conference on Software Engineering (ICSE). A follow-up workshop (MSR 2005) was held on 17 May 2005 in Saint Louis, Missouri, in conjunction with the 27th ICSE. Each workshop had submissions from over 14 countries. Both workshops had over 50 attendees, making the MSR workshop series the most attended ICSE workshop for the last two years. These workshops have established a community of researchers and practitioners who are working to recover and use the data stored in software repositories to further understanding of software development practices.

The Issue

The papers in this special issue present several novel and interesting approaches to transform software repositories from static record-keeping repositories to active repositories. These active repositories are then used by researchers to gain empirically based understanding of software development, and by software practitioners to predict, plan, and understand various aspects of their project. We hope that the results presented in these papers will highlight the benefits and potential of historical project information. We also hope that the issue will encourage other researchers, throughout the software engineering field, to explore enriching their current research methods and techniques with historical information. We now give an overview of the six papers in this special issue.

The first three papers present tools that utilize version control data to guide and inform new and existing developers when making software modifications or fixing defects. The remaining papers present techniques that investigate how defect density and effort are related to variations in development process and how they can help us understand the nature of software development.

2.1 Guiding Software Changes

Shopping for a book at an online bookstore, one comes across a section that reads "Customers who bought this book also bought..." That section lists other books that are typically included in the same purchase. Such information is gathered by data mining the historical purchase information for customers. The paper by Zimmermann et al. applies data mining techniques to source control change history to produce recommendations for developers: "Programmers who changed these functions also changed..." Much like that section helps customers to browse related items, the ROSE tool developed by Zimmermann et al. guides programmers to locate possible changes. Historical information is used to suggest likely changes, thereby preventing errors due to incomplete changes and detecting coupling that is undetectable by program analysis.
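The idea of "programmers who changed these functions also changed..." can be illustrated with a minimal co-change sketch. This is not the ROSE implementation; it only assumes that each commit is available as a list of changed entities. The function names (`mine_cochanges`, `recommend`), the thresholds, and the toy history are all hypothetical, chosen for illustration.

```python
from collections import Counter
from itertools import permutations


def mine_cochanges(transactions):
    """Count how often each ordered pair of entities changed in the same commit."""
    pair_counts = Counter()
    entity_counts = Counter()
    for changed in transactions:
        entities = set(changed)  # ignore duplicates within one commit
        entity_counts.update(entities)
        pair_counts.update(permutations(entities, 2))
    return pair_counts, entity_counts


def recommend(entity, pair_counts, entity_counts, min_support=2, min_confidence=0.5):
    """Suggest entities that historically changed together with `entity`.

    support    = number of commits in which both entities changed
    confidence = support / number of commits in which `entity` changed
    The default thresholds are illustrative assumptions.
    """
    suggestions = []
    for (a, b), support in pair_counts.items():
        if a != entity or support < min_support:
            continue
        confidence = support / entity_counts[a]
        if confidence >= min_confidence:
            suggestions.append((b, support, confidence))
    # Highest confidence first, then highest support, then name for stability.
    return sorted(suggestions, key=lambda s: (-s[2], -s[1], s[0]))


# Hypothetical change history: one list of changed functions per commit.
history = [
    ["parse()", "lex()"],
    ["parse()", "lex()", "emit()"],
    ["parse()", "lex()"],
    ["emit()", "optimize()"],
    ["parse()", "emit()"],
]

pairs, totals = mine_cochanges(history)
result = recommend("parse()", pairs, totals)
for name, support, confidence in result:
    print(f"{name}: support={support}, confidence={confidence:.2f}")
```

In this toy history, a developer who edits `parse()` would be pointed first to `lex()`, which changed alongside it in three of four commits, and then to `emit()`. Neither suggestion depends on analyzing the code itself, which is how history can reveal coupling that program analysis misses.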

2.2 Assisting Newcomers in Distributed Software Development

Software developers who join an existing team must come up to speed on a large and varied amount of information before becoming productive. Sociological and technical difficulties, such as a lack of informal encounters, can make it harder for new members of noncollocated software development teams to learn from their more experienced colleagues. The paper by Čubranić et al. presents a tool named Hipikat to address this situation. Hipikat provides developers with convenient access to the group memory for a software development project that is implicitly formed by all of the artifacts, such as bug reports and mailing list postings, that are produced during the development. This project memory is built automatically with little change to existing work practices. The paper presents several studies to show the usefulness of Hipikat to newcomers engaged in software modification tasks.

2.3 Improving Static Analysis Bug Finding Techniques

Static analysis of source code has been successful in locating bugs. The simplest kind of static analysis systems are compilers that perform type checking. A step beyond compilers are tools like Lint or Metal that flag common programming errors by searching for patterns in the code. While static checkers are effective at finding bugs, they can produce a large number of false positive warnings. Therefore, the order in which the warnings produced by a static bug checker are presented may have a significant impact on its usefulness. Checkers that have their false positives scattered evenly throughout their warnings report are likely to frustrate users by making actual errors hard to find. Tools with few false positives at the top of their warnings report are usually perceived by users as more effective. While previous work on improving the ordering of warnings in reports has focused on using properties of the code that contains the flagged error, the paper by Williams and Hollingsworth uses historical change information to refine the order of warnings produced by static code checkers.

2.4 Studying the Open Source Software Development Process

Open source software development has been touted as having the capacity to compete successfully with, and perhaps in many cases to displace, traditional commercial development methods. The paper by Dinh-Trong and Bieman investigates seven hypotheses about developer participation, core team size, code ownership, productivity, and defect density in the case of the FreeBSD operating system. The paper uses a decade of source code change history and problem reports to revise the hypotheses posed in earlier studies of the Apache server and Mozilla browser. The study finds and explains a significant difference in the size of the core team from previous studies, while other hypotheses receive additional support. The study shows that software repositories can be used to replicate prior studies, to enhance their external validity, and to determine if their findings hold for other systems.

2.5 Understanding the Benefits and Characteristics of Successful Software Reuse

Software reuse can increase productivity by avoiding redevelopment, and can improve quality by incorporating components whose reliability has already been established. The paper by Selby quantifies the characteristics of software reuse in large-scale systems by mining software repositories from a NASA software development environment. His results show a relatively large amount of reuse in that environment. Among many other results, he finds that the modules reused without revision had the fewest faults and lowest fault correction effort. By contrast, modules reused with major revision had the highest fault correction effort and the highest fault isolation effort.

2.6 Studying Small Code Changes

Software systems are continuously growing in size and complexity. The paper by Purushothaman and Perry presents a study that uses historical change and defect data for a large commercial telephony system to study properties of small changes. Some of the findings are: 1) the probability that a small "one line" change introduces a fault is very low and 2) a significant number of changes to fix faults result in further faults.

Paper Selection Procedure

Based on MSR 2004 review reports and discussions during and following the workshop, authors of four papers were invited to submit revised and expanded versions of their workshop papers for this special issue. As a result, three papers were revised and submitted. In addition, an open call for papers was distributed over mailing lists and during ICSE 2004. The call for papers was posted on, among other places, the IEEE Transactions on Software Engineering's Web site. In total, 45 submissions, including the three invited submissions, underwent two rounds of full reviewing, resulting in the six papers presented in this special issue.


This issue is the result of a great deal of effort by many people who helped make it a success. We would like to thank them all. The large number of submissions to this issue, representing over 15 percent of submissions to the IEEE Transactions on Software Engineering in 2004, was reviewed promptly and professionally by a community of hard-working and devoted reviewers. We would like to thank these reviewers for their well-written reports and their constructive criticism; their effort improved the quality of the papers, as noted by the authors. We thank the authors for keeping up with our tight schedule. We are delighted that the turnaround of this issue is one of the shortest in the Transactions on Software Engineering's history: only eight months from paper submission to publication! We are grateful to the editorial board for the IEEE Transactions on Software Engineering and the editor-in-chief (John Knight) for supporting our proposal for this issue. We also appreciate the support and cooperation provided by the IEEE Computer Society Staff (Selina Norman and Suzanne Werner) during the review process.

About the Authors

Ahmed E. Hassan received the PhD degree in 2005 and MMath degree in 2001 from the School of Computer Science at the University of Waterloo in Canada. For the last four years, he has been working at Research In Motion (RIM) where he is responsible, along with his team, for the development of protocols, simulation tools, and software to ensure the scalability and reliability of RIM's global infrastructure. His research interests include mining source control data, high redundancy and availability systems, and visualization and migration of Web applications. Previously, he worked at IBM's Almaden Research Lab in San Jose and at Nortel Networks in Ottawa. He holds numerous patents in the area of wireless communications and distributed systems. Dr. Hassan, along with Ric Holt and Audris Mockus, organized MSR 2004, held at ICSE 2004. Dr. Hassan, along with Ric Holt and Stephan Diehl, organized MSR 2005, held at ICSE 2005. He is a member of the IEEE.
Audris Mockus received the BS and MS degrees in applied mathematics from Moscow Institute of Physics and Technology in 1988. In 1991, he received the MS degree and, in 1994, he received the PhD degree in statistics from Carnegie Mellon University. He conducts research on complex dynamic systems. He designs data mining methods to summarize and augment the system evolution data, interactive visualization techniques to inspect, present, and control the systems, and statistical models and optimization techniques to understand the systems. He works at the Software Technology Research Department of Avaya Labs. Previously, he worked at the Software Production Research Department of Bell Labs. He is a member of the IEEE.
Richard C. Holt is a professor at the University of Waterloo, where his research interests include visualizing software architecture. This work includes reverse engineering of legacy systems and repairing software architecture. His architectural visualizations have included Linux, Mozilla (Netscape), IBM's TOBEY code generator, and Apache. His previous work includes foundational work on deadlock, development of a number of compilers and compilation techniques, development of the first Unix clone, and authoring a dozen books on programming and operating systems. He is one of the designers of the Turing programming language. He is a member of the IEEE.
Philip M. Johnson received BS degrees in both biology and computer science from the University of Michigan in 1980, and the MS and PhD degrees in computer science from the University of Massachusetts in 1990. He is a professor of information and computer sciences at the University of Hawaii and director of the Collaborative Software Development Laboratory. He has published more than 50 papers in areas including software engineering, computer supported cooperative work, and artificial intelligence. He serves on the board of directors of several technology companies in Hawaii. He is a member of the IEEE.