In this special issue of IEEE Software, we invited submissions that reflected the benefits (and drawbacks) of software analytics. The response was overwhelming. Software analytics is an area of explosive growth, and we had so many excellent submissions that we had to split this special issue into two volumes—you'll see even more content in the September/October issue. We divided the articles on conceptual grounds, so both volumes will feature equally excellent work.
To better frame these articles, we offer some definitions and historical perspectives on software analytics. Specifically, we describe where the field was, where it is, and where it might be going.
What Is Software Analytics?
Thanks to the Internet and open source, there's now so much data about software projects that it's impossible to manually browse through it all:
• As of late 2012, our Web searches show that Mozilla Firefox had 800,000 bug reports, and platforms such as Sourceforge.net and GitHub hosted 324,000 and 11.2 million projects, respectively.
• The PROMISE repository of software engineering data has grown to more than 100 projects and is just one of more than a dozen open source repositories that are readily available to industrial practitioners and researchers; see Table 1 for more.
Table 1. Repositories of software engineering data.
To handle this data, many practitioners and researchers have turned to analytics—that is, the use of analysis, data, and systematic reasoning for making decisions. 1
We can define software analytics as follows: "Software analytics is analytics on software data for managers and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions." 2
Such insights include, but might not be limited to, actionable advice on how to improve software projects. Due to the volume of data, finding these insights typically requires some degree of automation, usually combined with human involvement. As the "Applications of Software Analytics" sidebar shows, augmenting analytics with automation has led to impressive results in a wide range of application areas.
Analytics Must Be Real Time and Actionable
An important feature of software analytics is that it includes actionable advice. For example, if the manager is driving toward a cliff, we don't want to distract her with analytics telling her about the clouds in the sky or the flowers on the side of the road. Instead, we want our smart analytics to shout in her ear, "There's a cliff up ahead! Turn left immediately!"
What happens next is up to the manager (who might elect to ignore this advice and make some decision based on her own gut instincts; http://newsroom.accenture.com/article_display.cfm?article_id=4777). But we know from our own experience that if we don't deliver relevant advice, then there's little hope that a manager will use any of our analytics.
In practice, actionable analytics means that those analytics must be available in real time—faster than the rate of change of effects within a project. Decisions should be based on recent, not outdated, data. This is an important point to make because traditional data collection and analysis techniques might be too slow. For example, at an International Conference on Software Engineering, Forrest Shull, the editor in chief of this magazine, told his audience in a tutorial that the current pace of manual methods in empirical software engineering might not keep up with the fast pace of modern agile software practices. 3
As to what constitutes real time, that's a domain-dependent issue. For example, for stock trading, real time might mean microseconds, whereas for process change in an organization, it might mean before the December merit awards. For more details on this real-time approach to analytics, see "Leveraging the Crowd: How 48,000 Users Helped Improve Lync Performance" by Robert Musson and his colleagues in this special issue.
Analytics Means Sharing Information
To a large extent, analytics is about what software projects can learn from themselves and each other. Looking back at decades of research in this area, we see claims that software analytics let us share different things between projects:
• sharing models (one of the early models was proposed by Fumio Akiyama and says that we should expect more than a dozen bugs per KLOC; see the sidebar "Early 'Global' Models and Software Analytics"),
• sharing insights (for example, Christian Bird and his colleagues found that in the case of Windows Vista, it's possible to build high-quality software using distributed teams, just as long as the management is structured around code functionality—who works on what—and not merely on geography—who sits where 4 ),
• sharing data (as in Table 1's list of repositories), and
• sharing methods (the techniques by which we can convert local data into local models).
The ordering of the items in this list is very significant. We've seen a major change in the nature of software analytics. In the past, most research focused on the start of this list and searched for general models; today, it focuses on the end of this list (sharing general methods) to enable the discovery of local lessons.
Forty years ago, the goal was to find and share "the" model of software development, which assumed a single model existed that was global to all software projects. But more recently, the goal of analytics has changed—the research community has accepted that lessons learned from one project can't always be applied verbatim to another. For example, as one of us (Zimmermann) reported at the 2009 International Symposium on the Foundations of Software Engineering, a model that predicts defects in Internet Explorer may not be able to predict defects in Firefox.
Rather than searching for global models, researchers now focus on local methods. This shift was caused by two developments. In the late 20th century, a whole new generation of data mining algorithms became available, along with large amounts of data from open source projects. Consequently, more researchers applied more mining algorithms to more data. 5 A side effect of all that work was the growing realization that one global model wouldn't cover all software projects.
The ability to quickly reason about large amounts of data from a particular site has become vitally important. We used to expect that we could share models, but if models are project-specific, then reusing your model on our project could lead to inappropriate and costly management decisions. It has become more important today to discuss and document the methods by which a data scientist might turn local data into relevant and useful local models (a catalog of such methods appears at http://research.microsoft.com/en-us/events/dapse2013).
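To make the risk of reusing another project's model concrete, here is a minimal sketch in Python. It fits a straight line (defects as a function of size) to one project and then applies it to another. The two data sets are synthetic and invented purely for illustration, and the helper names fit_line and mean_abs_error are hypothetical; nothing here reproduces a published model or result.

```python
def fit_line(sizes, defects):
    """Ordinary least-squares fit of defects = slope * size + intercept."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(defects) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, defects))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, sizes, defects):
    """Average absolute gap between predicted and observed defect counts."""
    slope, intercept = model
    gaps = [abs(slope * x + intercept - y) for x, y in zip(sizes, defects)]
    return sum(gaps) / len(gaps)


# Synthetic (made-up) module sizes in KLOC and defect counts for two projects.
project_a = ([1, 2, 3, 4, 5], [12, 22, 35, 41, 52])
project_b = ([1, 2, 3, 4, 5], [3, 5, 6, 9, 10])

model_a = fit_line(*project_a)  # learned from project A only
model_b = fit_line(*project_b)  # learned locally from project B

# Compare a borrowed model against a local one on project B's own data.
borrowed_error = mean_abs_error(model_a, *project_b)
local_error = mean_abs_error(model_b, *project_b)
print(f"borrowed model error on B: {borrowed_error:.2f}")
print(f"local model error on B:    {local_error:.2f}")
```

With data shaped like this, the borrowed model misses badly while the local one fits closely. What transfers between the projects is the method (fit_line), not the fitted model.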
General Principles for Analytics and the Need for Skilled People
Any field needs some general principles to guide
• novices in their journey from beginner to expert,
• managers when they assess potential hires,
• academics when they design subjects or degrees,
• professional bodies when they design accreditation programs,
• the identification of gaps in current techniques, and
• the design and implementation of new and better techniques.
The sidebar "Principles for Software Analytics" offers some tips on how to do this. Note that they're more concerned with usage patterns and interactions with the consumer of data insights than with particular algorithms.
In the era of Google-style inference and cloud computing, it's a common belief that a company can analyze large amounts of data merely by building (or renting) a CPU farm, then running some distributed algorithms, perhaps using Hadoop (http://hadoop.apache.org) or some other distributed inference mechanism.
This isn't the case. In our experience, while having many CPUs is (sometimes) useful, the factors that determine successful software analytics rarely include the hardware. More important than the hardware is how that hardware is used by skilled data scientists.
Another misconception we often see relates to the role of software. Some managers think that if they acquire the right software tools—Weka, Matlab, and so on—then all their analytical problems will be instantly solved.
Nothing could be further from the truth. All the standard data analysis toolkits come with built-in assumptions that might be suitable for some domains but not for others. Hence, a premature commitment to particular automatic analysis tools can be counterproductive. When it isn't clear what the important factors in a domain are, a data scientist must make many ad hoc queries to clarify the issues in that domain. Subsequently, once the analysis method stabilizes, it becomes possible to build tools to automate the routine and repeated analysis tasks.
In practice, most analytics projects mature by moving up along the curve in Figure 1. Initially, we might start in the blue region and then move into the orange region. This diagram leads to one of our favorite mantras for software analytics: for new or infrequent problems, deploy the data scientists before deploying tools.
Figure 1. The frequency of analytics questions. A small number of questions are very frequent and therefore should be supported by tools (orange region), whereas the long tail of questions that are more unique and asked less frequently should be addressed by data scientists (blue). As the analytics domain matures, we expect the orange area to grow because tools will become more powerful and cheaper to develop.
Different and Distinct Kinds of Analytics
Expert data scientists know that they must choose their methods to best match the distinctive features of their particular kind of analytics. As this field matures, we'll see more and more recognition of distinct subtypes of analytics, each requiring different tools and techniques. Here are the distinctions we can see within current work on software analytics—it's hardly a complete set, and it will certainly change over time. However, we make this list to make the point that "analytics" is a broad area with many exciting and challenging issues and room for development.
The first important distinction is between internal and external analytics, both of which have significant implications. In this context, one issue is live versus stale data: while the internal team can access current data, there are many practical challenges associated with shipping a copy of the data outside the organization's firewalls to an external team. A second issue is privacy. If an external team wants to access the data, it might have to perform some anonymization of the information. This can be a problem because altering data even to ensure privacy can damage the signal in that data. Recently, we've had some success with an instance-based privacy algorithm that clears the space around each example, then jumps the remaining data some small distance into that cleared space. The resulting data contains none of the original individuals, but preserves the hyperspace boundaries between conclusions. 6
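The privacy idea described above can be loosely sketched in Python. This is an illustration under stated assumptions, not the published algorithm. The function names (thin, morph, privatize), the radius parameter, the Euclidean distance, and the 0.15 to 0.35 step range are all choices made for this sketch. It first thins the data so that no two kept examples of the same class sit closer than radius, then nudges each survivor a random fraction of the way toward or away from its nearest example of a different class. Because the fraction is below one half, each moved example stays closer to its starting point than to that rival, which is one simple way of keeping it on its own side of the class boundary.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def thin(rows, radius):
    """Keep a row only if no kept row of the same class lies within radius."""
    kept = []
    for features, label in rows:
        same_class = [k for k, kept_label in kept if kept_label == label]
        if all(distance(features, k) > radius for k in same_class):
            kept.append((features, label))
    return kept


def nearest_unlike(features, label, rows):
    """Closest row with a different class (assumes at least two classes)."""
    rivals = [f for f, other in rows if other != label]
    return min(rivals, key=lambda f: distance(features, f))


def morph(rows, rng, low=0.15, high=0.35):
    """Move each row a small random step toward or away from its rival."""
    moved = []
    for features, label in rows:
        rival = nearest_unlike(features, label, rows)
        step = rng.uniform(low, high) * rng.choice((-1, 1))
        new = tuple(x + step * (r - x) for x, r in zip(features, rival))
        moved.append((new, label))
    return moved


def privatize(rows, radius, seed=0):
    """Thin the data, then morph what remains."""
    return morph(thin(rows, radius), random.Random(seed))


# Made-up two-feature examples with hypothetical class labels.
rows = [
    ((1.0, 1.0), "clean"), ((1.1, 1.0), "clean"), ((2.0, 2.0), "clean"),
    ((5.0, 5.0), "buggy"), ((5.1, 5.0), "buggy"), ((6.0, 6.0), "buggy"),
]
for features, label in privatize(rows, radius=0.5):
    print(label, tuple(round(x, 2) for x in features))
```

Two caveats on this sketch: if two examples from different classes have identical features, the step has zero length and that example is not moved, and a real deployment would need a careful analysis of how much an attacker could still infer from the morphed data.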
Another important distinction is between quantitative and qualitative methods. Quantitative methods include the traditional automatable tasks performed by statistical packages and data mining tools, 7, 8 whereas qualitative methods are typically more manual and require extensive user interaction. A common myth about qualitative methods is that they're less rigorous and somehow less useful than quantitative methods, but we haven't found this to be the case. See "Developer Dashboards: The Need for Qualitative Analytics" by Olga Baysal, Reid Holmes, and Michael Godfrey in this issue for an excellent example.
Yet another important distinction is between data mining tools and interactive tools. Data mining tools are typically automatic and run to produce one conclusion, whereas interactive tools, such as visualizations, give users more control over the output and the ability to generate ad hoc on-the-fly reports on any aspect of the data. Much current software analytics work is focused on data mining, but this issue features some exceptions to this rule, which can be found in both the roundtable discussion and "Looking under the Lamppost for Useful Software Analytics" by Philip Johnson.
It's also important to distinguish the audience for analytics results, which could include developers, testers, development leads, test leads, managers, and researchers, all of whom have different analysis and data needs. 9
Finally, we wish to distinguish exploratory versus deployment analytics. In an exploratory analytical study, it's often unclear how to add value to the business using analytics. Hence, in this stage, a data scientist's work is often preliminary and perhaps includes many dead ends. If the exploratory results are successful, there might be a business case for moving to deployment analytics. In this second phase, the goals are better understood, and the team (which might be much larger than the preliminary analytics team) works to integrate the analysis method into the organization's information systems. If the preliminary analysis takes weeks to months, deployment analytics can take months to years. For an example of the cost and benefits of deployment analytics, see "CODEMINE: Building a Software Development Data Analytics Platform at Microsoft" by Jacek Czerwonka and his colleagues in this special issue.
This is an exciting time for those of us involved in data science and software analytics. Looking into the very near future, we can only predict more use of analytics. By 2020, we would predict
• more and different data,
• more algorithms,
• faster decision making with the availability of more data and faster release cycles,
• more people involved in analytics as it becomes more routine to mine data,
• more education as more people analyze and work with data,
• more roles for data scientists and developers as this field matures with specialized subareas,
• more real-time analytics to address the challenges of quickly finding patterns in big data,
• more analytics for software systems such as mobile apps and games, and
• more impact of social tools in analytics.
As an example of this last point, check out "Human Boosting" by Harsh Pareek and Pradeep Ravikumar, which discusses how to boost human learning with the help of data miners. 10 In the very near future, this kind of human(s)-in-the-loop analytics will become much more prevalent.
The work of such a special issue falls mostly on the authors and reviewers, and we're very appreciative of all those who took the time to write and comment on these articles. The work of the reviewers was particularly challenging because their feedback was required on a very condensed timetable. Accordingly, we offer them our heartfelt thanks. We're also grateful to the IEEE Software production team for their hard work in assembling and editing all these articles.
Tim Menzies is a full professor in computer science at West Virginia University. His research focuses on combining carbon and silicon intelligence to produce smarter communities. He's an associate editor of IEEE Transactions on Software Engineering, the Automated Software Engineering Journal, and the Empirical Software Engineering Journal. Menzies received a PhD in artificial intelligence from the University of New South Wales. Contact him at email@example.com or via http://menzies.us.
Thomas Zimmermann is a researcher in the Empirical Software Engineering Group at Microsoft Research, adjunct assistant professor at the University of Calgary, and affiliate faculty at the University of Washington. His research interests include empirical software engineering, mining software repositories, development tools, social networking, and games analytics. He's an associate editor of IEEE Software and the Empirical Software Engineering Journal. Zimmermann received a PhD from Saarland University. Contact him at firstname.lastname@example.org or via http://thomas-zimmermann.com.