Sometimes I hear people wondering whether software measurement is still a viable research area. After all, haven't we already identified enough software-related things to measure? Don't we know by this point "the" set of things to measure on a project? For me, such questions are hard to reconcile with the state of the practice.
A few years ago, on the occasion of IEEE Software's 25th anniversary, I asked Shari Lawrence Pfleeger to reflect on what 25 years of work on software measurement has wrought. Her remarkably complete and compact summary contained a tour through many of the obstacles to good measurement in software (such as the complexities involved in dealing with uncertainty, or in dealing with "soft," human-related factors) and the approaches that have helped us overcome them. And yet, she also observed that "many practitioners are still skeptical, especially for small projects."1
I can certainly attest that such skepticism is still alive and well in some contexts. Here at Fraunhofer, some of our projects support government agencies in using metrics to monitor the progress of contractors who are developing systems. In short, we aim to protect the taxpayer investment by giving the government an accurate assessment of whether the ambitious and sophisticated systems it's buying are on schedule and within budget. Although we typically start with a basic consensus on the metrics we need, enough significant variation exists among different program goals and objectives—not to mention constraints on what data can be collected, how it's reported, and how often—that no one answer works for all projects, and significant analysis is required to design new measurement regimens.
Not only is software metrics not a solved problem, but practitioners could use help from researchers on numerous questions:
• Do more cost-effective metrics exist for accurately tracking value and monitoring progress across large, complex, and geographically distributed teams?
• How can researchers improve the accuracy, fidelity, and insight associated with development metrics?
• Can the data available to developers be put to good use?
• How can researchers demonstrate that new practices provide value and improve progress in these environments?
Fortunately, recent trends in software engineering research might allow better answers to these questions. In fact, research methods have changed so much that I've been starting to think that we are on the cusp of Empirical Software Engineering (ESE) 2.0. I tried to flesh out these thoughts in a technical briefing that Tim Menzies and I gave at the 2011 International Conference on Software Engineering (ICSE 11; download the slides at http://tinyurl.com/version-two) as well as a recent keynote at the 2012 International Conference on Global Software Engineering (ICGSE 12).
Wait, So What Was ESE 1.0?
ESE is the field of research that advocates grounding decisions about the best practices to apply in thoughtful observation of how those practices actually perform. ESE research has used a rich body of investigative methods, including controlled or quasi-controlled experiments, case studies, surveys, and statistical analyses. As such, the legacy of ESE has been to make the study of software more of a true engineering field. But we're not there yet: fads come and go in software development practice, driving more decision making than we'd like until the bloom wears off the new technologies that seemed to hold such promise and folks jump aboard the next bandwagon. Significant numbers of software development projects still cost more than expected, deliver less than promised, or finish late.
In terms of impact on practice, I worry that ESE hasn't always lived up to its potential because rigorous research takes time and, in contrast, the need for results is much more immediate in our fast-changing field. As an example of the mismatch, I'll pick on some work of my own: in 2010, I was involved in a meta-analysis that systematically collected, aggregated, and analyzed the results of multiple studies in different environments regarding the practice of test-driven development (TDD).2
I'm proud of that study; I felt it came at the right time because only then did the field have access to a substantial body of knowledge about the practice. And yet many development teams were still wrestling in 2010 with the decision of whether to adopt TDD. Much of the original excitement and discussion had already happened around the release of Kent Beck's 2002 book, Test-Driven Development: By Example (Addison-Wesley). Although I felt we were able to produce very compelling research results in our analysis, I suspect that most practitioners had already moved on to the next burning problem, or the one after that, by the time our results were published.
This is hardly an atypical example of the timetable on which good research is done.
However, several recent trends give me hope that research is changing and becoming better able to grapple with the needs of the practitioner community. These trends are transformative enough in combination that I think the ESE 2.0 moniker is warranted. Let's discuss a few of them.
Recognizing Value Propositions
It shouldn't be controversial that no one-size-fits-all best software engineering practices exist: "best" only makes sense relative to a project's specific goals. The field of value-based software engineering has made this a foundational principle by looking at software practice through the lens of economic considerations, and such ideas are becoming more mainstream.
Some recent work has brought home afresh the importance of such an approach. The software engineering field has long made use of approaches such as goal, question, metric (GQM) to tie specific measurements to the technical goals they address. Recently, a collaboration between Fraunhofer organizations in the US and Germany expanded GQM to tie such technical measurement frameworks to business decisions at various levels of an organization. The resulting GQM+Strategies methodology has given us a practical approach for articulating the business context of technical decisions while also showing the level of complexity this articulation requires. In our work with companies, this approach commonly uncovers important issues that must be addressed, such as technical initiatives that can't be explained in terms of business goals managers would recognize, or management strategies that don't trace down to anything being operationalized in organizational units further down the hierarchy.
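To make the traceability idea concrete, here is a minimal sketch of the GQM structure in code. GQM and GQM+Strategies are documented methodologies, not software, so the representation, the goal, the questions, and the metric names below are all invented for illustration:

```python
# Minimal illustration of the GQM idea: every metric answers a question,
# and every question serves a goal, so each measurement can be traced
# upward. All names here are hypothetical examples.
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)  # metric names that answer it


@dataclass
class Goal:
    purpose: str
    questions: list = field(default_factory=list)


goal = Goal(
    purpose="Improve delivery predictability of Project X",
    questions=[
        Question("How accurate are our effort estimates?",
                 ["estimated vs. actual hours per task"]),
        Question("Where do schedule slips originate?",
                 ["defect-fix effort per component", "rework rate"]),
    ],
)

# Walk the tree: every metric traces back to the goal it serves.
for q in goal.questions:
    for m in q.metrics:
        print(f"{m!r} -> {q.text!r} -> {goal.purpose!r}")
```

GQM+Strategies extends this same tracing upward again, from technical goals like the one above to the business strategies they support.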
GQM+Strategies and other approaches show that, oftentimes, the business context itself can't be articulated without some support and analysis. And research in turn has benefited from leveraging this work and understanding that the technical decisions we examine can't be divorced from the business context—unless, of course, you want to build models that nobody cares about.
When I began working with software measurement, too often getting additional insight into any interesting aspects of development required developers to manually collect data. Questions such as how much time they spent on which tasks and how many defects they found or fixed required them to expend additional effort to self-report the data in question.
Today, development teams are awash in certain types of data, if only we take the time to extract them. For example, the local Jira/Bugzilla installation has tons of information about the defects found and fixed on a project, and the local timesheet system has insight into effort allocations. The challenge now is quite different: Can we make sense of all of this data? Can we pick out the pieces that actually tell us something useful about what works and what doesn't in practice and that let us make better decisions?
Empirical research has traditionally solved the question of meaning in the data by coming up with testable, insightful hypotheses a priori and then manually doing the analyses and running the statistical tests that determine whether clear answers exist in the data. While rigorous, this is one reason why results take so long to percolate out to practitioners. We should be leveraging data mining technologies to find patterns in the data that we haven't even considered. But just as manual hypothesis testing alone runs the risk of taking too long, data mining alone runs the risk of producing uninteresting, unactionable, or just plain weird results. We need a hybrid approach that combines the best of both.
Recent experiences have convinced me that such hybrid approaches are increasingly feasible. Developers have no shortage of hypotheses about which factors impact their development performance and what they should be doing to improve their outcomes. With tool support, it's now feasible to code up such hypotheses as rules about what makes for good practice, then analyze an entire source code repository and accompanying defect set to see, historically, whether breaking those rules actually diminishes team performance in some way. (The video accompanying the digital edition of this issue contains a talk I gave earlier this year describing some experiences formulating different but equally useful rules for finding and dealing with technical debt in a number of different environments.) The bottom line: when appropriately deployed, data mining technologies are putting the right tools into developers' hands to help them better understand their own work and make data-driven decisions based on their own measurement baselines.
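A toy sketch of that workflow follows: a developer hypothesis is coded as a rule, then checked against project history to see whether breaking the rule correlates with worse outcomes. The commit records, the rule, and the threshold are all invented for illustration; a real analysis would pull these from the version control system and defect tracker:

```python
# Hypothetical example of encoding a developer hypothesis as a checkable
# rule and testing it against repository history. Each record is
# (files_touched_by_commit, had_post_release_defect); all data invented.
history = [
    (2, False), (15, True), (3, False), (22, True),
    (1, False), (18, True), (4, False), (9, False),
]

# Hypothesized rule: "commits touching more than 10 files are risky."
RULE_THRESHOLD = 10


def defect_rate(records):
    """Fraction of records followed by a post-release defect."""
    return sum(d for _, d in records) / len(records) if records else 0.0


breaking = [r for r in history if r[0] > RULE_THRESHOLD]
obeying = [r for r in history if r[0] <= RULE_THRESHOLD]

print(f"defect rate when rule is broken: {defect_rate(breaking):.2f}")
print(f"defect rate when rule is obeyed: {defect_rate(obeying):.2f}")
```

In this made-up data, rule-breaking commits have a far higher defect rate, which is the kind of historical evidence that can turn a hunch into a team-specific, data-backed practice.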
Aggregating Robust Results across Studies
I'll restrain myself from saying too much about this area because I could go on for pages. Trying to abstract the take-away message for practitioners from the body of research knowledge available on various topics has consumed the attention of researchers over the past several years, myself included. IEEE Software's Voice of Evidence department has focused on highlighting exemplary work in this area.
The search for robust lessons learned is important because it can provide useful advice for developers who are considering new practices or technologies but don't yet have baselines of their own. Exemplary analyses across multiple studies certainly exist, with clear visualizations of results and actionable messages for practitioners. In many cases, however, I see attempts to aggregate research findings run aground on the fact that too little data exists: although many research publications can be found on certain topics, few if any back their claims with actual evidence or data. It's also common for a systematic review to founder when data can be found but a majority of the research papers turn out to be reporting the invention of new, slightly different approaches.
So, in this case, while advancing research methods are cause for hope, the paucity of underlying data on some topics is conversely cause for worry. I'm concerned that we see more research effort spent on developing new technologies, presumably for the sake of novelty, than on testing and comparing them. Without a shift in emphasis, such research is unlikely to have the desired impact.
Based on these trends and challenges, I've been advocating the following, and we've been putting these principles in place with IEEE Software:
• Use data and experiences to demonstrate why new proposed techniques are better than the state of the practice or to illuminate how they can be improved. Having such an experiential basis for papers is probably one of the first things that our reviewers check in technical submissions.
• Be open to appropriate uses of experience reports as well as more rigorous forms of technical communication. One of the ways to help provide useful decision support to practitioners more quickly is to share stories of our experiences about what works and what doesn't in practice.
• Important considerations still apply in describing such experiences so that they're helpful reports rather than mere anecdotes. The Insights department has been both leading the way on this and providing mentoring.
• Treat replication as a first-class goal and repeatability as an important quality check. Research papers shouldn't be turned away for providing more data that help improve our understanding of important practices and how they work across different environments. Let's not overemphasize novelty at the expense of practical results.
And who knows? If we extrapolate these trends forward and imagine improved methods for gathering and making sense of data about software development, we could envision a world in which developers work while data mining agents run in the background, providing useful insights in real time about what's working, what they should do differently, and when the team is moving into uncharted territory where our current understanding doesn't provide much help. While I don't see computers ever developing software on their own, I can certainly imagine a better partnership that gives developers the tools to deal with the ever-increasing complexity of our systems and improves our confidence in the software that underlies our world.
is a division director at the Fraunhofer Center for Experimental Software Engineering in Maryland, a nonprofit research and tech transfer organization, where he leads the Measurement and Knowledge Management Division. He is an adjunct professor at the University of Maryland College Park and editor in chief of IEEE Software. Contact him at firstname.lastname@example.org.