Issue No. 05 - October (1996 vol. 11)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/64.539008
<p>Evaluating the effectiveness of research ideas is an important aspect of any work. However, experimental studies frequently do not receive the attention they deserve. Consequently, evaluations of systems are not as compelling as they could or should be. Evaluating a system is not simply a matter of running the system and collecting a few numbers. Rather, a useful evaluation requires first determining exactly what you want to show—your hypothesis. Then, you have to develop one or more experiments to test this hypothesis. Finally, you need to carefully evaluate the results of your experiments to determine whether the data they produced support the hypothesis.</p><p>This installment of "Trends & Controversies" looks closely at the ingredients of a compelling empirical evaluation. First, Kevin Knight presents his thoughts on some of the advantages and disadvantages of "bake-offs," or evaluations in which a number of researchers compare their work on a set of shared test problems. In recent years, this type of evaluation has found widespread use in large natural-language and machine-translation projects, with which Kevin has a great deal of first-hand experience. Next, Pat Langley, who has long and successfully advocated the use of empirical evaluation in machine-learning research, describes how to design a set of compelling experiments through the use of both synthetic and real-world domains. Finally, Paul Cohen, who recently wrote <it>Empirical Methods for Artificial Intelligence</it>, identifies some of the common pitfalls in analyzing the results of experiments and describes ways to avoid them.</p>
P. R. Cohen, P. Langley and K. Knight, "What makes a compelling empirical evaluation?," in IEEE Intelligent Systems, vol. 11, no. 5, pp. 10-14, Oct. 1996.