Preliminary Guidelines for Empirical Research in Software Engineering
August 2002 (vol. 28 no. 8)
pp. 721-734

Abstract—Empirical software engineering research needs research guidelines to improve the research and reporting processes. We propose a preliminary set of research guidelines aimed at stimulating discussion among software researchers. They are based on a review of research guidelines developed for medical researchers and on our own experience in doing and reviewing software engineering research. The guidelines are intended to assist researchers, reviewers, and meta-analysts in designing, conducting, and evaluating empirical studies. Editorial boards of software engineering journals may wish to use our recommendations as a basis for developing guidelines for reviewers and for framing policies for dealing with the design, data collection, and analysis and reporting of empirical studies.

Index Terms:
Empirical software research, research guidelines, statistical mistakes.
Barbara A. Kitchenham, Shari Lawrence Pfleeger, Lesley M. Pickard, Peter W. Jones, David C. Hoaglin, Khaled El Emam, Jarrett Rosenberg, "Preliminary Guidelines for Empirical Research in Software Engineering," IEEE Transactions on Software Engineering, vol. 28, no. 8, pp. 721-734, Aug. 2002, doi:10.1109/TSE.2002.1027796