Issue No.01 - January/February (2009 vol.11)
pp: 28-34
Roger D. Peng, Johns Hopkins Bloomberg School of Public Health
The ability to make scientific findings reproducible is increasingly important in areas where substantive results are the product of complex statistical computations. Reproducibility can allow others to verify the published findings and conduct alternate analyses of the same data. A question that arises naturally is how to conduct and distribute reproducible research. The authors describe a simple framework in which reproducible research can be conducted and distributed via cached computations and tools for both authors and readers. As a prototype implementation they also describe a software package written in the R language. The "cacher" package provides tools for caching computational results in a key-value style database, which can be published to a public repository for readers to download. As a case study, they demonstrate the use of the package on a study of ambient air pollution exposure and mortality in the US.
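The idea of caching computational results in a key-value style database can be illustrated with a minimal sketch. The code below is not the "cacher" package or its API (which is written in R); it is a hypothetical Python toy, and the class name `ExpressionCache`, its methods, and the choice of SHA-1 over the expression text plus its inputs as the key are all assumptions made for illustration.

```python
import hashlib
import pickle


class ExpressionCache:
    """Toy key-value cache for evaluated expressions (illustrative only)."""

    def __init__(self, store=None):
        # A dict stands in for the key-value database: key -> pickled result.
        self.store = {} if store is None else store
        # Counts how many expressions were actually computed (cache misses).
        self.evaluations = 0

    @staticmethod
    def key_for(source, env):
        # Assumption: the key is a digest of the expression text and its inputs.
        payload = source.encode("utf-8") + pickle.dumps(sorted(env.items()))
        return hashlib.sha1(payload).hexdigest()

    def evaluate(self, source, env):
        key = self.key_for(source, env)
        if key not in self.store:
            # Cache miss: compute the result and store it under its key.
            self.evaluations += 1
            self.store[key] = pickle.dumps(eval(source, {}, dict(env)))
        # Hit or miss, the returned value is read back from the store.
        return pickle.loads(self.store[key])


# Author side: run the analysis once, filling the cache.
data = {"x": [1.0, 2.0, 3.0, 4.0]}
author = ExpressionCache()
first = author.evaluate("sum(x) / len(x)", data)
second = author.evaluate("sum(x) / len(x)", data)  # served from the cache

# Reader side: a copy of the published store yields the result
# without re-running the computation.
reader = ExpressionCache(store=dict(author.store))
reader_value = reader.evaluate("sum(x) / len(x)", data)
```

In this sketch the author's copy of the store plays the role of the published repository: a reader who obtains it can inspect the stored result for an expression without recomputing it, and can still re-evaluate the expression to verify it if desired.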
Reproducible research, database, software
Roger D. Peng, "Distributed Reproducible Research Using Cached Computations", Computing in Science & Engineering, vol. 11, no. 1, pp. 28-34, January/February 2009, doi:10.1109/MCSE.2009.6