Issue No. 6 - Nov.-Dec. 2012 (vol. 9)
pp: 1663-1675
O. Irsoy , Dept. of Comput. Eng., Bogazici Univ., Istanbul, Turkey
O. T. Yildiz , Dept. of Comput. Eng., Isik Univ., Istanbul, Turkey
E. Alpaydin , Dept. of Comput. Eng., Bogazici Univ., Istanbul, Turkey
In many bioinformatics applications, it is important to assess and compare the performance of algorithms trained from data, so that the conclusions drawn are unaffected by chance and are therefore significant. Both the design of such experiments and the analysis of the resulting data using statistical tests should be done carefully for the results to carry significance. In this paper, we first review the performance measures used in classification, the basics of experiment design, and statistical tests. We then give the results of our survey of more than 1,500 papers published in the last two years in three bioinformatics journals (including this one). Although the basics of experiment design are well understood, such as resampling instead of using a single training set and the use of different performance metrics instead of error, only 21 percent of the papers use any statistical test for comparison. In the third part, we analyze four different scenarios that we encounter frequently in the bioinformatics literature, discussing the proper statistical methodology as well as showing an example case study for each. With the supplementary software, we hope that the guidelines we discuss will play an important role in future studies.
Bioinformatics, algorithm design and analysis, measurement, approximation algorithms, computational biology, model selection, statistical tests, classification
O. Irsoy, O. T. Yildiz, E. Alpaydin, "Design and Analysis of Classifier Learning Experiments in Bioinformatics: Survey and Case Studies", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.9, no. 6, pp. 1663-1675, Nov.-Dec. 2012, doi:10.1109/TCBB.2012.117
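To make the abstract's point about statistical comparison concrete, the sketch below applies a paired t-test to per-fold error rates of two classifiers under k-fold cross-validation. This is a minimal illustration, not the paper's own methodology: the error values are hypothetical, and the paper discusses more appropriate tests (e.g., the 5x2 cv F test) because the naive paired t-test over cross-validation folds can have inflated Type-I error.

```python
import math

def paired_t_statistic(errs_a, errs_b):
    """Paired t statistic over per-fold error differences of two classifiers."""
    diffs = [a - b for a, b in zip(errs_a, errs_b)]
    k = len(diffs)
    mean = sum(diffs) / k
    # Unbiased sample variance of the differences.
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt(var / k)

# Hypothetical per-fold error rates over 10-fold cross-validation.
errors_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.13, 0.14]
errors_b = [0.10, 0.13, 0.10, 0.12, 0.11, 0.14, 0.11, 0.13, 0.12, 0.12]

t = paired_t_statistic(errors_a, errors_b)
# Two-sided critical value t_{0.025, 9} is about 2.262 at alpha = 0.05;
# |t| beyond it rejects the null hypothesis of equal expected error.
significant = abs(t) > 2.262
```

Note that because the training sets of the k folds overlap, the fold errors are not independent; resampling schemes such as 5x2 cross-validation were proposed precisely to correct for this dependence.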