Issue No.01 - January-March (2010 vol.7)
pp: 172-182
Alex A. Freitas, University of Kent, Canterbury
Daniela C. Wieser, European Bioinformatics Institute, Cambridge
Rolf Apweiler, European Bioinformatics Institute, Cambridge
ABSTRACT
The literature on protein function prediction is currently dominated by works aimed at maximizing predictive accuracy, ignoring the important issues of validation and interpretation of discovered knowledge, which can lead to new insights and hypotheses that are biologically meaningful and advance the understanding of protein functions by biologists. The overall goal of this paper is to critically evaluate this approach, offering a refreshing new perspective on this issue, focusing not only on predictive accuracy but also on the comprehensibility of the induced protein function prediction models. More specifically, this paper aims to offer two main contributions to the area of protein function prediction. First, it presents the case for discovering comprehensible protein function prediction models from data, discussing in detail the advantages of such models, namely, increasing the confidence of the biologist in the system's predictions, leading to new insights about the data and the formulation of new biological hypotheses, and detecting errors in the data. Second, it presents a critical review of the pros and cons of several different knowledge representations that can be used in order to support the discovery of comprehensible protein function prediction models.
INDEX TERMS
Biology, classifier design and evaluation, induction, machine learning.
CITATION
Alex A. Freitas, Daniela C. Wieser, Rolf Apweiler, "On the Importance of Comprehensible Classification Models for Protein Function Prediction", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 1, pp. 172-182, January-March 2010, doi:10.1109/TCBB.2008.47