True Path Rule Hierarchical Ensembles for Genome-Wide Gene Function Prediction
May/June 2011 (vol. 8 no. 3)
pp. 832-847
Giorgio Valentini, Università degli Studi di Milano, Milano
Gene function prediction is a complex computational problem, characterized by several items: the number of functional classes is large, and a gene may belong to multiple classes; functional classes are structured according to a hierarchy; classes are usually unbalanced, with more negative than positive examples; class labels can be uncertain and the annotations largely incomplete; to improve the predictions, multiple sources of data need to be properly integrated. In this contribution, we focus on the first three items, and, in particular, on the development of a new method for hierarchical genome-wide and ontology-wide gene function prediction. The proposed algorithm is inspired by the “true path rule” (TPR) that governs both the Gene Ontology and FunCat taxonomies. According to this rule, the proposed TPR ensemble method is characterized by a two-way asymmetric flow of information that traverses the graph-structured ensemble: positive predictions for a node recursively influence its ancestors, while negative predictions influence its descendants. Cross-validated results with the model organism S. cerevisiae, using seven different sources of biomolecular data, and a theoretical analysis of the TPR algorithm show the effectiveness and the drawbacks of the proposed approach.
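
The two-way flow described in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation and rests on several assumptions: the taxonomy is a tree (as in FunCat, whereas the Gene Ontology is a general graph), each node already has a local classifier score in [0, 1], the consensus score of a node is the plain average of its local score and the consensus scores of the children predicted positive, and the decision threshold is fixed at 0.5. The function name `tpr_ensemble`, the class codes, and the scores are hypothetical.

```python
from typing import Dict, List, Tuple


def tpr_ensemble(
    children: Dict[str, List[str]],
    local_prob: Dict[str, float],
    roots: List[str],
    threshold: float = 0.5,
) -> Tuple[Dict[str, float], Dict[str, bool]]:
    """Hypothetical sketch of a true-path-rule ensemble on a tree."""
    consensus: Dict[str, float] = {}
    labels: Dict[str, bool] = {}

    def bottom_up(node: str) -> float:
        # Positive flow: only children predicted positive push their
        # consensus score up into the parent's average (assumed rule).
        kid_scores = [bottom_up(c) for c in children.get(node, [])]
        positive = [p for p in kid_scores if p > threshold]
        consensus[node] = (local_prob[node] + sum(positive)) / (1 + len(positive))
        return consensus[node]

    def top_down(node: str, ancestors_positive: bool) -> None:
        # Negative flow: once a node is predicted negative, every
        # descendant is predicted negative as well.
        labels[node] = ancestors_positive and consensus[node] > threshold
        for c in children.get(node, []):
            top_down(c, labels[node])

    for root in roots:
        bottom_up(root)
        top_down(root, True)
    return consensus, labels


# Toy hierarchy with made-up class codes and scores.
toy_children = {"01": ["01.01", "01.02"], "01.01": ["01.01.03"], "02": ["02.01"]}
toy_local = {"01": 0.4, "01.01": 0.7, "01.02": 0.2, "01.01.03": 0.9,
             "02": 0.1, "02.01": 0.6}
toy_consensus, toy_labels = tpr_ensemble(toy_children, toy_local, ["01", "02"])
for code in sorted(toy_labels):
    print(code, round(toy_consensus[code], 3), toy_labels[code])
```

In this toy run, class "01" has a local score below the threshold but is lifted above it by its positive descendants, while class "02.01" has a local score above the threshold but is still predicted negative because its parent "02" is negative. A weighted average or a different threshold per class would change the balance between the two flows.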

Index Terms:
Gene function prediction, ensemble methods, hierarchical classification, Functional Catalogue (FunCat).
Giorgio Valentini, "True Path Rule Hierarchical Ensembles for Genome-Wide Gene Function Prediction," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 832-847, May-June 2011, doi:10.1109/TCBB.2010.38