Issue No.03 - July-September (2010 vol.7)
pp: 563-571
Arthur Zimek, Ludwig-Maximilians-Universitaet Muenchen and Forschungseinheit fuer Datenbanksysteme, Muenchen
Fabian Buchwald, Technische Universitaet Muenchen, Muenchen
Eibe Frank, University of Waikato, Hamilton
Stefan Kramer, Technische Universitaet Muenchen, Muenchen
Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article, we investigate empirically whether this is the case for two such hierarchies. We compare multiclass classification techniques that exploit the information in those class hierarchies with techniques that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. Ensembles of nested dichotomies have been shown to deliver strong classification performance in multiclass settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data but not in the case of the protein classification problems. Based on this, we recommend that strong flat multiclass methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area.
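For readers unfamiliar with the nested dichotomies mentioned in the abstract, the following is a minimal illustrative sketch, not the authors' implementation. A nested dichotomy is a binary tree that recursively splits the set of classes in two; a binary base learner at each internal node estimates the probability of the left branch, a class probability is the product of the branch probabilities along the path to its leaf, and an ensemble averages these estimates over several trees. The names `Node`, `random_dichotomy`, `class_probabilities`, `ensemble_probabilities`, and `toy_branch_model` are hypothetical, the random split used here is a simplification of the sampling scheme, and the toy branch model stands in for a trained base learner such as logistic regression.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence

# Hypothetical base-learner interface: given the class subsets on the left and
# right of a split and an instance x, return P(left subset | x, left or right).
BranchModel = Callable[[FrozenSet[str], FrozenSet[str], object], float]


@dataclass
class Node:
    classes: FrozenSet[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def random_dichotomy(classes: Sequence[str], rng: random.Random) -> Node:
    """Recursively split the class set into two random non-empty subsets.

    Simplification: this is not guaranteed to sample uniformly from the space
    of all nested dichotomies.
    """
    ordered = sorted(classes)
    node = Node(frozenset(ordered))
    if len(ordered) > 1:
        rng.shuffle(ordered)
        cut = rng.randint(1, len(ordered) - 1)  # both sides stay non-empty
        node.left = random_dichotomy(ordered[:cut], rng)
        node.right = random_dichotomy(ordered[cut:], rng)
    return node


def class_probabilities(node: Node, x: object, model: BranchModel) -> Dict[str, float]:
    """Multiply branch probabilities along each root-to-leaf path."""
    if node.left is None or node.right is None:
        (label,) = node.classes  # a leaf holds exactly one class
        return {label: 1.0}
    p_left = model(node.left.classes, node.right.classes, x)
    probs = {c: p_left * p
             for c, p in class_probabilities(node.left, x, model).items()}
    probs.update({c: (1.0 - p_left) * p
                  for c, p in class_probabilities(node.right, x, model).items()})
    return probs


def ensemble_probabilities(trees: List[Node], x: object, model: BranchModel) -> Dict[str, float]:
    """Average the class probability estimates of several nested dichotomies."""
    totals: Dict[str, float] = {}
    for tree in trees:
        for c, p in class_probabilities(tree, x, model).items():
            totals[c] = totals.get(c, 0.0) + p
    return {c: total / len(trees) for c, total in totals.items()}


# Toy stand-in for a trained binary classifier: fixed per-class scores,
# ignoring the instance x entirely (an assumption for illustration only).
SCORES = {"a": 2.0, "b": 1.0, "c": 0.5, "d": 0.5}


def toy_branch_model(left: FrozenSet[str], right: FrozenSet[str], x: object) -> float:
    s_left = sum(SCORES[c] for c in left)
    s_right = sum(SCORES[c] for c in right)
    return s_left / (s_left + s_right)


rng = random.Random(0)
trees = [random_dichotomy(list(SCORES), rng) for _ in range(5)]
probs = ensemble_probabilities(trees, None, toy_branch_model)
print(probs)
```

In this sketch the flat variant draws splits without regard to any class hierarchy. A hierarchical variant would, under the same assumptions, restrict the splits so that classes sharing a parent in the expert-defined hierarchy are kept together for as long as possible.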
Protein classification, hierarchical classification, multiclass classification.
Arthur Zimek, Fabian Buchwald, Eibe Frank, Stefan Kramer, "A Study of Hierarchical and Flat Classification of Proteins," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 563-571, July-September 2010, doi:10.1109/TCBB.2008.104