Issue No.03 - May/June (2011 vol.8)
pp: 748-759
Qian Xu , Hong Kong University of Science and Technology, Clearwater Bay, Kowloon
Sinno Jialin Pan , Hong Kong University of Science and Technology, Clearwater Bay, Kowloon
Hannah Hong Xue , Hong Kong University of Science and Technology, Clearwater Bay, Kowloon
Qiang Yang , Hong Kong University of Science and Technology, Clearwater Bay, Kowloon
ABSTRACT
Protein subcellular localization is the problem of predicting, by computational methods, where in a cell a protein resides. The location can indicate key protein functions, so accurate prediction supports protein function prediction, genome annotation, and the identification of drug targets. Machine learning methods such as Support Vector Machines (SVMs) have been applied to this problem, but they have been shown to suffer from a lack of annotated training data for each species under study. To overcome this data sparsity problem, we observe that some organisms are related to one another, so commonalities across organisms may be discovered and used to supplement the data available for each localization task. In this paper, we formulate the protein subcellular localization problem as one of multitask learning across different organisms. We adapt and compare two specializations of multitask learning algorithms on 20 different organisms. Our experimental results show that multitask learning performs much better than traditional single-task methods. Among the multitask learning methods, we found that multitask kernels and supertype kernels, which share parameters, perform slightly better than multitask learning that shares latent features. The most significant improvement in localization accuracy is about 25 percent. We find that if the organisms are very different or only remotely related from a biological point of view, jointly training the multiple models does not lead to significant improvement. However, if they are closely related biologically, multitask learning can do much better than individual learning.
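To make the parameter-sharing idea concrete, the following is a minimal sketch of one common multitask kernel, in the style of regularized multitask learning: each organism is treated as a task, and the similarity between two proteins is a shared component plus an extra component counted only when both proteins come from the same organism. The abstract does not give the exact kernel used in the paper, so the kernel form, the linear base kernel, the parameter `mu`, the function names, and the organism labels below are all illustrative assumptions.

```python
# Sketch of a multitask kernel over (feature vector, task) pairs.
# Assumption: K((x, s), (z, t)) = (1/mu + [s == t]) * k(x, z),
# with a linear base kernel k. This is one standard construction,
# not necessarily the kernel used in the paper.

def linear_kernel(x, z):
    """Base kernel: plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def multitask_kernel(x, s, z, t, mu=1.0):
    """Similarity between sample x from task s and sample z from task t.

    1/mu weights the component shared by all tasks; a smaller mu
    couples the tasks more strongly. Samples from the same task get
    an additional task-specific contribution.
    """
    shared = 1.0 / mu
    task_specific = 1.0 if s == t else 0.0
    return (shared + task_specific) * linear_kernel(x, z)


def gram_matrix(samples, mu=1.0):
    """Gram matrix over a list of (feature_vector, task_label) pairs."""
    return [
        [multitask_kernel(xi, si, xj, sj, mu) for (xj, sj) in samples]
        for (xi, si) in samples
    ]


# Hypothetical toy data: two-dimensional features, two organisms.
samples = [
    ([1.0, 0.0], "organism_a"),
    ([1.0, 1.0], "organism_a"),
    ([1.0, 1.0], "organism_b"),
]
K = gram_matrix(samples, mu=2.0)
```

A Gram matrix built this way could be passed to any kernel classifier that accepts precomputed kernels, so that one model is trained jointly on the pooled data from all organisms while still distinguishing between them.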
INDEX TERMS
Protein subcellular localization; multitask learning.
CITATION
Qian Xu, Sinno Jialin Pan, Hannah Hong Xue, Qiang Yang, "Multitask Learning for Protein Subcellular Location Prediction", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 3, pp. 748-759, May/June 2011, doi:10.1109/TCBB.2010.22