Discriminative Motif Finding for Predicting Protein Subcellular Localization
March/April 2011 (vol. 8 no. 2)
pp. 441-451
Tien-ho Lin, Carnegie Mellon University, Pittsburgh
Robert F. Murphy, Carnegie Mellon University, Pittsburgh
Ziv Bar-Joseph, Carnegie Mellon University, Pittsburgh
Many methods have been described to predict the subcellular location of proteins from sequence information. However, most of these methods either rely on global sequence properties or use a set of known protein targeting motifs to predict protein localization. Here, we develop and test a novel method that identifies potential targeting motifs using a discriminative approach based on hidden Markov models (discriminative HMMs). These models search for motifs that are present in a compartment but absent from other, nearby compartments by utilizing a hierarchical structure that mimics the protein sorting mechanism. We show that both discriminative motif finding and the hierarchical structure improve localization prediction on a benchmark data set of yeast proteins. The motifs identified can be mapped to known targeting motifs, and they are more conserved than the average protein sequence. Using our motif-based predictions, we can identify potential annotation errors in public databases for the location of some of the proteins. A software implementation and the data set described in this paper are available from
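
The discriminative idea in the abstract (prefer motifs that are present in one compartment but absent in others) can be illustrated with a much simpler stand-in than the paper's discriminative HMMs. The sketch below is not the authors' method: it only ranks fixed-length k-mers by the difference between the fraction of target-compartment sequences that contain them and the fraction of other sequences that do. The function names, the scoring rule, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_kmers(seqs_by_compartment, target, k=3, top=5):
    """Rank k-mers by (fraction of target sequences containing the k-mer)
    minus (fraction of all other sequences containing it).

    A hypothetical, simplified score; the paper trains HMMs instead.
    """
    pos = seqs_by_compartment[target]
    neg = [s for comp, seqs in seqs_by_compartment.items()
           if comp != target for s in seqs]

    pos_counts = defaultdict(int)
    neg_counts = defaultdict(int)
    for s in pos:
        for m in kmers(s, k):
            pos_counts[m] += 1
    for s in neg:
        for m in kmers(s, k):
            neg_counts[m] += 1

    scores = {
        m: pos_counts[m] / len(pos) - neg_counts.get(m, 0) / max(len(neg), 1)
        for m in pos_counts
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Invented toy sequences: the "peroxisome" set shares a C-terminal SKL.
toy = {
    "peroxisome": ["MAAGTSKL", "MGLVESKL", "MTPQASKL"],
    "cytosol": ["MAAGTPQA", "MGLVEDDR", "MTPQAGGT"],
}
ranked = discriminative_kmers(toy, "peroxisome", k=3)
print(ranked[0])
```

On this toy input the top-ranked k-mer is the one shared by every peroxisome sequence and by no cytosol sequence. A real analysis would need variable-length, position-tolerant motif models and held-out evaluation, which this sketch does not attempt.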

Index Terms:
Hidden Markov models, maximal mutual information estimate, discriminative motif finding, protein localization.
Tien-ho Lin, Robert F. Murphy, Ziv Bar-Joseph, "Discriminative Motif Finding for Predicting Protein Subcellular Localization," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 441-451, March-April 2011, doi:10.1109/TCBB.2009.82