Subcellular Localization Prediction through Boosting Association Rules
March/April 2012 (vol. 9 no. 2)
pp. 609-618
Yongwook Yoon, Pohang University of Science and Technology, Pohang
Gary Geunbae Lee, Pohang University of Science and Technology, Pohang
Computational methods for predicting protein subcellular localization have used various types of features, including N-terminal sorting signals, amino acid compositions, and text annotations from protein databases. Our approach does not use biological knowledge such as sorting signals or homologues; it uses only protein sequence information. The method divides a protein sequence into short k-mer sequence fragments, which can be mapped to word features in document classification. A large number of class association rules are mined from protein sequence examples that range from the N-terminus to the C-terminus. A boosting algorithm is then applied to those rules to build a final classifier. Experimental results on benchmark data sets show that our method is excellent in both classification performance and test coverage. The results also imply that the k-mer sequence features that determine subcellular locations do not necessarily occur at specific positions in a protein sequence. An online prediction service implementing our method is available at
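
The pipeline described in the abstract (k-mer features, class association rules, boosting) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a binary task with labels +1 and -1, rules whose antecedent is a single k-mer, made-up toy sequences, and hypothetical names and thresholds (`kmers`, `mine_rules`, `boost`, `min_support`, `min_conf`). The boosting loop is a plain discrete AdaBoost, which may differ from the variant used in the paper.

```python
import math
from collections import defaultdict

def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence; positions are ignored."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mine_rules(examples, k=3, min_support=2, min_conf=0.6):
    """Mine single-item rules 'k-mer -> class' (a simplification of class association rules)."""
    total = defaultdict(int)      # number of sequences containing each k-mer
    by_class = defaultdict(int)   # same count, split by class label
    for seq, label in examples:
        for km in kmers(seq, k):
            total[km] += 1
            by_class[(km, label)] += 1
    return sorted((km, label) for (km, label), n in by_class.items()
                  if n >= min_support and n / total[km] >= min_conf)

def rule_predict(rule, seq, k=3):
    """Weak classifier: vote for the rule's class if its k-mer occurs, else the other class."""
    km, label = rule
    return label if km in kmers(seq, k) else -label

def boost(examples, rules, k=3, rounds=5):
    """Discrete AdaBoost over the mined rules (assumed variant, binary labels +1/-1)."""
    n = len(examples)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best, best_err = None, None
        for r in rules:
            # Weighted error of this rule on the training examples.
            err = sum(wi for wi, (s, y) in zip(w, examples) if rule_predict(r, s, k) != y)
            if best_err is None or err < best_err:
                best, best_err = r, err
        if best is None or best_err >= 0.5:
            break                       # no rule is better than chance
        eps = max(best_err, 1e-10)      # avoid division by zero for a perfect rule
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, best))
        if best_err == 0:
            break                       # a perfect rule leaves nothing to reweight
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * y * rule_predict(best, s, k))
             for wi, (s, y) in zip(w, examples)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def classify(ensemble, seq, k=3):
    """Weighted vote of the selected rules."""
    score = sum(alpha * rule_predict(r, seq, k) for alpha, r in ensemble)
    return 1 if score >= 0 else -1

# Toy, made-up sequences: the positive class carries "KRK" at varying positions.
examples = [("MKRKAAL", 1), ("GGAKRKL", 1), ("LLKRKGG", 1),
            ("MALLGGA", -1), ("GGALLMA", -1), ("AALGGML", -1)]
rules = mine_rules(examples)
ensemble = boost(examples, rules)
print(rules)
print(classify(ensemble, "AAGKRKM"), classify(ensemble, "MLLGGAL"))
```

Because `kmers` returns a set, a rule fires wherever its fragment occurs in the sequence, which mirrors the abstract's observation that discriminative k-mers need not sit at fixed positions.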

[1] B. Eisenhaber and P. Bork, “Wanted: Subcellular Localization of Proteins Based on Sequence,” Trends in Cell Biology, vol. 8, pp. 169-170, 1998.
[2] G. Schatz and B. Dobberstein, “Common Principles of Protein Translocation across Membranes,” Science, vol. 271, no. 5255, pp. 1519-1526, 1996.
[3] K. Nakai, “Protein Sorting Signals and Prediction of Subcellular Localization,” Advances in Protein Chemistry, vol. 54, pp. 277-344, 2000.
[4] K. Nakai and M. Kanehisa, “A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells,” Genomics, vol. 14, pp. 897-911, 1992.
[5] O. Emanuelsson, H. Nielsen, S. Brunak, and G. von Heijne, “Predicting Subcellular Localization of Proteins Based on Their N-Terminal Amino Acid Sequence,” J. Molecular Biology, vol. 300, pp. 1005-1016, 2000.
[6] H. Bannai, Y. Tamada, O. Maruyama, K. Nakai, and S. Miyano, “Extensive Feature Detection of N-Terminal Protein Sorting Signals,” Bioinformatics, vol. 18, no. 2, pp. 298-305, 2002.
[7] K.-J. Park and M. Kanehisa, “Prediction of Protein Subcellular Locations by Support Vector Machines Using Compositions of Amino Acids and Amino Acid Pairs,” Bioinformatics, vol. 19, no. 13, pp. 1656-1663, 2003.
[8] Y.-D. Cai and K.-C. Chou, “Predicting Subcellular Localization of Proteins in a Hybridization Space,” Bioinformatics, vol. 20, no. 7, pp. 1151-1156, 2004.
[9] K.C. Chou and D.W. Elrod, “Protein Subcellular Location Prediction,” Protein Eng., vol. 12, pp. 107-118, 1999.
[10] E. Eskin and E. Agichtein, “Combining Text Mining and Sequence Analysis to Discover Protein Functional Regions,” Proc. Pacific Symp. Biocomputing, pp. 288-299, 2004.
[11] J. Hawkins, L. Davis, and M. Boden, “Predicting Nuclear Localization,” J. Proteome Research, vol. 6, pp. 1402-1409, 2007.
[12] R. Nair and B. Rost, “Inferring Sub-Cellular Localization through Automated Lexical Analysis,” Bioinformatics, vol. 18, pp. S78-S86, 2002.
[13] A. Bairoch and R. Apweiler, “The Swiss-Prot Protein Sequence Database and Its Supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45-48, 2000.
[14] Z. Lu, D. Szafron, R. Greiner, P. Lu, D.S. Wishart, B. Poulin, J. Anvik, C. Macdonell, and R. Eisner, “Predicting Subcellular Localization of Proteins Using Machine-Learned Classifiers,” Bioinformatics, vol. 20, no. 4, pp. 547-556, 2004.
[15] A. Höglund, P. Dönnes, T. Blum, H.-W. Adolph, and O. Kohlbacher, “MultiLoc: Prediction of Protein Subcellular Localization Using N-Terminal Targeting Sequences, Sequence Motifs and Amino Acid Composition,” Bioinformatics, vol. 22, no. 10, pp. 1158-1165, 2006.
[16] R. Nair, P. Carter, and B. Rost, “NLSdb: Database of Nuclear Localization Signals,” Nucleic Acids Research, vol. 31, pp. 397-399, 2003.
[17] A. Bairoch and P. Bucher, “Prosite: Recent Developments,” Nucleic Acids Research, vol. 22, pp. 3583-3589, 1994.
[18] H. Shatkay, A. Höglund, S. Brady, T. Blum, P. Dönnes, and O. Kohlbacher, “SherLoc: High-Accuracy Prediction of Protein Localization by Integrating Text and Protein Sequence Data,” Bioinformatics, vol. 23, no. 11, pp. 1410-1417, 2007.
[19] K.-C. Chou and H.-B. Shen, “Hum-PLoc: A Novel Ensemble Classifier for Predicting Human Protein Subcellular Localization,” Biochemical and Biophysical Research Comm., vol. 347, pp. 150-157, 2006.
[20] K.-C. Chou and H.-B. Shen, “Review: Recent Progresses in Protein Subcellular Location Prediction,” Analytical Biochemistry, vol. 370, pp. 1-16, 2007.
[21] B. Liu, W. Hsu, and Y. Ma, “Integrating Classification and Association Rule Mining,” Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 80-86, 1998.
[22] R. She, F. Chen, K. Wang, and M. Ester, “Frequent-Subsequence-Based Prediction of Outer Membrane Proteins,” Proc. Int'l Conf. Data Mining and Knowledge Discovery, pp. 436-445, 2003.
[23] Y. Jin, B. Niu, K. Feng, W. Lu, Y. Cai, and G. Li, “Predicting Subcellular Localization with Adaboost Learner,” Protein and Peptide Letters, vol. 15, no. 1, pp. 286-289, 2008.
[24] Y. Yoon and G.G. Lee, “Text Categorization Based on Boosting Association Rules,” Proc. Second IEEE Int'l Conf. Semantic Computing (ICSC 2008), pp. 136-143, 2008.
[25] M.A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, H. McWilliam, F. Valentin, I.M. Wallace, A. Wilm, R. Lopez, J.D. Thompson, T.J. Gibson, and D.G. Higgins, “Clustal W and Clustal X Version 2.0,” Bioinformatics/Computer Applications in the Biosciences, vol. 23, pp. 2947-2948, 2007.
[26] R. Agrawal, T. Imielinski, and A.N. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 207-216, 1993.
[27] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. ACM SIGMOD Int'l Conf. Management of Data, W. Chen, J. Naughton, and P. A. Bernstein, eds., pp. 1-12, 2000.
[28] W. Li, J. Han, and J. Pei, “CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules,” Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 369-376, 2001.
[29] J. Wang and G. Karypis, “Harmony: Efficiently Mining the Best Rules for Classification,” Proc. SIAM Int'l Conf. Data Mining (SDM), 2005.
[30] K.C. Chou and H.B. Shen, “Cell-PLoc: A Package of Web Servers for Predicting Subcellular Localization of Proteins in Various Organisms,” Nature Protocols, vol. 3, no. 2, pp. 153-162, 2008.
[31] R.E. Schapire, “The Strength of Weak Learnability,” Machine Learning, vol. 5, pp. 197-227, 1990.
[32] Y. Freund and R.E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” J. Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[33] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee, “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods,” Proc. Int'l Conf. Machine Learning (ICML), pp. 322-330, 1997.
[34] X.B. Zhou, C. Chen, Z.C. Li, and X.Y. Zou, “Using Chou's Amphiphilic Pseudo-Amino Acid Composition and Support Vector Machine for Prediction of Enzyme Subfamily Classes,” J. Theoretical Biology, vol. 248, pp. 546-551, 2007.
[35] K.C. Chou and H.B. Shen, “Foldrate: A Web-Server for Predicting Protein Folding Rates from Primary Sequence,” The Open Bioinformatics J., vol. 3, pp. 31-50, 2009.
[36] K.C. Chou and C.T. Zhang, “Review: Prediction of Protein Structural Classes,” Critical Rev. in Biochemistry and Molecular Biology, vol. 30, pp. 275-349, 1995.
[37] A.K. McCallum, “Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering,” 1996.
[38] J. Li, G. Dong, and K. Ramamohanarao, “Making Use of the Most Expressive Jumping Emerging Patterns for Classification,” Knowledge and Information Systems, vol. 3, no. 2, pp. 131-145, 2001.
[39] T. Gambin and K. Walczak, “A New Classification Method Using Array Comparative Genome Hybridization Data, Based on the Concept of Limited Jumping Emerging Patterns,” BMC Bioinformatics, vol. 10(Suppl 1), article S64, 2009.

Index Terms:
Clustering, classification, and association rules; bioinformatics (genome or protein) databases; pattern recognition.
Yongwook Yoon, Gary Geunbae Lee, "Subcellular Localization Prediction through Boosting Association Rules," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 609-618, March-April 2012, doi:10.1109/TCBB.2011.131