Localization Site Prediction for Membrane Proteins by Integrating Rule and SVM Classification
December 2005 (vol. 17 no. 12)
pp. 1694-1705
We study the localization prediction of membrane proteins for two families of medically important disease-causing bacteria, called Gram-negative and Gram-positive bacteria. Each such bacterium has its cell surrounded by several layers of membranes. Identifying where proteins are located in a bacterial cell is of primary research interest for antibiotic and vaccine drug design. This problem has three requirements. First, because any subsequence of amino acid residues is potentially a dimension, the problem has an extremely high dimensionality, with few dimensions being irrelevant. Second, the prediction of a target localization site must have a high precision to be useful to biologists, i.e., at least 90 percent or even 95 percent, while recall should be as high as possible. Achieving such precision is made harder by the fact that target sequences are often much fewer than background sequences. Third, the rationale of a prediction should be understandable to biologists so that they can act on it. Meeting all these requirements presents a significant challenge in that a high dimensionality requires a complex model that is often hard to understand. The support vector machine (SVM) model has outstanding performance in a high-dimensional space; therefore, it addresses the first two requirements. However, the SVM model involves many features in a single kernel function; therefore, it does not address the third requirement. We address all three requirements by integrating the SVM model with a rule-based model, where the understandable if-then rules capture "major structures" and the elaborated SVM model captures "subtle structures." Importantly, the integrated model preserves the precision/recall performance of SVM and, at the same time, exposes major structures in a form understandable to the human user. We focus on searching for high quality rules and partitioning the prediction between rules and SVM so as to achieve these properties.
We evaluate our method on several membrane localization problems. The purpose of this paper is not to improve the precision/recall of SVM, but to expose the rationale of an SVM classifier by partitioning the classification between if-then rules and the SVM classifier while preserving the precision/recall of SVM.
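
The partitioning idea described above can be illustrated with a minimal sketch. This is not the authors' algorithm: the rule pattern, the dipeptide weights, the toy sequences, and every function name below are hypothetical, and a small linear scoring function stands in for a trained SVM. The sketch only shows the control flow in which an understandable if-then rule answers first, a fallback classifier handles the remaining sequences, and precision/recall is measured on the combined output.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical if-then rule: (subsequence, predicted label).
# Label 1 = target localization site, 0 = background.
Rule = Tuple[str, int]


def kmer_counts(seq: str, k: int = 2) -> Counter:
    """Count overlapping length-k subsequences of a protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_fallback(weights: Dict[str, float], bias: float) -> Callable[[str], int]:
    """Stand-in for a trained SVM: thresholds a linear score w.x + b.

    The weights are made up for illustration, not learned from data.
    """
    def predict(seq: str) -> int:
        counts = kmer_counts(seq)
        score = bias + sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items())
        return 1 if score > 0 else 0
    return predict


def integrated_predict(seq: str, rules: List[Rule],
                       fallback: Callable[[str], int]) -> Tuple[int, str]:
    """Return (label, rationale): the first matching rule wins, else the fallback."""
    for pattern, label in rules:
        if pattern in seq:
            return label, f"rule: contains {pattern!r}"
    return fallback(seq), "fallback classifier"


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Precision and recall for the target class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for this sketch only.
rules = [("WGF", 1)]
fallback = linear_fallback({"YG": 1.0, "LL": -1.0}, bias=-0.5)
seqs = ["MKWGFAA", "MAYGYGA", "MLLLLKA", "MKKAAYA", "MKAYGLL"]
labels = [1, 1, 0, 0, 1]

results = [integrated_predict(s, rules, fallback) for s in seqs]
preds = [label for label, _ in results]
precision, recall = precision_recall(labels, preds)

for s, (label, why) in zip(seqs, results):
    print(s, label, why)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy run the rule accounts for one prediction and exposes its rationale directly, while the remaining sequences are decided by the fallback scorer. A real system would mine the rules from training data and replace the stand-in with a trained SVM.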

Index Terms: Bioinformatics (genome or protein) databases, clustering, classification, and association rules.
Senqiang Zhou, Ke Wang, "Localization Site Prediction for Membrane Proteins by Integrating Rule and SVM Classification," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1694-1705, Dec. 2005, doi:10.1109/TKDE.2005.201