Issue No.03 - May/June (2011 vol.8)
pp: 858-864
Minh N. Nguyen, BioInformatics Institute, Singapore
Jacek M. Zurada, BioInformatics Research Center, Singapore
Jagath C. Rajapakse, BioInformatics Research Center, Singapore
ABSTRACT
Although numerous computational techniques have been applied to predict protein secondary structure (PSS), only limited studies have dealt with discovery of logic rules underlying the prediction itself. Such rules offer interesting links between the prediction model and the underlying biology. In addition, they enhance interpretability of PSS prediction by providing a degree of transparency to the predicting model usually regarded as a black box. In this paper, we explore the generation and use of C4.5 decision trees to extract relevant rules from PSS predictions modeled with two-stage support vector machines (TS-SVM). The proposed rules were derived on the RS126 data set of 126 nonhomologous globular proteins and on the PSIPRED data set of 1,923 protein sequences. Our approach has produced sets of comprehensible, and often interpretable, rules underlying the PSS predictions. Moreover, many of the rules seem to be strongly supported by biological evidence. Further, our approach resulted in good prediction accuracy, few and usually compact rules, and rules that are generally of higher confidence levels than those generated by other rule extraction techniques.
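The idea of reading rules off a black-box predictor can be illustrated with a minimal sketch. It is not the authors' implementation: the function toy_black_box is a hypothetical stand-in for a trained TS-SVM, the three-residue windows are invented toy data, and only the gain-ratio split criterion associated with C4.5 is shown (no tree growth beyond one split, no pruning, no continuous attributes). Each extracted rule has the form "if the chosen window position holds residue X, predict class Y", with a confidence equal to the fraction of matching windows that the black box assigned to Y.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Gain ratio for splitting on the categorical attribute at index attr."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

def toy_black_box(window):
    """Hypothetical stand-in for a trained TS-SVM (assumption, toy logic only)."""
    centre = window[1]
    if centre in "AL":
        return "H"  # helix
    if centre in "VI":
        return "E"  # strand
    return "C"      # coil

def extract_rules(rows, labels, attr):
    """One rule per attribute value: (value, majority class, confidence)."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    rules = []
    for value, ys in sorted(groups.items()):
        cls, count = Counter(ys).most_common(1)[0]
        rules.append((value, cls, count / len(ys)))
    return rules

# Invented three-residue windows (positions -1, 0, +1); not real protein data.
windows = [tuple(w) for w in
           ["GAL", "LAV", "AVG", "GIA", "VLG", "AGL", "IGV", "LLA", "GVI", "AIG"]]

# Label the windows with the black box, then fit rules to its predictions.
predicted = [toy_black_box(w) for w in windows]
best = max(range(3), key=lambda a: gain_ratio(windows, predicted, a))
rules = extract_rules(windows, predicted, best)

for value, cls, conf in rules:
    print(f"if position {best - 1:+d} is {value} then {cls} (confidence {conf:.2f})")
```

Because the rules are fitted to the predictor's outputs rather than to the true labels, they describe what the model has learned; a full pipeline would grow and prune a complete tree on real feature vectors.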
INDEX TERMS
Protein structure, secondary structure prediction, support vector machines, multiclass SVM, C4.5 decision trees, rule extraction.
CITATION
Minh N. Nguyen, Jacek M. Zurada, Jagath C. Rajapakse, "Toward Better Understanding of Protein Secondary Structure: Extracting Prediction Rules", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 3, pp. 858-864, May/June 2011, doi:10.1109/TCBB.2010.16