Issue No. 03 - May/June (2011 vol. 8)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2010.16
Minh N. Nguyen, BioInformatics Institute, Singapore
Jacek M. Zurada, BioInformatics Research Center, Singapore
Jagath C. Rajapakse, BioInformatics Research Center, Singapore
Although numerous computational techniques have been applied to predict protein secondary structure (PSS), only a few studies have addressed the discovery of the logic rules underlying the prediction itself. Such rules offer interesting links between the prediction model and the underlying biology. In addition, they enhance the interpretability of PSS prediction by providing a degree of transparency to a predicting model that is usually regarded as a black box. In this paper, we explore the generation and use of C4.5 decision trees to extract relevant rules from PSS predictions modeled with two-stage support vector machines (TS-SVM). The proposed rules were derived on the RS126 data set of 126 nonhomologous globular proteins and on the PSIPRED data set of 1,923 protein sequences. Our approach produced sets of comprehensible, and often interpretable, rules underlying the PSS predictions. Moreover, many of the rules seem to be strongly supported by biological evidence. Further, our approach resulted in good prediction accuracy and in few, usually compact, rules that generally have higher confidence levels than those generated by other rule extraction techniques.
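The general idea of extracting rules from a black-box predictor can be illustrated with a minimal sketch. It is not the authors' implementation: the `black_box` function is a hypothetical toy stand-in for a trained TS-SVM, the three-residue window and the example sequence are assumptions, and only a single C4.5-style split (chosen by gain ratio) is made rather than a full, pruned tree. Rule confidence is taken here to be the fraction of covered windows that share the rule's class, which is also an assumption.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(samples, labels, attr):
    """C4.5-style gain ratio for splitting on one categorical attribute."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[x[attr]].append(y)
    total = len(labels)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    # Split information: entropy of the attribute values themselves.
    split_info = entropy([x[attr] for x in samples])
    if split_info == 0:
        return 0.0
    return (entropy(labels) - remainder) / split_info


def extract_rules(samples, labels):
    """Make one split on the best attribute and emit one rule per value.

    Each rule is (attribute index, value, predicted class, confidence, support).
    """
    best = max(range(len(samples[0])),
               key=lambda a: gain_ratio(samples, labels, a))
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[x[best]].append(y)
    rules = []
    for value, ys in sorted(groups.items()):
        cls, count = Counter(ys).most_common(1)[0]
        rules.append((best, value, cls, count / len(ys), len(ys)))
    return rules


def black_box(window):
    """Hypothetical stand-in for a trained predictor (toy logic, not biology)."""
    center = window[len(window) // 2]
    if center in "AELM":
        return "H"  # helix
    if center in "VIY":
        return "E"  # strand
    return "C"      # coil


def windows(seq, size=3):
    """All contiguous residue windows of the given size."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


if __name__ == "__main__":
    seq = "MAEVLKIYGPAELVNGSP"  # made-up sequence for illustration only
    samples = windows(seq)
    # Label each window with the black box's prediction, then fit rules to it.
    labels = [black_box(w) for w in samples]
    for attr, value, cls, conf, support in extract_rules(samples, labels):
        print(f"IF residue[{attr}] == {value!r} THEN {cls} "
              f"(confidence {conf:.2f}, support {support})")
```

The rules are fitted to the black box's outputs rather than to the true structure labels, so they describe the model's behavior; a full C4.5 tree would recurse on each branch and prune the result.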
Protein structure, secondary structure prediction, support vector machines, multiclass SVM, C4.5 decision trees, rule extraction.
J. M. Zurada, M. N. Nguyen and J. C. Rajapakse, "Toward Better Understanding of Protein Secondary Structure: Extracting Prediction Rules," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 858-864, 2010.