Issue No.12 - December (2005 vol.17)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.201
We study localization prediction of membrane proteins for two families of medically important disease-causing bacteria, Gram-negative and Gram-positive bacteria. Each such bacterium has its cell surrounded by several layers of membranes. Identifying where proteins are located in a bacterial cell is of primary research interest for antibiotic and vaccine drug design. This problem has three requirements. First, because any subsequence of amino acid residues is potentially a dimension, the problem has extremely high dimensionality, with few dimensions being irrelevant. Second, the prediction of a target localization site must have high precision to be useful to biologists, i.e., at least 90 percent or even 95 percent, while keeping recall as high as possible. Achieving such precision is made harder by the fact that target sequences are often far fewer than background sequences. Third, the rationale of a prediction should be understandable to biologists so that they can act on it. Meeting all these requirements presents a significant challenge, because high dimensionality requires a complex model that is often hard to understand. The support vector machine (SVM) model performs outstandingly in a high-dimensional space and therefore addresses the first two requirements. However, the SVM model involves many features in a single kernel function and therefore does not address the third requirement. We address all three requirements by integrating the SVM model with a rule-based model, where the understandable if-then rules capture "major structures" and the elaborate SVM model captures "subtle structures." Importantly, the integrated model preserves the precision/recall performance of SVM and, at the same time, exposes major structures in a form understandable to the human user. We focus on searching for high-quality rules and on partitioning the prediction between rules and SVM so as to achieve these properties.
We evaluate our method on several membrane localization problems. The purpose of this paper is not to improve the precision/recall of SVM, but to expose the rationale of an SVM classifier by partitioning the classification between if-then rules and the SVM classifier while preserving the precision/recall of SVM.
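The idea of partitioning prediction between if-then rules and a fallback classifier can be illustrated with a minimal sketch. This is not the authors' algorithm: the rule form ("if the sequence contains motif m, predict the target site"), the thresholds `min_precision` and `min_support`, the toy sequences, and the `fallback` function (a stand-in for a trained SVM) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def rule_precision(motif: str, sequences: List[str], labels: List[int]) -> Tuple[float, int]:
    """Return (precision, support) of the rule 'if motif in sequence then target'."""
    covered = [y for s, y in zip(sequences, labels) if motif in s]
    if not covered:
        return 0.0, 0
    return sum(covered) / len(covered), len(covered)


def select_rules(candidates: List[str], sequences: List[str], labels: List[int],
                 min_precision: float = 0.95, min_support: int = 2) -> List[str]:
    """Keep only candidate motifs whose rule is precise enough on the training data."""
    rules = []
    for motif in candidates:
        precision, support = rule_precision(motif, sequences, labels)
        if precision >= min_precision and support >= min_support:
            rules.append(motif)
    return rules


def hybrid_predict(sequence: str, rules: List[str],
                   fallback: Callable[[str], int]) -> Tuple[int, str]:
    """Predict with the first matching rule; otherwise defer to the fallback model."""
    for motif in rules:
        if motif in sequence:
            return 1, f"rule: contains {motif!r}"
    return fallback(sequence), "fallback classifier"


# Hypothetical toy data: label 1 = target localization site, 0 = background.
train_sequences = ["MKKLLA", "AKKLLG", "MGGSSA", "TTGSSA", "MKAAAT"]
train_labels = [1, 1, 0, 0, 0]
candidates = ["KKLL", "GSS", "MK"]

rules = select_rules(candidates, train_sequences, train_labels)


def fallback(sequence: str) -> int:
    """Stand-in for a trained SVM; a real system would call its decision function."""
    return 1 if sequence.count("K") >= 3 else 0


covered_by_rule = hybrid_predict("GGKKLLT", rules, fallback)
deferred = hybrid_predict("KAKAKA", rules, fallback)
```

In this sketch a sequence matched by a retained rule receives an explanation a biologist can read directly, while every other sequence is handled by the fallback model, so the rules expose the "major structures" without having to cover every case.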
Index Terms- Bioinformatics (genome or protein) databases, clustering, classification, and association rules.
Senqiang Zhou, Ke Wang, "Localization Site Prediction for Membrane Proteins by Integrating Rule and SVM Classification", IEEE Transactions on Knowledge & Data Engineering, vol.17, no. 12, pp. 1694-1705, December 2005, doi:10.1109/TKDE.2005.201