Issue No. 01 - January-March (2009 vol. 6)
Abstract-- A novel approach for gene classification is proposed, which adopts codon usage bias as the input feature vector for classification by support vector machines (SVMs). The DNA sequence is first converted to a 59-dimensional feature vector in which each element corresponds to the relative synonymous usage frequency of a codon. As the input to the classifier is independent of sequence length and variance, our approach is useful when the sequences to be classified are of different lengths, a condition under which homology-based methods tend to fail. The method is demonstrated using 1,841 Human Leukocyte Antigen (HLA) sequences, which are classified into two major classes, HLA-I and HLA-II; each major class is further subdivided into subgroups of HLA-I and HLA-II molecules. Using codon usage frequencies, a binary SVM achieved an accuracy rate of 99.3% for HLA major class classification, and a multi-class SVM achieved accuracy rates of 99.73% and 98.38% for sub-class classification of HLA-I and HLA-II molecules, respectively. The results show that gene classification based on codon usage bias is consistent with the molecular structures and biological functions of HLA molecules.
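The conversion of a coding sequence into a 59-dimensional codon usage vector can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard genetic code, the commonly used RSCU normalization (observed codon count divided by the mean count over that amino acid's synonymous codons), an in-frame sequence starting at position 0, and a value of 0.0 when an amino acid does not occur. The names `CODONS_59` and `rscu_vector` are hypothetical.

```python
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons into synonymous families, one family per amino acid.
FAMILIES = {}
for codon, aa in CODON_TO_AA.items():
    FAMILIES.setdefault(aa, []).append(codon)

# Drop the 3 stop codons and the single-codon amino acids (Met, Trp),
# which leaves 64 - 3 - 2 = 59 codons.
CODONS_59 = sorted(
    codon
    for aa, family in FAMILIES.items()
    if aa != "*" and len(family) > 1
    for codon in family
)


def rscu_vector(seq):
    """Return the 59 RSCU values of an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    # Count complete, non-overlapping codons starting at position 0.
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    vector = []
    for codon in CODONS_59:
        family = FAMILIES[CODON_TO_AA[codon]]
        total = sum(counts[c] for c in family)
        # RSCU = count / (family total / family size); 0.0 if the amino
        # acid is absent (an assumption made for this sketch).
        vector.append(len(family) * counts[codon] / total if total else 0.0)
    return vector


features = rscu_vector("GCTGCTGCC")  # three alanine codons
print(len(features))
```

Because every sequence maps to a vector of the same length, sequences of different lengths can be passed to the same classifier, which is the property the abstract relies on.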
Cluster analysis, codon usage bias, gene classification, Human Leukocyte Antigen (HLA), Major Histocompatibility Complex (MHC), Relative Synonymous Codon Use (RSCU) frequency
J. C. Rajapakse, J. Ma and M. N. Nguyen, "Gene Classification Using Codon Usage and Support Vector Machines," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 1, pp. 134-143, 2009.