Issue No.01 - January-March (2009 vol.6)
pp: 134-143
ABSTRACT
A novel approach for gene classification is proposed, which adopts codon usage bias as the input feature vector for classification by support vector machines (SVMs). The DNA sequence is first converted to a 59-dimensional feature vector in which each element corresponds to the relative synonymous usage frequency of a codon. As the input to the classifier is independent of sequence length and variance, the approach is useful when the sequences to be classified are of different lengths, a condition under which homology-based methods tend to fail. The method is demonstrated using 1,841 Human Leukocyte Antigen (HLA) sequences, which are classified into two major classes, HLA-I and HLA-II; each major class is further subdivided into subgroups of HLA-I and HLA-II molecules. Using codon usage frequencies, a binary SVM achieved an accuracy of 99.3% for HLA major class classification, and multi-class SVMs achieved accuracies of 99.73% and 98.38% for subclass classification of HLA-I and HLA-II molecules, respectively. The results show that gene classification based on codon usage bias is consistent with the molecular structures and biological functions of HLA molecules.
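As an illustration of the feature extraction described in the abstract, the sketch below converts a coding DNA sequence into a 59-dimensional vector. It assumes the standard genetic code and the usual definition of relative synonymous codon usage (the observed count of a codon divided by the mean count over all codons that encode the same amino acid). The 59 codons are the 64 triplets minus the three stop codons, ATG (Met), and TGG (Trp), which have no synonymous alternatives. All names (`CODON_TABLE`, `FAMILIES`, `CODONS_59`, `rscu_vector`) are hypothetical, and assigning 0 to every codon of an amino acid that does not occur in the sequence is an assumption, not something stated in the abstract.

```python
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

# Group codons into synonymous families, skipping stop codons (*),
# Met (M) and Trp (W), which have a single codon each.
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa not in "*MW":
        FAMILIES.setdefault(aa, []).append(codon)

# Fixed codon order used for the feature vector (59 entries).
CODONS_59 = [c for aa in sorted(FAMILIES) for c in FAMILIES[aa]]


def rscu_vector(seq):
    """Return the 59 RSCU values of an in-frame coding sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3          # drop a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    vec = []
    for aa in sorted(FAMILIES):
        family = FAMILIES[aa]
        total = sum(counts[c] for c in family)
        for c in family:
            # RSCU = count / (mean count over the synonymous family).
            # Assumption: 0.0 when the amino acid is absent from the sequence.
            vec.append(len(family) * counts[c] / total if total else 0.0)
    return vec
```

Calling `rscu_vector` on sequences of different lengths always returns a list of 59 values, which is what makes the representation independent of sequence length. Such vectors could then be passed to any SVM implementation; the choice of library and kernel is left open here because the abstract does not specify it.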
INDEX TERMS
Cluster analysis, codon usage bias, gene classification, Human Leukocyte Antigen (HLA), Major Histocompatibility Complex (MHC), Relative Synonymous Codon Use (RSCU) frequency
CITATION
Minh N. Nguyen, Jagath C. Rajapakse, "Gene Classification Using Codon Usage and Support Vector Machines," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 1, pp. 134-143, January-March 2009, doi:10.1109/TCBB.2007.70240