Issue No. 05, Sept.-Oct. 2012 (vol. 9)
pp. 1387-1398
Uday Kamath, Dept. of Comput. Sci., George Mason Univ., Ashburn, VA, USA
Jack Compton, Barquin Int., Alexandria, VA, USA
Rezarta Islamaj-Dogan, Nat. Center for Biotechnol. Inf. (NCBI), Nat. Inst. of Health (NIH), Bethesda, MD, USA
Kenneth A. De Jong, Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
Amarda Shehu, Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
ABSTRACT
Associating functional information with biological sequences remains a challenge for machine learning methods. The performance of these methods often depends on deriving predictive features from the sequences to be classified. Feature generation is a difficult problem, as the connection between the sequence features and the sought property is not known a priori. It is often left to domain experts or exhaustive feature enumeration techniques to generate a few features, whose predictive power is then tested in the context of classification. This paper proposes an evolutionary algorithm to effectively explore a large feature space and generate predictive features from sequence data. The effectiveness of the algorithm is demonstrated on an important component of the gene-finding problem, DNA splice site prediction. This application is chosen due to the complexity of the features needed to obtain high classification accuracy and precision. The obtained features are evaluated in the context of classification by Support Vector Machines, and the results show significant improvement in accuracy and precision over state-of-the-art approaches.
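As a rough illustration of the general idea of evolving features from sequence data, the sketch below runs a very small evolutionary loop over fixed-length DNA motifs. It is not the algorithm from the paper, whose feature representation and operators are not described in the abstract. Everything here is an assumption made for illustration: the motif-presence feature, the fitness function (difference in motif occurrence rate between the two classes), the mutation and crossover operators, and all names (`fitness`, `evolve_features`, `featurize`) and parameter defaults.

```python
import random

ALPHABET = "ACGT"


def fitness(motif, positives, negatives):
    """Illustrative fitness (an assumption): absolute difference between the
    fraction of positive and of negative sequences that contain the motif."""
    pos_rate = sum(motif in s for s in positives) / len(positives)
    neg_rate = sum(motif in s for s in negatives) / len(negatives)
    return abs(pos_rate - neg_rate)


def mutate(motif, rng):
    """Overwrite one randomly chosen position with a random nucleotide."""
    i = rng.randrange(len(motif))
    return motif[:i] + rng.choice(ALPHABET) + motif[i + 1:]


def crossover(a, b, rng):
    """One-point crossover of two equal-length motifs (length must be >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def rank(motifs, positives, negatives):
    """Unique motifs, best fitness first; ties broken alphabetically so that
    the result does not depend on set iteration order."""
    return sorted(set(motifs),
                  key=lambda m: (-fitness(m, positives, negatives), m))


def evolve_features(positives, negatives, motif_len=4, pop_size=30,
                    generations=25, top_k=5, seed=0):
    """Evolve motifs and return up to top_k of the best ones found.
    Assumes motif_len >= 2 and an initial population with >= 2 distinct motifs."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in range(motif_len))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: keep the better half as parents.
        parents = rank(population, positives, negatives)[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return rank(population, positives, negatives)[:top_k]


def featurize(seq, motifs):
    """Binary feature vector: 1 if the motif occurs in the sequence, else 0."""
    return [1 if m in seq else 0 for m in motifs]


if __name__ == "__main__":
    # Toy data (made up): positives share the planted motif "GTAA".
    pos = ["ACGTAAGT", "TTGTAAGC", "GGGTAAGA", "CAGTAAGG"]
    neg = ["ACACACAC", "TTTTCCCC", "GAGAGAGA", "CCCCAAAA"]
    motifs = evolve_features(pos, neg)
    print(motifs)
    print([featurize(s, motifs) for s in pos + neg])
```

The binary vectors returned by `featurize` could then be passed to a classifier such as a Support Vector Machine (for example `sklearn.svm.SVC`, if scikit-learn is available) to test the predictive power of the evolved features, which mirrors the evaluation setting the abstract describes.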
INDEX TERMS
support vector machines, biological techniques, DNA, evolutionary computation, genetic algorithms, molecular biophysics, genetic programming, evolutionary algorithm approach, feature generation, DNA splice site prediction, biological sequence data, machine learning methods, gene-finding problem, state-of-the-art approach, bioinformatics, accuracy, training data, prediction algorithms, DNA splice sites, feature extraction and construction, classifier design and evaluation, data mining
CITATION
Uday Kamath, Jack Compton, Rezarta Islamaj-Dogan, Kenneth A. De Jong, Amarda Shehu, "An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 5, pp. 1387-1398, Sept.-Oct. 2012, doi:10.1109/TCBB.2012.53