An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction
Issue No. 05 - Sept.-Oct. (2012 vol. 9)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2012.53
Uday Kamath , Dept. of Comput. Sci., George Mason Univ., Ashburn, VA, USA
Jack Compton , Barquin Int., Alexandria, VA, USA
Rezarta Islamaj-Dogan , Nat. Center for Biotechnol. Inf. (NCBI), Nat. Inst. of Health (NIH), Bethesda, MD, USA
Kenneth A. De Jong , Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
Amarda Shehu , Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
Associating functional information with biological sequences remains a challenge for machine learning methods. The performance of these methods often depends on deriving predictive features from the sequences to be classified. Feature generation is a difficult problem, as the connection between the sequence features and the sought property is not known a priori. It is often the task of domain experts or exhaustive feature enumeration techniques to generate a few features whose predictive power is then tested in the context of classification. This paper proposes an evolutionary algorithm to effectively explore a large feature space and generate predictive features from sequence data. The effectiveness of the algorithm is demonstrated on an important component of the gene-finding problem, DNA splice site prediction. This application is chosen because of the complexity of the features needed to obtain high classification accuracy and precision. We test the effectiveness of the obtained features in the context of classification by Support Vector Machines, and our results show significant improvement in accuracy and precision over state-of-the-art approaches.
support vector machines, biological techniques, DNA, evolutionary computation, genetic algorithms, molecular biophysics, genetic programming, evolutionary algorithm approach, feature generation, DNA splice site prediction, biological sequence data, machine learning methods, gene-finding problem, state-of-the-art approach, bioinformatics, accuracy, training data, prediction algorithms, DNA splice sites, feature extraction and construction, classifier design and evaluation, data mining
U. Kamath, J. Compton, R. Islamaj-Dogan, K. A. De Jong and A. Shehu, "An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 5, pp. 1387-1398, Sept.-Oct. 2012.