Issue No.06 - Nov.-Dec. (2012 vol.9)
pp: 1676-1689
W. Ashlock, Dept. of Comput. Sci. & Eng., York Univ., Toronto, ON, Canada
S. Datta, Dept. of Comput. Sci. & Eng., York Univ., Toronto, ON, Canada
Side effect machines produce features for classifiers that distinguish different types of DNA sequences. They also have the as yet unexploited potential to give insight into biological features of the sequences. We introduce several innovations to the production and use of side effect machine sequence features. We compare the results of using consensus sequences and genomic sequences for training classifiers and find that more accurate results can be obtained using genomic sequences. Surprisingly, we were even able to build a classifier that distinguished consensus sequences from genomic sequences with high accuracy, suggesting that consensus sequences are not always representative of their genomic counterparts. We apply our techniques to the problem of distinguishing two types of transposable elements, solo LTRs and SINEs. Identifying these sequences is important because they affect gene expression, genome structure, and genetic diversity, and they serve as genetic markers. They are of similar length, neither codes for protein, and both have many nearly identical copies throughout the genome. Being able to distinguish them efficiently and automatically will aid efforts to improve genome annotations. Our approach reveals structural characteristics of the sequences of potential interest to biologists.
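To make the idea of a side effect machine feature concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common formulation in which a side effect machine is a finite state machine over the DNA alphabet with a counter attached to each state, and the normalized visit counts form the feature vector. The function name `sem_features`, the transition-table layout, and the two-state machine `TOY_MACHINE` are hypothetical and chosen only for illustration; in the paper the machines are evolved rather than written by hand.

```python
from typing import Dict, List

ALPHABET = "ACGT"


def sem_features(sequence: str,
                 transitions: List[Dict[str, int]],
                 start: int = 0) -> List[float]:
    """Run a sequence through a state machine and return, for each state,
    the fraction of steps that ended in that state (assumed feature form)."""
    counts = [0] * len(transitions)
    state = start
    for base in sequence.upper():
        if base not in ALPHABET:
            continue  # skip ambiguous symbols such as N
        state = transitions[state][base]
        counts[state] += 1  # the "side effect": count each state visit
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical hand-written machine: state 1 is entered on C or G,
# state 0 on A or T, so feature 1 equals the GC fraction of the sequence.
TOY_MACHINE = [
    {"A": 0, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 1, "T": 0},
]

if __name__ == "__main__":
    print(sem_features("ACGT", TOY_MACHINE))
```

Feature vectors of this kind, one per sequence, could then be passed to any standard classifier; which machines yield discriminating features is what the evolutionary search is meant to discover.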
Bioinformatics, Genomics, Genetic algorithms, DNA, Feature extraction, Machine learning, side effect machines, Endogenous retroviruses, LTR retrotransposons, SINE elements, feature evaluation and selection, machine learning, evolutionary computing and genetic algorithms
W. Ashlock, S. Datta, "Distinguishing Endogenous Retroviral LTRs from SINE Elements Using Features Extracted from Evolved Side Effect Machines", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1676-1689, Nov.-Dec. 2012, doi:10.1109/TCBB.2012.116