Markov Encoding for Detecting Signals in Genomic Sequences
April-June 2005 (vol. 2 no. 2)
pp. 131-142

Abstract—We present a technique to encode the inputs to neural networks for the detection of signals in genomic sequences. The encoding is based on lower-order Markov models which incorporate known biological characteristics in genomic sequences. The neural networks then learn intrinsic higher-order dependencies of nucleotides at the signal sites. We demonstrate the efficacy of the Markov encoding method in the detection of three genomic signals, namely, splice sites, transcription start sites, and translation initiation sites.
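
The encoding idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the model order, the smoothing, or the network layout, so the code below assumes a position-specific first-order Markov model with add-one pseudocounts. The names `fit_first_order` and `markov_encode` are hypothetical. Each nucleotide is replaced by its estimated conditional probability given the preceding nucleotide, and the resulting numeric vector is the kind of input a neural network could then be trained on.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def fit_first_order(seqs, pseudocount=1.0):
    """Estimate position-specific P(x_i | x_{i-1}) from aligned, equal-length sequences.

    Assumption for illustration: first-order model with additive smoothing.
    Position 0 has no predecessor, so it is conditioned on the empty string.
    """
    length = len(seqs[0])
    # counts[i][prev][cur] = number of training sequences with `cur` at i after `prev`
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(length)]
    for s in seqs:
        for i, cur in enumerate(s):
            prev = s[i - 1] if i > 0 else ""
            counts[i][prev][cur] += 1.0

    def prob(i, prev, cur):
        row = counts[i][prev]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row[cur] + pseudocount) / total

    return prob


def markov_encode(seq, prob):
    """Map a sequence to one conditional probability per position."""
    return [prob(i, seq[i - 1] if i > 0 else "", seq[i]) for i in range(len(seq))]


# Toy aligned windows around a hypothetical signal site (made-up data).
train = ["ACGT", "ACGA", "TCGT"]
prob = fit_first_order(train)
encoded = markov_encode("ACGT", prob)
print(encoded)
```

In this sketch the low-order statistics are fixed before training, so a downstream network only has to model whatever dependencies remain among the encoded positions, which is the division of labor the abstract describes.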

Index Terms:
Genomic sequences, gene structure prediction, Markov chain, neural networks, splice sites, transcription start site, translation initiation site.
Jagath C. Rajapakse, Loi Sy Ho, "Markov Encoding for Detecting Signals in Genomic Sequences," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 131-142, April-June 2005, doi:10.1109/TCBB.2005.27