Designing Filters for Fast-Known NcRNA Identification
May-June 2012 (vol. 9 no. 3)
pp. 774-787
J. Buhler, Dept. of Comput. Sci., Washington Univ., St. Louis, MO, USA
Yanni Sun, Dept. of Comput. Sci. & Eng., Michigan State Univ., East Lansing, MI, USA
Cheng Yuan, Dept. of Comput. Sci. & Eng., Michigan State Univ., East Lansing, MI, USA
Detecting members of known noncoding RNA (ncRNA) families in genomic DNA is an important part of sequence annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for genome-wide search. This cost can be reduced by using a filter to exclude sequences that are unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect ncRNA instances lacking strong conservation while excluding most irrelevant sequences remains challenging. In this work, we design three types of filters based on multiple secondary structure profiles (SSPs). An SSP augments a regular profile (i.e., a position weight matrix) with secondary structure information but can still be efficiently scanned against long sequences. Multi-SSP-based filters combine evidence from multiple SSP matches and can achieve high sensitivity and specificity. Our SSP-based filters are extensively tested on the BRAliBase III data set, Rfam 9.0, and a published soil metagenomic data set. In addition, we compare the SSP-based filters with several other ncRNA search tools, including Infernal (with profile HMMs as filters), ERPIN, and tRNAscan-SE. Our experiments demonstrate that carefully designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families. The designed filters and filter-scanning programs are available at our website: www.cse.msu.edu/~yannisun/ssp/.
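
To make the idea of scanning a profile that carries secondary structure information more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy profile of width 6 in which two columns are scored independently, as in an ordinary position weight matrix, and two column pairs are scored jointly as base pairs. All names (`SINGLE`, `PAIRS`, `PAIR_SCORE`, `score_window`, `scan`) and all score values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy profile of width 6 (values are illustrative only).
WIDTH = 6

# Unpaired columns: per-base scores, as in a position weight matrix.
SINGLE: Dict[int, Dict[str, float]] = {
    2: {"A": 1.0, "C": -1.0, "G": -1.0, "U": -1.0},
    3: {"A": -1.0, "C": -1.0, "G": 1.0, "U": -1.0},
}

# Paired columns: positions (i, j) are scored jointly as a base pair.
PAIRS: List[Tuple[int, int]] = [(0, 5), (1, 4)]
PAIR_SCORE: Dict[Tuple[str, str], float] = {
    ("G", "C"): 2.0, ("C", "G"): 2.0,
    ("A", "U"): 1.5, ("U", "A"): 1.5,
    ("G", "U"): 0.5, ("U", "G"): 0.5,
}
MISMATCH = -2.0   # assumed penalty for a non-pairing combination
UNKNOWN = -1.0    # assumed score for an unexpected symbol


def score_window(window: str) -> float:
    """Score one window: unpaired columns plus jointly scored pairs."""
    total = 0.0
    for pos, column in SINGLE.items():
        total += column.get(window[pos], UNKNOWN)
    for i, j in PAIRS:
        total += PAIR_SCORE.get((window[i], window[j]), MISMATCH)
    return total


def scan(seq: str, threshold: float) -> List[Tuple[int, float]]:
    """Slide the profile along seq and keep windows scoring >= threshold."""
    rna = seq.upper().replace("T", "U")  # accept DNA input
    hits = []
    for start in range(len(rna) - WIDTH + 1):
        s = score_window(rna[start:start + WIDTH])
        if s >= threshold:
            hits.append((start, s))
    return hits
```

In a filtering setting, only the windows returned by `scan` would be passed on to the more expensive CM search. How matches from several such profiles are combined into a single decision is specific to the paper and is not reproduced here.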

Index Terms:
Web sites, biology computing, covariance analysis, DNA, filters, genomics, molecular biophysics, physiological models, RNA, website, designing filters, fast-known RNA identification, noncoding RNA families, genomic DNA, sequence annotation, covariance model, genome-wide search, multiple secondary structure profiles, secondary structure information, SSP-based filters, soil metagenomic data set, Sensitivity, Bioinformatics, Hidden Markov models, Dynamic programming, Algorithm design and analysis, Heuristic algorithms, formal languages, Algorithms for data and knowledge, bioinformatics (genome or protein), feature extraction or construction
Citation:
J. Buhler, Yanni Sun, Cheng Yuan, "Designing Filters for Fast-Known NcRNA Identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 3, pp. 774-787, May-June 2012, doi:10.1109/TCBB.2011.149