Semantics and Ambiguity of Stochastic RNA Family Models
March/April 2011 (vol. 8 no. 2)
pp. 499-516
Robert Giegerich, Bielefeld University, Bielefeld
Christian Höner zu Siederdissen, University of Vienna, Vienna
Stochastic models, such as hidden Markov models or stochastic context-free grammars (SCFGs), can fail to return the correct, maximum-likelihood solution in the case of semantic ambiguity. This problem arises when the algorithm implementing the model inspects the same solution in different guises. It is a difficult problem: proving semantic nonambiguity has been shown to be algorithmically undecidable, while compensating for ambiguity (by coalescing the scores of equivalent solutions) has been shown to be NP-hard. For stochastic context-free grammars modeling RNA secondary structure, it has been shown that the distortion of results can be quite severe. Much less is known about the case in which stochastic context-free grammars model the matching of a query sequence to an implicit consensus structure for an RNA family. We find that three different, meaningful semantics can be associated with the matching of a query against the model: a structural, an alignment, and a trace semantics. Rfam models correctly implement the alignment semantics, and are ambiguous with respect to the other two semantics, which are more abstract. We show how provably correct models can be generated for the trace semantics. For approaches where such a proof is not possible, we present an automated pipeline to check post factum for ambiguity of the generated models. We propose that both the structure and the trace semantics are worthwhile concepts for further study, possibly better suited to capture remotely related family members.
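
The effect described in the abstract can be illustrated with a minimal sketch. It is not taken from the article: the toy hidden Markov model below, its state names (`H1`, `H2`, `L`), its labels, and all of its probabilities are hypothetical, chosen only so that two states carry the same meaning. The code enumerates every state path by brute force, which is feasible only for tiny inputs, and compares the most probable single path (what a Viterbi-style algorithm reports) with the most probable annotation obtained by summing the scores of equivalent paths.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy HMM: H1 and H2 are different states with the same meaning.
LABEL = {"H1": "helix", "H2": "helix", "L": "loop"}
START = {"H1": 0.3, "H2": 0.3, "L": 0.4}
TRANS = {
    "H1": {"H1": 0.5, "H2": 0.5, "L": 0.0},
    "H2": {"H1": 0.5, "H2": 0.5, "L": 0.0},
    "L": {"H1": 0.0, "H2": 0.0, "L": 1.0},
}
# Uniform emissions, so only the state paths decide the outcome.
EMIT = {state: {"a": 0.5, "c": 0.5} for state in LABEL}


def path_probability(path, seq):
    """Joint probability of one state path and the observed sequence."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][sym]
    return p


def analyse(seq):
    """Return the best single path and the best coalesced annotation."""
    by_path = {}
    by_annotation = defaultdict(float)
    # Brute-force enumeration of all state paths (exponential in len(seq)).
    for path in product(LABEL, repeat=len(seq)):
        p = path_probability(path, seq)
        if p > 0:
            by_path[path] = p
            # Paths with identical label sequences are the same solution.
            by_annotation[tuple(LABEL[s] for s in path)] += p
    best_path = max(by_path, key=by_path.get)
    best_annotation = max(by_annotation, key=by_annotation.get)
    return (best_path, by_path[best_path],
            best_annotation, by_annotation[best_annotation])


if __name__ == "__main__":
    print(analyse("ac"))
```

For the input `"ac"`, the single best path stays in `L` with probability 0.1, while the four helix paths each score 0.0375 and together reach 0.15. Choosing by path therefore reports the loop annotation even though the helix annotation is more probable. The enumeration is only a demonstration device; as the abstract notes, coalescing scores of equivalent solutions is NP-hard in general.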

Index Terms:
RNA secondary structure, RNA family models, covariance models, semantic ambiguity.
Robert Giegerich, Christian Höner zu Siederdissen, "Semantics and Ambiguity of Stochastic RNA Family Models," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 499-516, March-April 2011, doi:10.1109/TCBB.2010.12