Issue No.06 - Nov.-Dec. (2012 vol.9)
pp: 1737-1750
T. Marschall , Life Sci. Group, Centrum Wiskunde & Inf. (CWI), Amsterdam, Netherlands
I. Herms , Fac. of Technol., Bielefeld Univ., Bielefeld, Germany
H. Kaltenbach , Dept. of Biosyst. Sci. & Eng., Swiss Fed. Inst. of Technol. (ETH), Basel, Switzerland
S. Rahmann , Inst. of Human Genetics, Univ. of Duisburg-Essen, Essen, Germany
ABSTRACT
We present a comprehensive review on probabilistic arithmetic automata (PAAs), a general model to describe chains of operations whose operands depend on chance, along with two algorithms to numerically compute the distribution of the results of such probabilistic calculations. PAAs provide a unifying framework to approach many problems arising in computational biology and elsewhere. We present five different applications, namely 1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting times, and clump sizes under hidden Markov background models; 2) exact analysis of window-based pattern matching algorithms; 3) sensitivity of filtration seeds used to detect candidate sequence alignments; 4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and 5) read length statistics of 454 and IonTorrent sequencing reads. The diversity of these applications indicates the flexibility and unifying character of the presented framework. While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method: We introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. This procedure is used for all five discussed applications and greatly simplifies the construction of PAAs. Implementations are available as part of the MoSDi package. Its application programming interface facilitates the rapid development of new applications based on the PAA framework.
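As an illustration of the DAA-to-PAA construction described above, the following is a minimal sketch for the first application (occurrence counts of a pattern in a random text). It is not the MoSDi API: the function names `kmp_automaton` and `occurrence_count_distribution` are hypothetical, and the text model is assumed to be i.i.d., which is the simplest finite-memory model. The deterministic automaton tracks the longest matched prefix of the pattern, each transition emits 1 on a match and 0 otherwise, and the arithmetic operation is addition. The joint distribution of (automaton state, accumulated value) is then propagated step by step.

```python
from collections import defaultdict


def kmp_automaton(pattern, alphabet):
    """Build delta[state][char] -> next state.

    A state is the length of the longest pattern prefix that is a
    suffix of the text read so far (overlapping matches allowed).
    """
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            s = pattern[:q] + c
            k = min(m, len(s))
            # Shrink k until pattern[:k] is a suffix of s (k = 0 always works).
            while k > 0 and pattern[:k] != s[-k:]:
                k -= 1
            delta[q][c] = k
    return delta


def occurrence_count_distribution(pattern, char_probs, n):
    """Distribution of the number of (overlapping) occurrences of
    `pattern` in an i.i.d. random text of length n.

    char_probs maps each character to its probability (assumed to sum to 1).
    """
    m = len(pattern)
    delta = kmp_automaton(pattern, list(char_probs))
    # Joint distribution over (automaton state, accumulated count).
    dist = {(0, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (q, count), p in dist.items():
            for c, pc in char_probs.items():
                q2 = delta[q][c]
                emission = 1 if q2 == m else 0  # emit 1 on a full match
                nxt[(q2, count + emission)] += p * pc  # operation: addition
        dist = nxt
    # Marginalize out the automaton state.
    result = defaultdict(float)
    for (_, count), p in dist.items():
        result[count] += p
    return dict(result)


d = occurrence_count_distribution("aa", {"a": 0.5, "b": 0.5}, 3)
```

For the pattern "aa" over a uniform two-letter alphabet and text length 3, the eight equally likely texts give zero occurrences five times, one occurrence twice (aab, baa), and two occurrences once (aaa), which matches the computed distribution. Replacing addition or the emissions with other operations and values is what would adapt the same scheme to other statistics.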
INDEX TERMS
Hidden Markov models, Automata, Computational modeling, Markov processes, Probabilistic logic, Bioinformatics, dynamic programming, Probabilistic automaton, text model, hidden Markov model, pattern matching, statistics, clump, string algorithm, analysis of algorithms, alignment seed, peptide mass fingerprinting, DNA sequencing
CITATION
T. Marschall, I. Herms, H. Kaltenbach, S. Rahmann, "Probabilistic Arithmetic Automata and Their Applications", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1737-1750, Nov.-Dec. 2012, doi:10.1109/TCBB.2012.109