Statistical Alignment with a Sequence Evolution Model Allowing Rate Heterogeneity along the Sequence
April-June 2009 (vol. 6 no. 2)
pp. 281-295
Ana Arribas-Gil, Universidad Carlos III de Madrid, Spain
Dirk Metzler, Johann Wolfgang Goethe-Universität, Germany
Jean-Louis Plouhinec, Institut de Transgénose, CNRS-IEM, France
We present a stochastic sequence evolution model to obtain alignments and estimate mutation rates between two homologous sequences. The model allows two possible evolutionary behaviors along a DNA sequence in order to determine conserved regions and take the heterogeneity of the sequence into account. In our model, the sequence is divided into slow and fast evolution regions. The boundaries between these regions are not known, and it is our aim to detect them. The evolution model is based on a fragment insertion and deletion process working on fast regions only and on a substitution process working on fast and slow regions with different rates. This model induces a pair hidden Markov structure at the level of alignments, thus making efficient statistical alignment algorithms possible. We propose two complementary estimation methods, namely, a Gibbs sampler for Bayesian estimation and a stochastic version of the EM algorithm for maximum likelihood estimation. Both algorithms involve the sampling of alignments. We propose a partial alignment sampler, which is computationally less expensive than the typical whole alignment sampler. We show the convergence of the two estimation algorithms when used with this partial sampler. Our algorithms provide consistent estimates for the mutation rates and plausible alignments and sequence segmentations on both simulated and real data.
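
To make the idea of slow and fast regions concrete, here is a minimal sketch of a much simpler, related problem. It is not the model or the algorithms of this article: it assumes the alignment is already fixed and contains no gaps (so the fragment insertion and deletion process is ignored), it assumes Jukes-Cantor substitution probabilities, and it uses a plain two-state hidden Markov model with Viterbi decoding instead of a Gibbs sampler or a stochastic EM algorithm. The function names and the parameters slow, fast, and stay are hypothetical values chosen only for illustration.

```python
import math


def jc_match_prob(distance):
    """Jukes-Cantor probability that an aligned site shows the same base,
    given the expected number of substitutions per site (distance)."""
    return 0.25 + 0.75 * math.exp(-4.0 * distance / 3.0)


def viterbi_segment(seq_a, seq_b, slow=0.05, fast=0.8, stay=0.95):
    """Label each column of an ungapped pairwise alignment as 'S' (slow)
    or 'F' (fast) by Viterbi decoding of a two-state HMM.

    slow, fast: assumed substitution distances of the two region types.
    stay: assumed probability of remaining in the same region type.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length (no gaps)")
    if not seq_a:
        return ""

    states = ("S", "F")
    distances = {"S": slow, "F": fast}
    log_stay = math.log(stay)
    log_switch = math.log(1.0 - stay)

    def log_emit(state, x, y):
        # Joint probability of the column (x, y): uniform base
        # frequency 1/4 times the Jukes-Cantor transition probability.
        p_same = jc_match_prob(distances[state])
        p = p_same if x == y else (1.0 - p_same) / 3.0
        return math.log(0.25 * p)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # Initialisation: both region types equally likely at the first column.
    score = {s: math.log(0.5) + log_emit(s, seq_a[0], seq_b[0]) for s in states}
    back = []

    # Recursion over the remaining columns.
    for x, y in zip(seq_a[1:], seq_b[1:]):
        new_score, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda p: score[p] + log_trans(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                              + log_emit(cur, x, y))
        score = new_score
        back.append(pointers)

    # Traceback from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    a = "A" * 45
    b = "A" * 15 + "C" * 15 + "A" * 15
    print(viterbi_segment(a, b))
```

In this toy setting the rates and the alignment are given, so only the segmentation is inferred. In the article, by contrast, the alignment, the mutation rates, and the region boundaries are all unknown and are estimated jointly.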

Index Terms:
Markov processes, probabilistic algorithms, mathematics and statistics, sequence evolution, biology and genetics.
Ana Arribas-Gil, Dirk Metzler, Jean-Louis Plouhinec, "Statistical Alignment with a Sequence Evolution Model Allowing Rate Heterogeneity along the Sequence," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 2, pp. 281-295, April-June 2009, doi:10.1109/TCBB.2007.70246