This Article 
 Bibliographic References 
 Add to: 
A Novel Heuristic for Local Multiple Alignment of Interspersed DNA Repeats
April-June 2009 (vol. 6 no. 2)
pp. 180-189
Todd J. Treangen, Institut Pasteur, CNRS and UPMC University, France
Aaron E. Darling, University of California-Davis, Davis
Guillaume Achaz, UPMC University, CNRS, MNHN, IRD, UMR, Paris
Mark A. Ragan, University of Queensland and ARC Centre of Excellence, Brisbane
Xavier Messeguer, Technical University of Catalonia, Barcelona
Eduardo P.C. Rocha, Institut Pasteur, CNRS and UPMC University, France
Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from:

[1] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp.440-445, Mar. 2002.
[2] M. Brudno and B. Morgenstern, “Fast and Sensitive Alignment of Large Genomic Sequences,” Proc. IEEE CS Bioinformatics. Conf., vol. 1, pp.138-147, 2002.
[3] L. Noé and G. Kucherov, “Improved Hit Criteria for DNA Local Alignment,” BMC Bioinformatics, vol. 5, p. 149, Oct. 2004.
[4] T. Kahveci, V. Ljosa, and A.K. Singh, “Speeding up Whole-Genome Alignment by Indexing Frequency Vectors,” Bioinformatics, vol. 20, no. 13, pp.2122-2134, Sept. 2004.
[5] S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich, “REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale,” Nucleic Acids Research, vol. 29, no. 22, pp.4633-4642, Nov. 2001.
[6] J. Jurka, V.V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, and J. Walichiewicz, “Repbase Update, A Database of Eukaryotic Repetitive Elements,” Cytogenetic and Genome Research, vol. 110, nos. 1-4, pp.462-467, 2005.
[7] K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, pp.1053-1059, 2004.
[8] M. Li, B. Ma, and L. Zhang, “Superiority and Complexity of the Spaced Seeds,” Proc. Symp. Discrete Algorithms (SODA), pp.444-453, 2006.
[9] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” J. Computational Biology, vol. 12, no. 6, pp.847-861, 2005.
[10] J. Xu, D.G. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. Combinatorial Pattern Matching (CPM), pp.47-58, 2004.
[11] J. Flannick and S. Batzoglou, “Using Multiple Alignments to Improve Seeded Local Alignment Algorithms,” Nucleic Acids Research, vol. 33, no. 14, pp.4563-4577, 2005.
[12] K.R. Rasmussen, J. Stoye, and E.W. Myers, “Efficient q-Gram Filters for Finding All Epsilon-Matches over a Given Length,” J.Computational Biology, vol. 13, pp.189-203, 2005.
[13] L. Li, C.J. Stoeckert, and D.S. Roos, “OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes,” Genome Research, vol. 13, no. 9, pp.2178-2189, Sept. 2003.
[14] D.B. Jaffe, J. Butler, S. Gnerre, E. Mauceli, K. Lindblad-Toh, J.P. Mesirov, M.C. Zody, and E.S. Lander, “Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2,” Genome Research, vol. 13, no. 1, pp.91-96, Jan. 2003.
[15] C. Ane and M. Sanderson, “Missing the Forest for the Trees: Phylogenetic Compression and Its Implications for Inferring Complex Evolutionary Histories,” Systems Biology, vol. 54, no. 1, pp. I311-I317, 2005.
[16] M. Margulies and 55 Other Authors, “Genome Sequencing in Microfabricated High-Density Picolitre Reactors,” Nature, vol. 437, no. 7057, pp.376-380, 2005.
[17] D. Ryan, M. Rahimi, J. Lund, R. Mehta, and B.A.A. Parviz, “Toward Nanoscale Genome Sequencing,” Trends in Biotechnology, vol. 25, pp. 385-389, July 2007.
[18] R.G. Blazej, P. Kumaresan, and R.A. Mathies, “Microfabricated Bioprocessor for Integrated Nanoliter-Scale Sanger DNA Sequencing,” Proc. Nat'l Academy Sciences USA, vol. 103, no. 19, pp.7240-7245, May 2006.
[19] J. Shendure and H. Ji, “Next-Generation DNA Sequencing,” Nature Biotechnology, vol. 26, no. 10, pp.1135-1145, Oct. 2008.
[20] A.C.E. Darling, B. Mau, F.R. Blattner, and N.T. Perna, “Mauve: Multiple Alignment of Conserved Genomic Sequence with Rearrangements,” Genome Research, vol. 14, no. 7, pp.1394-403, 2004.
[21] M. Höhl, S. Kurtz, and E. Ohlebusch, “Efficient Multiple Genome Alignment,” Bioinformatics, vol. 18, no. 1, pp.S312-S320, 2002.
[22] T.J. Treangen and X. Messeguer, “M-GCAT: Interactively and Efficiently Constructing Large-Scale Multiple Genome Comparison Frameworks in Closely Related Species,” BMC Bioinformatics, vol. 7, p.433, Oct. 2006.
[23] C.N. Dewey and L. Pachter, “Evolution at the Nucleotide Level: The Problem of Multiple Whole-Genome Alignment,” Human Molecular Genetics, vol. 15, no. 1, pp. R51-56, Apr. 2006.
[24] M. Sammeth and J. Heringa, “Global Multiple-Sequence Alignment with Repeats,” Proteins, vol. 64, pp. 263-274, Apr. 2006.
[25] B. Raphael, D. Zhi, H. Tang, and P. Pevzner, “A Novel Method for Multiple Alignment of Sequences with Repeated and Shuffled Elements,” Genome Research, vol. 14, no. 11, pp.2336-2346, 2004.
[26] R.C. Edgar and E.W. Myers, “PILER: Identification and Classification of Genomic Repeats,” Bioinformatics, vol. 21, no. 1, pp. I152-I158, June 2005.
[27] S. Kumar and A. Filipski, “Multiple Sequence Alignment: In Pursuit of Homologous DNA Positions,” Genome Research, vol. 17, no. 2, pp.127-135, 2007.
[28] S. Schwartz, J.W. Kent, A. Smit, Z. Zhang, R. Baertsch, R.C. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, vol. 13, no. 1, pp.103-107, Jan. 2003.
[29] W.R. Pearson, “Rapid and Sensitive Sequence Comparison with FASTP and FASTA,” Methods in Enzymology, vol. 183, pp.63-98, 1990.
[30] M. Blanchette, W. Kent, C. Riemer, L. Elnitski, A.F. Smit, K.M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. Green, D. Haussler, and W. Miller, “Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner,” Genome Research, vol. 14, pp.708-715, 2004.
[31] B. Morgenstern, K. French, A. Dress, and T. Werner, “DIALIGN: Finding Local Similarities by Multiple Sequence Alignment,” Bioinformatics, vol. 14, pp.290-294, 1998.
[32] Y. Zhang and M.S. Waterman, “An Eulerian Path Approach to Local Multiple Alignment for DNA Sequences,” Proc. Nat'l Academy Sciences USA, vol. 102, no. 5, pp.1285-1290, 2005.
[33] M. Brudno, C.B. Do, G.M. Cooper, M.F. Kim, E. Davydov, N.C.S. Program, E.D. Green, A. Sidow, and S. Batzoglou, “LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA,” Genome Research., vol. 13, no. 4, pp.721-731, 2003.
[34] R. Szklarczyk and J. Heringa, “AuberGene-A Sensitive Genome Alignment Tool,” Bioinformatics, vol. 22, no. 12, pp.1431-1436, June 2006.
[35] L. Wang and T. Jiang, “On the Complexity of Multiple Sequence Alignment,” J. Computational Biology, vol. 1, no. 4, pp.337-348, 1994.
[36] J.D. Thompson, D.G. Higgins, and T. Gibson, “CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position Specific Gap Penalties and Weight Matrix Choice,” Nucleic Acids Research, vol. 22, no. 22, pp.4673-80, 1994.
[37] C. Notredame, D.G. Higgins, and J. Heringa, “T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment,” J.Molecular Biology, vol. 302, no. 1, pp.205-217, Sept. 2000.
[38] R. Edgar, “MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput,” Nucleic Acids Research., vol. 32, no. 5, pp. 1792-1797, 2004.
[39] C.B. Do, M.S. Mahabhashyam, M. Brudno, and S. Batzoglou, “ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment,” Genome Research, vol. 15, no. 2, pp.330-340, Feb. 2005.
[40] A.E. Darling, T.J. Treangen, L. Zhang, C. Kuiken, X. Messeguer, and N.T. Perna, “Procrastination Leads to Efficient Filtration for Local Multiple Alignment,” Algorithms in Bioinformatics, vol. 4175, pp.126-137, 2006.
[41] K.P. Choi, F. Zeng, L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, pp.1053-1059, 2004.
[42] R. Szklarczyk and J. Heringa, “Tracking Repeats Using Significance and Transitivity,” Bioinformatics, vol. 20, no. 1, pp.I311-I317, 2004.
[43] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp.3389-3402, Sept. 1997.
[44] W.J. Kent, “BLAT—The BLAST-Like Alignment Tool,” Genome Research., vol. 12, no. 4, pp.656-664, Apr. 2002.
[45] F. Chiaromonte, V.B. Yap, and W. Miller, “Scoring Pairwise Genomic Sequence Alignments,” Proc. Pacific Symp. Biocomputing, pp.115-126, 2002.
[46] Y.-K. Yu, J.C. Wootton, and S.F. Altschul, “The Compositional Adjustment of Amino Acid Substitution Matrices,” Proc. Nat'l Academy Sciences USA, vol. 100, no. 26, pp.15 688-15 693, Dec. 2003.
[47] Y. Yi-Kuo and F. Altschul, “The Construction of Amino Acid Substitution Matrices for the Comparison of Proteins with Non-Standard Compositions,” Bioinformatics, vol. 21, no. 7, pp.902-911, Apr. 2005.
[48] G. Lunter, “HMMoC a Compiler for Hidden Markov Models,” Bioinformatics, vol. 23, no. 18, pp.2485-2487, Sept. 2007.
[49] A.L. Price, N.C. Jones, and P.A. Pevzner, “De Novo Identification of Repeat Families in Large Genomes,” Bioinformatics, vol. 21, no. 1, pp. I351-I358, June 2005.
[50] G. Achaz, F. Boyer, E.P.C. Rocha, A. Viari, and E. Coissac, “Repseek, a Tool to Retrieve Approximate Repeats from Large DNA Sequences,” Bioinformatics, vol. 23, pp. 119-121, Oct. 2006.
[51] E.P. Rocha and A. Blanchard, “Genomic Repeats, Genome Plasticity and the Dynamics of Mycoplasma, Evolution,” Nucleic Acids Research, vol. 30, no. 9, pp.2031-2042, May 2002.
[52] J.D. Thompson, F. Plewniak, and O. Poch, “A Comprehensive Comparison of Multiple Sequence Alignment Programs,” Nucleic Acids Research, vol. 27, no. 13, pp.2682-2690, 1999.
[53] A. Prakash and M. Tompa, “Statistics of Local Multiple Alignments,” Bioinformatics, vol. 21, no. 1, pp. I344-I350, June 2005.

Index Terms:
Sequence alignment, genome comparison, DNA repeats, local multiple alignment, hidden Markov model, gapped extension.
Todd J. Treangen, Aaron E. Darling, Guillaume Achaz, Mark A. Ragan, Xavier Messeguer, Eduardo P.C. Rocha, "A Novel Heuristic for Local Multiple Alignment of Interspersed DNA Repeats," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 2, pp. 180-189, April-June 2009, doi:10.1109/TCBB.2009.9
Usage of this product signifies your acceptance of the Terms of Use.