Assessing the Discordance of Multiple Sequence Alignments
October-December 2009 (vol. 6 no. 4)
pp. 542-551
Multiple sequence alignments have wide applicability in many areas of computational biology, including comparative genomics, functional annotation of proteins, gene finding, and modeling evolutionary processes. Because of the computational difficulty of multiple sequence alignment and the availability of numerous tools, it is critical to be able to assess the reliability of multiple alignments. We present a tool called StatSigMA to assess whether multiple alignments of nucleotide or amino acid sequences are contaminated with one or more unrelated sequences. StatSigMA has numerous applications, two of which are distinguishing homologous sequences from nonhomologous ones and comparing alignments produced by different multiple alignment tools. We present examples of both types of applications.

Index Terms:
Multiple sequence alignment, discordance, alignment accuracy, Karlin-Altschul statistics, biology and genetics, life and medical sciences, computer applications.
Amol Prakash, Martin Tompa, "Assessing the Discordance of Multiple Sequence Alignments," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 4, pp. 542-551, Oct.-Dec. 2009, doi:10.1109/TCBB.2007.70271