Issue No.04 - July/August (2011 vol.8)
pp: 1108-1119
Li-San Wang , University of Pennsylvania, Philadelphia
Jim Leebens-Mack , University of Georgia, Athens
P. Kerr Wall , BASF Plant Science, Research Triangle Park
Kevin Beckmann , The Huck Institutes of Life Sciences and Pennsylvania State University
Claude W. dePamphilis , The Huck Institutes of Life Sciences and Pennsylvania State University
Tandy Warnow , University of Texas at Austin, Austin
Multiple sequence alignment is typically the first step in estimating phylogenetic trees, on the assumption that as alignments improve, so will phylogenetic reconstructions. Over the last decade or so, new multiple sequence alignment methods have been developed to improve comparative analyses of protein structure, but these new methods have not typically been used in phylogenetic analyses. In this paper, we report on a simulation study that we performed to evaluate the consequences of using these new multiple sequence alignment methods in terms of the resultant phylogenetic reconstruction. We find that while alignment accuracy is positively correlated with phylogenetic accuracy, the amount of improvement in phylogenetic estimation that results from an improved alignment can range from quite small to substantial. We observe that phylogenetic accuracy is most highly correlated with alignment accuracy when sequences are most difficult to align, and that variation in alignment accuracy can have little impact on phylogenetic accuracy when alignment error rates are generally low. We discuss these observations and implications for future work.
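The abstract does not state how alignment accuracy is measured. As a minimal sketch, assuming a sum-of-pairs style score (a common choice in alignment benchmarking, not necessarily the measure used in the paper), the fraction of residue pairs in a reference alignment that an estimated alignment recovers could be computed as follows. The function names `aligned_pairs` and `sp_score` are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that an alignment places in the same column.

    `alignment` is a list of equal-length gapped strings, with '-' as the gap
    character. Each pair is ((seq_i, residue_i), (seq_j, residue_j)) with
    seq_i < seq_j, where residue_k counts ungapped positions in sequence k.
    """
    pairs = set()
    counters = [0] * len(alignment)  # next ungapped residue index per sequence
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != '-':
                present.append((s, counters[s]))
                counters[s] += 1
        # Every two residues sharing this column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(estimated, reference):
    """Fraction of reference residue pairs also present in the estimated alignment."""
    ref = aligned_pairs(reference)
    est = aligned_pairs(estimated)
    return len(ref & est) / len(ref) if ref else 1.0


if __name__ == "__main__":
    # Toy sequences invented for illustration; they are not data from the study.
    reference = ["AC-GT", "ACAGT"]
    estimated = ["ACG-T", "ACAGT"]
    print(sp_score(estimated, reference))
```

In a simulation setting, the reference alignment is known exactly, so a score like this can be paired with a tree-error measure for each replicate to examine how the two vary together.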
Simulation, biology and genetics, multiple protein sequence alignment, phylogeny reconstruction.
Li-San Wang, Jim Leebens-Mack, P. Kerr Wall, Kevin Beckmann, Claude W. dePamphilis, Tandy Warnow, "The Impact of Multiple Protein Sequence Alignment on Phylogenetic Estimation", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 4, pp. 1108-1119, July/August 2011, doi:10.1109/TCBB.2009.68