Uncovering Hidden Phylogenetic Consensus in Large Data Sets
July/August 2011 (vol. 8 no. 4)
pp. 902-911
Nicholas D. Pattengale, Sandia National Laboratories, Albuquerque
Andre J. Aberer, TUM, Munich
Krister M. Swenson, University of Ottawa/Université du Québec à Montréal, Montreal
Alexandros Stamatakis, TUM, Munich
Bernard M.E. Moret, Ecole Polytechnique Federale de Lausanne
Many of the steps in phylogenetic reconstruction can be confounded by "rogue" taxa: taxa that cannot be placed with assurance anywhere within the tree, indeed, whose location within the tree varies with almost any choice of algorithm or parameters. Phylogenetic consensus methods, in particular, are known to suffer from this problem. In this paper, we provide a novel framework to define and identify rogue taxa. In this framework, we formulate a bicriterion optimization problem, the relative information criterion, that models the net increase in useful information present in the consensus tree when certain taxa are removed from the input data. We also provide an effective greedy heuristic to identify a subset of rogue taxa and use this heuristic in a series of experiments, with both pathological examples from the literature and a collection of large biological data sets. As the presence of rogue taxa in a set of bootstrap replicates can lead to deceivingly poor support values, we propose a procedure to recompute support values in light of the rogue taxa identified by our algorithm; applying this procedure to our biological data sets caused a large number of edges to move from "unsupported" to "supported" status, indicating that many existing phylogenies should be recomputed and reevaluated to reduce any inaccuracies introduced by rogue taxa. We also discuss the implementation issues encountered while integrating our algorithm into RAxML v7.2.7, particularly those dealing with scaling up the analyses. This integration enables practitioners to benefit from our algorithm in the analysis of very large data sets (up to 2,500 taxa and 10,000 trees, although we present the results of even larger analyses).
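
The abstract does not spell out the heuristic, so the following is only an illustrative sketch of the general idea of greedily dropping taxa to recover consensus, not the authors' algorithm or the RAxML implementation. It assumes rooted trees encoded as sets of clusters (each cluster a frozenset of taxon names), a majority-rule consensus, and a simplified score equal to the number of nontrivial consensus clusters; that score stands in for, and is not, the relative information criterion. All function names are hypothetical.

```python
from collections import Counter

def majority_clusters(trees):
    """Clusters that occur in strictly more than half of the trees."""
    counts = Counter(cluster for tree in trees for cluster in tree)
    return {c for c, n in counts.items() if 2 * n > len(trees)}

def prune(tree, taxon, taxa):
    """Remove `taxon` from every cluster and discard clusters that
    become trivial (a single taxon, or all remaining taxa)."""
    remaining = len(taxa) - 1
    pruned = {cluster - {taxon} for cluster in tree}
    return {c for c in pruned if 1 < len(c) < remaining}

def greedy_rogue_search(trees, taxa):
    """Repeatedly drop the taxon whose removal most increases the number
    of majority-rule clusters; stop when no removal improves the score.
    (Assumed stand-in objective, not the paper's criterion.)"""
    trees = [set(t) for t in trees]
    taxa = set(taxa)
    dropped = []
    best = len(majority_clusters(trees))
    while len(taxa) > 3:
        candidates = []
        for taxon in sorted(taxa):  # sorted for deterministic tie-breaking
            pruned = [prune(t, taxon, taxa) for t in trees]
            candidates.append((len(majority_clusters(pruned)), taxon, pruned))
        score, taxon, pruned = max(candidates, key=lambda c: c[0])
        if score <= best:
            break
        best, trees = score, pruned
        taxa.remove(taxon)
        dropped.append(taxon)
    return dropped, best

def clusters(*groups):
    """Helper for the toy example: build a tree from strings of taxa."""
    return {frozenset(g) for g in groups}

# Toy data (invented for illustration): four replicates that agree on the
# backbone ((((A,B),C),D),E) while taxon R attaches in a different place
# in each replicate.
example_trees = [
    clusters("AR", "ABR", "ABCR", "ABCDR"),
    clusters("AB", "CR", "ABCR", "ABCDR"),
    clusters("AB", "ABC", "DR", "ABCDR"),
    clusters("AB", "ABC", "ABCD", "ER"),
]
initial_score = len(majority_clusters(example_trees))
dropped, final_score = greedy_rogue_search(example_trees, "ABCDER")
print(initial_score, dropped, final_score)  # 2 ['R'] 3
```

In this toy input, removing the wandering taxon R raises the number of majority-rule clusters from 2 to 3, after which no further removal helps and the search stops. A real analysis would work on unrooted bipartitions and would weigh the information gained against the taxa lost, as the bicriterion formulation in the abstract describes.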

Index Terms:
Phylogeny, consensus methods, bootstrapping, support values, MAST.
Nicholas D. Pattengale, Andre J. Aberer, Krister M. Swenson, Alexandros Stamatakis, Bernard M.E. Moret, "Uncovering Hidden Phylogenetic Consensus in Large Data Sets," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 4, pp. 902-911, July-Aug. 2011, doi:10.1109/TCBB.2011.28