Issue No.04 - July-Aug. (2013 vol.10)
pp: 939-956
Mukul S. Bansal, Comput. Sci. & Artificial Intell. Lab., Massachusetts Inst. of Technol., Cambridge, MA, USA
Oliver Eulenstein, Dept. of Comput. Sci., Iowa State Univ., Ames, IA, USA
ABSTRACT
The use of genomic data sets for phylogenetics is complicated by the fact that evolutionary processes such as gene duplication and loss, or incomplete lineage sorting (deep coalescence), cause incongruence among gene trees. One well-known approach to this complication is gene tree parsimony, which, given a collection of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, a lack of efficient algorithms has limited the use of this approach. Here, we present efficient algorithms for local search heuristics based on subtree prune and regraft (SPR) and tree bisection and reconnection (TBR) for gene tree parsimony under the 1) duplication, 2) loss, 3) duplication-loss, and 4) deep coalescence reconciliation costs. These novel algorithms improve upon the time complexities of previous algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. Our algorithms provide a substantial improvement in runtime and scalability compared to previous implementations and enable large-scale gene tree parsimony analyses using any of the four reconciliation costs. Our algorithms have been implemented in the software packages DupTree and iGTP, and have already been used to perform several compelling phylogenetic studies.
INDEX TERMS
Vegetation, Search problems, Phylogeny, Bioinformatics, Algorithm design and analysis, Complexity theory, Genomics, phylogenetics, Gene tree parsimony, gene duplication, gene loss, incomplete lineage sorting, minimizing deep coalescences (MDC), phylogenomics
CITATION
Mukul S. Bansal, Oliver Eulenstein, "Algorithms for Genome-Scale Phylogenetics Using Gene Tree Parsimony", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 4, pp. 939-956, July-Aug. 2013, doi:10.1109/TCBB.2013.103