ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes
Data files
Jun 08, 2023 version files 16.83 GB
- alignments-1000taxon-01-16.tar.bz
- alignments-1000taxon-17-32.tar.bz
- alignments-1000taxon-33-50.tar.bz
- alignments-100taxon.tar.bz
- alignments-10taxon.tar
- alignments-200taxon-10M.tar.bz
- alignments-200taxon-2M.tar.bz
- alignments-200taxon-500K.tar.bz
- alignments-500taxong.tar
- alignments-50taxon.tar.bz
- astral.zip
- ca-ml-ft.zip
- estimated-gene-trees.tar.bz
- mpest.zip
- njst.zip
- README.md
- species-trees.zip
- true-gene-trees.tar.bz
- true-specis-trees.tar.bz
- xi-gene-trees.zip
Abstract
Motivation: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed ‘bipartitions’. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent.
Results: We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL’s running time is O(n^2k|X|^2), and ASTRAL-II’s running time is O(nk|X|^2), where n is the number of species, k is the number of loci and X is the set of allowed bipartitions for the search space.
Methods
We used SimPhy (https://github.com/adamallo/SimPhy) to simulate species trees and gene trees and used Indelible (Fletcher and Yang, 2009) to simulate nucleotide sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.
We simulated 11 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We used SimPhy to simulate species trees according to the Yule process, characterized by the number of taxa, maximum tree length, and the speciation rate (this combination defines a model condition).
Dataset I: In six model conditions, we fixed the number of taxa to 200 and varied the tree length (500 K, 2 M and 10 M generations) and the speciation rate (1e-6 and 1e-7 per generation). The tree length determines the amount of ILS: a lower length gives shorter branches and therefore higher levels of ILS. The speciation rate determines whether speciation events tend to happen close to the tips (1e-6) or close to the base (1e-7). Different tree shapes (i.e. combinations of tree length and speciation rate) produce different levels of ILS, ranging from relatively low [roughly 10% distance between true gene trees and the species tree, measured by the Robinson–Foulds (RF) distance] to very high (roughly 70% RF).
Dataset II: We fixed the tree shape to 2 M/1e-6 and set the number of taxa to 10, 50, 100, 200, 500 and 1000.
The model condition with 200 taxa and the 2 M/1e-6 tree shape appears in both datasets.
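The ILS levels above are reported as RF distances between true gene trees and the species tree. As an illustration only, the following sketch computes a normalized RF distance for trees written as nested Python tuples of taxon names. The tuple encoding and the function names are assumptions made for this example; they are not part of the dataset or of the tools used in the study, and real analyses would read the Newick files with a phylogenetics library.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor taxon, so the same split gets the same key regardless of rooting.
    """
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)  # arbitrary but deterministic reference taxon
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (a single leaf on one side, or the whole tree).
        if 1 < len(clade) < len(all_taxa) - 1:
            splits.add(clade if anchor not in clade else all_taxa - clade)
        return clade

    visit(tree)
    return splits


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of bipartitions divided by their total count."""
    splits_a, splits_b = bipartitions(tree_a), bipartitions(tree_b)
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0


# Hypothetical five-taxon example: the trees share one of two internal splits.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = ((("A", "C"), "B"), ("D", "E"))
rf = normalized_rf(species_tree, gene_tree)
```

Under this normalization, a value of 0.1 corresponds to the "roughly 10%" condition and 0.7 to the "roughly 70%" condition.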
For each model condition, we simulated 50 species trees, forming 50 replicates. On each species tree, 1000 gene trees were simulated according to the multi-species coalescent model with the population size fixed to 200 000.
We simulated indel-free gene alignments using Indelible under the GTR + Γ model. First, for each replicate, two parameters, μ and σ, were drawn uniformly from (5.7, 7.3) and (0, 0.3), respectively. Then, the sequence length for each gene in that replicate was drawn from a log-normal distribution with parameters μ and σ (the average sequence length is uniformly distributed between 300 bp and 1500 bp). GTR + Γ parameters were drawn from Dirichlet distributions whose parameters were estimated using ML from a collection of real biological datasets (details given in the paper).
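The two-level sampling of sequence lengths described above can be sketched as follows. This is a minimal illustration using Python's standard library, not the script used to generate the dataset; the function name, the seed and the rounding of lengths to whole base pairs are assumptions.

```python
import math
import random


def sample_replicate_lengths(n_genes, rng):
    """Draw per-replicate (mu, sigma), then one log-normal length per gene."""
    mu = rng.uniform(5.7, 7.3)     # log-scale location, one value per replicate
    sigma = rng.uniform(0.0, 0.3)  # log-scale spread, one value per replicate
    # Assumption: lengths are rounded to whole base pairs, with a floor of 1.
    lengths = [max(1, round(rng.lognormvariate(mu, sigma))) for _ in range(n_genes)]
    return mu, sigma, lengths


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mu, sigma, lengths = sample_replicate_lengths(1000, rng)

# The bounds on mu map to roughly 300 bp and 1500 bp on the original scale.
low_bp, high_bp = math.exp(5.7), math.exp(7.3)
```

Because μ is drawn once per replicate, all 1000 genes in a replicate share the same typical length, while σ controls how much individual genes deviate from it.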
We used FastTree to estimate the 550 000 gene trees ranging from 10 to 1000 species.
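The figure of 550 000 gene trees follows from the design described above: 11 distinct model conditions, 50 replicates each, and 1000 genes per replicate. The short sketch below enumerates the conditions to make the overlap between the two datasets explicit; the tuple layout and variable names are illustrative assumptions.

```python
# Each model condition is (number of taxa, tree length, speciation rate).
dataset_1 = {
    (200, length, rate)
    for length in ("500K", "2M", "10M")
    for rate in (1e-6, 1e-7)
}
dataset_2 = {(n_taxa, "2M", 1e-6) for n_taxa in (10, 50, 100, 200, 500, 1000)}

conditions = dataset_1 | dataset_2  # union removes the shared condition
shared = dataset_1 & dataset_2      # the condition that appears in both datasets

replicates_per_condition = 50
genes_per_replicate = 1000
total_gene_trees = len(conditions) * replicates_per_condition * genes_per_replicate
```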