Multi-allele species reconstruction using ASTRAL
Data files
Jun 08, 2023 version files 20.29 GB
- D1.estimatedgenetrees.tar.gz (3.12 GB)
- D1.species.trees.tar.gz (880.64 KB)
- model.200-5.1000000.0.000001-sequences.tar.gz (1.86 GB)
- model.200-5.1000000.0.000001.sp.tar.gz (3.95 MB)
- model.200-5.1000000.0.000001.tar.gz (1.62 GB)
- model.200-5.2000000.0.000001-sequences.tar.gz (2.71 GB)
- model.200-5.2000000.0.000001.sp.tar.gz (3.83 MB)
- model.200-5.2000000.0.000001.tar.gz (1.66 GB)
- model.200-5.500000.0.000001-sequences.tar.gz (1.68 GB)
- model.200-5.500000.0.000001.sp.tar.gz (3.82 MB)
- model.200-5.500000.0.000001.tar.gz (1.27 GB)
- multiind-1ind-2ind.tar.gz (6.37 GB)
- README.md (3.42 KB)
- true-sp-D1.tar.gz (596.45 KB)
- true-sp-D2.tar.gz (504.90 KB)
Jul 17, 2023 version files 37.76 GB
Abstract
Genome-wide phylogeny reconstruction is becoming increasingly common, and one driving factor behind these phylogenomic studies is the promise that the potential discordance between gene trees and the species tree can be modeled. Incomplete lineage sorting is one cause of discordance that bridges population genetic and phylogenetic processes. ASTRAL is a species tree reconstruction method that seeks to find the tree with minimum quartet distance to an input set of inferred gene trees. However, the published ASTRAL algorithm only works with one sample per species. To account for polymorphisms in present-day species, one can sample multiple individuals per species to create multi-allele datasets. Here, we introduce how ASTRAL can handle multi-allele datasets. We show that the quartet-based optimization problem extends naturally, and we introduce heuristic methods for building the search space specifically for the case of multi-individual datasets. We study the accuracy and scalability of the multi-individual version of ASTRAL-III using extensive simulation studies and compare it to NJst, the only other scalable method that can handle these datasets. We do not find strong evidence that using multiple individuals dramatically improves accuracy. When we study the trade-off between sampling more genes versus more individuals, we find that sampling more genes is more effective than sampling more individuals, even under conditions that we study where trees are shallow (median length: ≈ 1Ne) and ILS is extremely high.
Methods
We simulate two new datasets (see Table below):
- a heterogeneous dataset (D1) where many parameters are simultaneously changed and ILS levels are extremely high.
- a more homogeneous dataset (D2) where parameters are less varied and the amount of ILS is controlled to create three model conditions.
We use SimPhy to generate species trees and gene trees according to the MSC model. All replicates have 5 individuals per species.
D1. This dataset includes 330 replicates. The number of genes was uniformly sampled between 50 and 1000 per replicate. The number of species was uniformly sampled between 20 and 200. The species tree birth rate parameter is randomly sampled from a log-uniform distribution in [10⁻⁷, 10⁻⁶], and the death rate is sampled from a log-uniform distribution, bounded from below by 10⁻⁷ and bounded by the birth rate parameter from above. The population size is sampled from a uniform distribution in [10⁵, 10⁶]. We sampled a maximum species tree height for each replicate from a log-normal distribution with an expected value of 500000 generations (ranging between 0.19M and 1M in 90% of replicates).
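The D1 sampling scheme can be illustrated with a minimal Python sketch. This is not the SimPhy configuration used to build the dataset; it only mirrors the distributions described above. The function names, the dictionary keys, the fixed seed, and the choice to draw gene and species counts as integers are assumptions made for illustration. The maximum tree height is left out because the spread of its log-normal distribution is not stated here.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    # Clamp to guard against floating-point round-off at the bounds.
    return min(max(value, low), high)


def sample_d1_parameters(rng):
    """Sample one hypothetical D1-style replicate configuration."""
    birth = log_uniform(rng, 1e-7, 1e-6)
    # The death rate is bounded above by the birth rate of the same replicate.
    death = log_uniform(rng, 1e-7, birth)
    return {
        "n_genes": rng.randint(50, 1000),      # inclusive on both ends
        "n_species": rng.randint(20, 200),     # inclusive on both ends
        "birth_rate": birth,
        "death_rate": death,
        "pop_size": rng.uniform(1e5, 1e6),
        "individuals_per_species": 5,
    }


rng = random.Random(0)  # arbitrary seed, chosen only for repeatability
replicates = [sample_d1_parameters(rng) for _ in range(330)]
```

Drawing the birth rate first and passing it as the upper bound of the death-rate draw keeps the death rate from exceeding the birth rate in any replicate.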
D2. This dataset has three model conditions that vary the number of generations: 0.5M, 1M, or 2M. Each has 50 replicates with 200 species (and one outgroup) and 1000 genes, with the species birth rate set to 10⁻⁶ under a birth-only model. The population size is 200,000. We also create two versions of it where only one or two individuals per species are randomly sub-sampled.
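The random sub-sampling of individuals can be sketched as follows. This is only an illustration under stated assumptions: the individual labels (such as `sp1_0`), the function name, and the seed are hypothetical, and the sketch does not show how the retained individuals are then used to prune the gene trees or whether the same individuals are kept across all genes.

```python
import random


def subsample_individuals(individuals_by_species, k, rng):
    """Keep k randomly chosen individuals for every species."""
    kept = {}
    for species, individuals in individuals_by_species.items():
        # random.sample draws without replacement, so no individual repeats.
        kept[species] = sorted(rng.sample(individuals, k))
    return kept


# Hypothetical labels: three species with five individuals each.
full = {
    f"sp{s}": [f"sp{s}_{i}" for i in range(5)]
    for s in range(1, 4)
}

rng = random.Random(0)  # arbitrary seed, chosen only for repeatability
one_individual = subsample_individuals(full, 1, rng)
two_individuals = subsample_individuals(full, 2, rng)
```

Leaves that do not appear in the returned mapping would then be removed from each gene tree with a tree-manipulation tool of choice.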