Data from: ASTRAL: genome-scale coalescent-based species tree estimation
Data files
Jan 05, 2024 version files 590.26 MB
- biological.zip
- estimatedgenetrees.zip
- README.md
- sequencedata.zip
- simulation-scripts.zip
- truetrees.zip
Abstract
Species trees provide insight into basic biology, including the mechanisms of evolution and how it modifies biomolecular function and structure, biodiversity and co-evolution between genes and species. Yet, gene trees often differ from species trees, creating challenges to species tree estimation. One of the most frequent causes for conflicting topologies between gene trees and species trees is incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent. While many methods have been developed to estimate species trees from multiple genes, some of which have statistical guarantees under the multi-species coalescent model, existing methods are too computationally intensive for use with genome-scale analyses or have been shown to have poor accuracy under some realistic conditions.
Results: We present ASTRAL, a fast method for estimating species trees from multiple genes. ASTRAL is statistically consistent, can run on datasets with thousands of genes and has outstanding accuracy—improving on MP-EST and the population tree from BUCKy, two statistically consistent leading coalescent-based methods. ASTRAL is often more accurate than concatenation using maximum likelihood, except when ILS levels are low or there are too few gene trees.
README: ASTRAL: genome-scale coalescent-based species tree estimation
https://doi.org/10.5061/dryad.ht76hdrp0
This repository includes both simulated and biological datasets.
Description of the data and file structure
The following datasets are used in the ASTRAL paper shown above. All these archive files include README files that describe their content.
biological.zip:
This file includes:
1. our estimated gene trees on alignments provided to us by authors of Song et al, 2012, PNAS,
2. our estimated species trees on the same dataset.
Our paper re-analyzes two biological datasets.
Song et al. dataset
We obtained gene alignments from Song et al. and re-estimated gene trees and species trees.
The following files are included in mammals.zip:
mammals-alignments.zip contains all the alignments that we obtained from Song et al.
mammals-genetreess.zip contains gene trees that we estimated. For each gene, we include three files:
- RAxML_bipartitions.final.f200 is the best ML tree with support values drawn on it based on 200 bootstrap replicates.
- RAxML_bootstrap.all includes 200 replicates of bootstrapping using RAxML.
- RAxML_bootstrap.all.extra is related to the gene resampling procedure. When gene resampling bootstrapping was used, some genes needed more than 200 bootstrap replicates. Those are included in RAxML_bootstrap.all.extra files (thus the first 200 replicates are the same as in RAxML_bootstrap.all, but some genes have more replicates).
424.[mpest/astral].mlbs: the species trees estimated based on these 424 gene trees.
Note that the original Song et al dataset has 447 genes, but we removed 23 genes for reasons described in the paper.
Chiari et al. dataset
All the gene data related to this dataset are already available on Dryad.
truetrees.zip:
The model species tree and the true gene trees simulated based on the mammalian dataset of Song et al, 2012, PNAS.
The following files are available (all in newick format):
model-species-tree: The model species tree used for simulation.
The following are the true gene trees simulated using the coalescent process based on the model tree. Branch lengths in the model species tree are multiplied by 2 or 5, or divided by 2 or 5, to create alternative levels of ILS.
- true-trees-1X
- true-trees-scaled2down
- true-trees-scaled2up
- true-trees-scaled5down
- true-trees-scaled5up
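The branch-length scaling described above can be illustrated with a minimal Python sketch. It is not the script used to produce these files; the function name `scale_branch_lengths` is hypothetical, and the regular expression assumes plain newick strings whose taxon labels contain no colons or quoted text.

```python
import re

# A branch length is a colon followed by a number, optionally in
# scientific notation (e.g. ":0.5" or ":1e-3").
_BRANCH_LENGTH = re.compile(r":(\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)")


def scale_branch_lengths(newick: str, factor: float) -> str:
    """Return a copy of `newick` with every branch length multiplied by `factor`."""

    def _scale(match: re.Match) -> str:
        return ":" + repr(float(match.group(1)) * factor)

    return _BRANCH_LENGTH.sub(_scale, newick)


# Toy tree, not taken from the dataset.
toy_tree = "((A:1.0,B:1.0):0.5,C:1.5);"
doubled = scale_branch_lengths(toy_tree, 2.0)  # branches multiplied by 2
halved = scale_branch_lengths(toy_tree, 0.5)   # branches divided by 2
```

Under the multi-species coalescent, shorter branches in the species tree generally produce more ILS, so dividing branch lengths raises the ILS level and multiplying lowers it.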
sequencedata.zip:
Sequence data simulated on the true gene trees (mammalian dataset).
All the simulated alignments for various levels of ILS are given here. Each zip file is a collection of alignments (.fasta), which make up the content of replicates. Only full alignments are given here. Alignments are trimmed to their first 500bp or 1000bp to create various model conditions with varying phylogenetic signals.
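Since only the full alignments are provided, the trimmed versions have to be recreated. A minimal Python sketch of that trimming step follows. It assumes standard FASTA files in which all sequences are already aligned; the helper names (`read_fasta`, `trim_alignment`, `write_fasta`) are hypothetical and not part of the deposited scripts.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def trim_alignment(records, length):
    """Keep only the first `length` columns of every aligned sequence."""
    return [(name, seq[:length]) for name, seq in records]


def write_fasta(records):
    """Serialize (name, sequence) pairs back to FASTA text, one line per sequence."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy alignment, not taken from the dataset; use 500 or 1000 on the real files.
toy = ">taxon1\nACGTACGT\n>taxon2\nTTTTGGGG\n"
trimmed = write_fasta(trim_alignment(read_fasta(toy), 5))
```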
estimatedgenetrees.zip:
This file gives gene trees estimated using RAxML on alignments of length 1000 and 500 (mammalian dataset).
All gene trees, including their bootstrap replicates, are provided in this file. The zip file contains a set of other zip files, each corresponding to a different ILS level. Each of these zip files consists of a collection of files with the following format:
- [gene id].[alignment length]-BestML.tre
- [gene id].[alignment length]-bp.MLBS.gz
BestML is a newick tree file that contains the maximum likelihood tree returned by RAxML (best of 10 runs). MLBS.gz includes the set of 200 bootstrap replicates for each gene (note that these files are compressed using gzip and need to be uncompressed using gunzip).
- gene_ids.txt: Note that model conditions in the paper are defined by the number of genes in addition to the ILS level and the alignment length. The gene trees provided here are the same for model conditions that differ only in the number of genes. Thus, a particular gene id can be used in model conditions with 25 genes, 50 genes, and so on. The gene ids assigned to each model condition are shown in gene_ids.txt. For example, gene id 2342 is used in replicate 12 of the 200-gene model condition, replicate 6 of the 400-gene model condition, and replicate 3 of the 800-gene model condition.
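As an alternative to running gunzip first, the compressed bootstrap files can be read directly. The Python sketch below assumes each MLBS.gz file holds one newick tree per line; that layout is an assumption rather than something stated in this README, and the function name `read_bootstrap_trees` is hypothetical.

```python
import gzip


def read_bootstrap_trees(path):
    """Return the newick strings stored in a gzip-compressed bootstrap file.

    Assumes one tree per line; blank lines are skipped.
    """
    with gzip.open(path, "rt") as handle:
        return [line.strip() for line in handle if line.strip()]
```

With the files described above, `len(read_bootstrap_trees(path))` would be expected to return 200 for each gene.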
Sharing/Access information
We acknowledge the help of Bastien Boussau, who performed these simulations for another study and made them available to us for this paper.
Code/Software
We used the script given here in the mammalian dataset: https://www.ideals.illinois.edu/items/55771
Methods
Availability and implementation: ASTRAL is available in open source form at https://github.com/smirarab/ASTRAL/. Datasets studied in this article are available at http://www.cs.utexas.edu/users/phylo/datasets/astral. Contact: warnow@illinois.edu Supplementary information: Supplementary data are available at Bioinformatics online.