Data from: Fast coalescent-based computation of local branch support from quartet frequencies
Data files
Mar 29, 2024 version files 1.68 GB
- ASTRALII-BL.tar.gz
- ASTRALII-pp.tar.gz
- ASTRALII.tar.gz
- Avian.tar.gz
- Biological.tar.gz
- README.md
Abstract
Species tree reconstruction is complicated by effects of incomplete lineage sorting, commonly modeled by the multi-species coalescent model (MSC). While there has been substantial progress in developing methods that estimate a species tree given a collection of gene trees, less attention has been paid to fast and accurate methods of quantifying support. In this article, we propose a fast algorithm to compute quartet-based support for each branch of a given species tree with regard to a given set of gene trees. We then show how the quartet support can be used in the context of the MSC to compute (1) the local posterior probability (PP) that the branch is in the species tree and (2) the length of the branch in coalescent units. We evaluate the precision and recall of the local PP on a wide set of simulated and biological datasets, and show that it has very high precision and improved recall compared with multi-locus bootstrapping. The estimated branch lengths are highly accurate when gene tree estimation error is low, but are underestimated when gene tree estimation error increases. Computation of both the branch length and local PP is implemented as new features in ASTRAL.
README
Simulations
There are two simulated datasets that we used in this paper.
ASTRALII dataset.
This dataset contains estimated gene trees, true gene trees, true species trees, and inferred species trees with ASTRAL, RAxML, and NJST. Each archive contains a directory structure corresponding to:
- Model conditions (e.g., `model.200.10000000.0.0000001`), each of which includes one folder per replicate `[rep]`.
- `k` is the number of genes: 1000, 200, or 50.
- `method` is one of `astral`, `njst`, or `concat` and refers to the method used for inferring the species tree.
- Two types of gene trees `[gt]` are used: true gene trees (`true`) and estimated gene trees (`half`). Note that `_[gt]` is not specified for the `concat` method because it is irrelevant.
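As a rough illustration of the naming scheme above, the following Python sketch enumerates the combinations of `[method]`, `[k]`, and `[gt]` that the placeholders allow. The function name `analysis_labels` and the use of `None` for the missing `[gt]` of `concat` are assumptions made for illustration; they are not part of the archives.

```python
from itertools import product

GENE_COUNTS = (1000, 200, 50)            # allowed values of [k]
METHODS = ("astral", "njst", "concat")   # allowed values of [method]
GENE_TREE_TYPES = ("true", "half")       # allowed values of [gt]


def analysis_labels():
    """Yield (method, k, gt) tuples; gt is None for concat."""
    for method, k in product(METHODS, GENE_COUNTS):
        if method == "concat":
            # [gt] is not specified for the concat method.
            yield (method, k, None)
        else:
            for gt in GENE_TREE_TYPES:
                yield (method, k, gt)


labels = list(analysis_labels())
```

With three gene counts, two gene tree types each for `astral` and `njst`, and none for `concat`, this gives 3 × (2 + 2 + 1) = 15 combinations per replicate. Whether every combination is present in a given archive should be checked against the files themselves.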
Files:
- `ASTRALII-BL.tar.gz`: The branch length estimations.
  - `model.200.10000000.0.0000001/01/astral-bl-[k]-[gt].txt`: A file where each row is an internal branch, with branch lengths given in coalescent units (first column) and in number of generations (third column).
  - `model[model]/[rep]/astral/astral_[k]_[gt]_sp.nwk`: The tree with branch lengths in coalescent units for internal branches. Ignore terminal branch lengths, which are in generation time.
  - `model[model]/[rep]/astral/astral_[k]_[gt]_sp_Stat`: The ASTRAL log file.
- `ASTRALII-pp.tar.gz`: The posterior probability estimate files. These are the species trees annotated with posterior probabilities (Newick format). Files are of type:
  - `model.[model]/[rep]/astral/[method]_[k]_[gt]`: full ASTRAL annotation (like `-t 12`).
    - Note: the poorly named `model.[model]/[rep]/astral/[method]_[k]_[gt]_sp` files correspond to the true species trees, scored.
  - `model.[model]/[rep]/astral/[method]_[k]_[gt]Stat`: ASTRAL log files.
- `ASTRALII.tar.gz`: Raw simulated datasets. Files are of type `data/ASTRALII/200-taxa/model.[model]/[rep]/*` and include:
  - True (simulated) species trees (`s_tree.trees`)
  - True (simulated) gene trees (`truegenetrees`)
  - Estimated gene trees (FastTree), removing those with low resolution (`estimatedgenetre.halfresolved`)
  - Inferred species trees with ASTRAL (`astral-v474-p1`), RAxML (`concatenatedtree`), and NJST (`njst`) from 50, 200, or 1000 genes (the 50- and 200-gene sets are the first 50 and 200 trees from the `estimatedgenetre.halfresolved` or `truegenetrees` files).
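The branch length table described above (`astral-bl-[k]-[gt].txt`) could be read along the following lines. This is a minimal sketch that assumes the columns are separated by whitespace and that the first and third columns hold plain numbers; the README does not state the delimiter or the content of the remaining columns, and the function name `read_branch_lengths` is hypothetical. The sample rows are made-up values for illustration only, not data from the archive.

```python
def read_branch_lengths(rows):
    """Parse (coalescent units, generations) pairs from text rows.

    Assumes whitespace-separated columns: column 1 is the branch length
    in coalescent units and column 3 is the length in generations.
    """
    pairs = []
    for row in rows:
        fields = row.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        pairs.append((float(fields[0]), float(fields[2])))
    return pairs


# Made-up rows for illustration only (not taken from the archive).
example_rows = ["0.5 0.0 1000", "", "1.25 0.0 2500"]
example_pairs = read_branch_lengths(example_rows)
```

For a real file, an open file handle can be passed directly, since iterating over it yields one row per line.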
Avian simulated dataset.
This dataset contains directories of the form `noscale.1000g.[bp]/R[rep]/`, where `[bp]` is the sequence length (250, 500, 1000, or 1500). Each directory contains:
- `avian-astral.tre.blen`: Output of ASTRAL applied to bestML gene trees with correct branch lengths.
- `avian-astral.tre.blen.err`: ASTRAL log file when computing BL, applied to the bestML input tree.
- `astral-bl.nwk`: ASTRAL tree with full annotations.
- `genetrees.gt`: 1000 bestML gene trees.
- `astral/BS.[bsrep].tre`: MLBS ASTRAL results, bootstrap replicate numbered `[bsrep]`. 200 replicates are done.
- `astral/Best.tre`: ASTRAL applied to bestML gene trees with no branch lengths.
- `astral/RAxML_bipartitionsBranchLabels.bestML`: Greedy consensus of MLBS replicates, as done by RAxML.
- `astral/avian-astral-truegene.nwk`: Results of applying ASTRAL on true gene trees for the species tree inferred from the gene trees of this directory.
- `astral/avian-astral-truegene.nwk.info-estimatedgenetree`: ASTRAL log file of applying ASTRAL on true gene trees for the species tree inferred from the gene trees of this directory.
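To work with the MLBS replicates of one directory, the `astral/BS.[bsrep].tre` files can be collected with a short helper such as the one below. This is only a sketch: the function name `bootstrap_files` is hypothetical, `[bsrep]` is assumed to be an integer, and the files are discovered by pattern matching because the README does not say whether numbering starts at 0 or 1.

```python
from pathlib import Path


def bootstrap_files(replicate_dir):
    """Return the astral/BS.[bsrep].tre paths sorted by replicate number."""
    astral_dir = Path(replicate_dir) / "astral"

    def replicate_number(path):
        # "BS.17.tre" -> 17 (assumes [bsrep] is an integer)
        return int(path.name.split(".")[1])

    return sorted(astral_dir.glob("BS.*.tre"), key=replicate_number)
```

If all replicates are present, the returned list should contain 200 paths for each `noscale.1000g.[bp]/R[rep]/` directory.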
Biological datasets
In this paper, we analyzed four biological datasets. The results and the datasets are available in `Biological.tar.gz`:
Avian biological dataset of Jarvis et al., described in: Jarvis, Erich D., et al. "Phylogenomic analyses data of the avian phylogenomics project." GigaScience 4.1 (2015): 1-9.
- These data are available at http://dx.doi.org/10.5524/101041
The dataset analyzed by Xi et al., described in: Xi, Zhenxiang, Liang Liu, Joshua S Rest, and Charles C. Davis. “Coalescent versus Concatenation Methods and the Placement of Amborella as Sister to Water Lilies.” Systematic Biology 63, no. 6 (November 1, 2014): 919–932. http://doi.org/10.1093/sysbio/syu055.
- See Dryad repo doi:10.5061/dryad.qb251 for sequences. Gene trees were provided to us by the authors.
The 1KP dataset analyzed by Matasci et al., described in: Matasci, N., Hung, L.H., Yan, Z., Carpenter, E.J., Wickett, N.J., Mirarab, S., Nguyen, N., Warnow, T., Ayyampalayam, S., Barker, M. and Burleigh, J.G., 2014. Data access for the 1,000 Plants (1KP) project. GigaScience, 3(1), pp.1-10.
- Data are also available on iPlant: https://datacommons.cyverse.org/browse/iplant/home/shared/onekp_pilot
The dataset analyzed by Prum et al., described in: Prum, R.O., Berv, J.S., Dornburg, A., Field, D.J., Townsend, J.P., Lemmon, E.M. and Lemmon, A.R., 2015. A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature.
- Data are available from the original paper.
Sharing/Access information
Code/Software
The simulated data were generated using SimPhy with the scripts given here
- We used ASTRAL (posteval) version 4.9.1 for scoring and ASTRAL (master) version 4.9.8 for computing the branch lengths of the trees.
- To obtain the posterior probabilities of the branches of the main species tree and of the two alternative topologies, we used the posteval branch:
java -Xmx2000M -jar astral.4.9.1.jar -i [GENE TREES] -q [SPECIES TREE] -t 4
To compute the branch lengths of the main species tree, we used the MAP solution with the master branch:
java -Xmx2000M -jar astral.4.9.8.jar -i [GENE TREES] -q [SPECIES TREE] -t 2
To compute the bootstrap support of the alternative topologies, we used posteval:
java -Xmx2000M -jar astral.4.9.1.jar -i [BS-replicates] -q [SPECIES TREE] -t 5
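When these commands need to be run over many replicates, the argument list can be assembled programmatically. The Python sketch below only builds the command; it uses the `-i`, `-q`, and `-t` options exactly as they appear in the commands above. The function name `astral_command` and the file names in the example are hypothetical and chosen for illustration.

```python
def astral_command(jar, gene_trees, species_tree, annotation, memory="2000M"):
    """Build the argument list for scoring a species tree with ASTRAL."""
    return [
        "java", f"-Xmx{memory}", "-jar", jar,
        "-i", gene_trees,        # input gene trees (or bootstrap replicates)
        "-q", species_tree,      # species tree to score
        "-t", str(annotation),   # annotation level, as in the commands above
    ]


# Hypothetical file names, for illustration only.
posterior_cmd = astral_command("astral.4.9.1.jar", "genes.tre", "species.tre", 4)
branch_length_cmd = astral_command("astral.4.9.8.jar", "genes.tre", "species.tre", 2)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.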