QuCo: quartet-based co-estimation of species trees and gene trees
Data files
Nov 29, 2023 version files 6.05 GB
Abstract
Motivation: Phylogenomics faces a dilemma: on the one hand, the most accurate species and gene tree estimation methods are those that co-estimate them; on the other hand, these co-estimation methods do not scale to moderately large numbers of species. The summary-based methods, which first infer gene trees independently and then combine them, are much more scalable but are prone to gene tree estimation error, which is inevitable when inferring trees from limited-length data. Gene tree estimation error is not just random noise and can create biases such as long-branch attraction.
Results: We introduce a scalable likelihood-based approach to co-estimation under the multi-species coalescent model. The method, called quartet co-estimation (QuCo), takes as input independently inferred distributions over gene trees and computes the most likely species tree topology and internal branch length for each quartet, marginalizing over gene tree topologies and ignoring branch lengths by making several simplifying assumptions. It then updates the gene tree posterior probabilities based on the species tree. The focus on gene tree topologies and the heuristic division to quartets enables fast likelihood calculations. We benchmark our method with extensive simulations for quartet trees in zones known to produce biased species trees and further with larger trees. We also run QuCo on a biological dataset of bees. Our results show better accuracy than the summary-based approach ASTRAL run on estimated gene trees.
Availability and implementation: QuCo is available at https://github.com/maryamrabiee/quco. Supplementary information: Supplementary data are available at Bioinformatics online.
README: QuCo Dataset
https://doi.org/10.6076/D1CP4R
Data belonging to the following paper:
Rabiee, Maryam, and Siavash Mirarab. “QuCo: Quartet-Based Co-Estimation of Species Trees and Gene Trees.” Bioinformatics 38, no. Supplement_1 (June 24, 2022): i413–21. https://doi.org/10.1093/bioinformatics/btac265.
Description of the data and file structure
There are several files:
Quartet-simulation-sequences.tar.gz
The main simulations presented in the paper, which involve Felsenstein’s zone quartets.
Here, we provide the simulated sequences.
Files are of the form: rep.[CU]d/R[long]l-[short]s/[rep]/seq[seqlength]/sequences.tar.gz
where
- [seqlength] is the sequence length, one of 1600, 800, 400, or 200.
- [rep] is the replicate number, between 1 and 20.
- [short] is the length of the short branch, one of 0.01, 0.02, 0.04, or 0.08.
- [long] is the length of the long branch, one of 0.1, 0.2, 0.3, or 0.4.
- [CU] is the internal branch length in coalescent units (CU), one of 0.1, 0.2, or 0.3.
Each file includes the simulated sequences in FASTA format for all genes of each replicate.
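The path pattern above can be expanded programmatically, for example to check that an extracted copy of the archive is complete. The following is only a sketch: it assumes the parameter grid is fully crossed and that the values appear in directory names exactly as written in this README (e.g. `0.1` rather than `0.10`), neither of which is confirmed here.

```python
from itertools import product

# Parameter values as listed in this README. How they are spelled inside
# directory names is an assumption; adjust if the extracted archive differs.
CU = ["0.1", "0.2", "0.3"]             # internal branch length (coalescent units)
LONG = ["0.1", "0.2", "0.3", "0.4"]    # long branch length
SHORT = ["0.01", "0.02", "0.04", "0.08"]  # short branch length
REPS = range(1, 21)                    # replicate numbers 1..20
SEQLEN = [1600, 800, 400, 200]         # sequence lengths


def expected_paths():
    """Yield every path implied by the pattern, assuming a full grid."""
    for cu, long_, short, rep, seqlen in product(CU, LONG, SHORT, REPS, SEQLEN):
        yield f"rep.{cu}d/R{long_}l-{short}s/{rep}/seq{seqlen}/sequences.tar.gz"


paths = list(expected_paths())
print(len(paths), "paths; first:", paths[0])
```

Comparing this list against the actual directory listing (for instance with `os.path.exists`) would show which combinations, if any, are absent.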
Anomaly-simulations-mrbayes-outputs.tar.gz
This includes the results of the anomaly zone simulations, and specifically the output of MrBayes.
For each of the 50 replicated simulations, we include:
- [id]/mrbayes-outputs.tar.gz
Inside each archive, we have MrBayes MCMC samples from 200 loci. The files are named as follows, where [locus id] is the name of the locus and there are results from four chains (runs 1-4). Each .t file includes the MCMC samples in nexus format, as generated by MrBayes.
- seq600/[locus id]/[locus id].nex.run[1/2/3/4].t
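As an illustration of how the .t files might be read, the sketch below pulls the sampled trees out of nexus text using only the Python standard library. It assumes the usual MrBayes layout in which each sample is a single line of the form `tree <name> = [&U] <newick>;`. The function name and the toy input are hypothetical, and taxon labels are left as the numeric codes from the `translate` block.

```python
import re

# One sampled tree per line: "tree gen.1000 = [&U] (...);"
# Bracketed comments such as [&U] before the Newick string are skipped.
TREE_LINE = re.compile(
    r"^\s*tree\s+(\S+)\s*=\s*(?:\[[^\]]*\]\s*)*(.+;)\s*$", re.IGNORECASE
)


def read_sampled_trees(lines):
    """Return (sample_name, newick) pairs for every tree line found."""
    trees = []
    for line in lines:
        match = TREE_LINE.match(line)
        if match:
            trees.append((match.group(1), match.group(2)))
    return trees


# Toy input for demonstration only; not taken from the dataset.
toy = """#NEXUS
begin trees;
   translate
      1 A,
      2 B,
      3 C,
      4 D;
   tree gen.0 = [&U] ((1:0.1,2:0.1):0.05,3:0.2,4:0.2);
   tree gen.1000 = [&U] ((1:0.1,3:0.1):0.05,2:0.2,4:0.2);
end;
""".splitlines()

samples = read_sampled_trees(toy)
print(len(samples), "trees; first sample:", samples[0][0])
```

For a real file, the same function can be applied to an open file handle, e.g. `read_sampled_trees(open(path))`, where `path` points to one of the extracted `.nex.run[1/2/3/4].t` files. Any burn-in removal is left to the user.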
BiologicalDataset_rj_tree_distribution.tar.gz
Biological dataset of Bossert et al. (2021) with 32 species and 1291 UCEs (http://www.ncbi.nlm.nih.gov/pubmed/33367855).
Here, for reproducibility, we provide MrBayes tree distributions. For each of the 1291 loci, you can find the following files:
- uce-[ucid].run[1/2].t: The MCMC sample in nexus format from MrBayes, for chain (run) 1 or 2.
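To check that both runs are present for every locus after extraction, the file names can be grouped by locus. This is a minimal sketch under the naming scheme stated above; the exact form of `[ucid]` is not specified in this README, so the pattern accepts any non-empty string there, and the helper names are hypothetical.

```python
import os
import re
from collections import defaultdict

# Matches "uce-[ucid].run1.t" or "uce-[ucid].run2.t"; the ucid format is assumed.
RUN_FILE = re.compile(r"^uce-(?P<ucid>.+)\.run(?P<run>[12])\.t$")


def group_runs(filenames):
    """Map each locus id to a dict {run_number: filename}."""
    by_locus = defaultdict(dict)
    for name in filenames:
        match = RUN_FILE.match(os.path.basename(name))
        if match:
            by_locus[match.group("ucid")][int(match.group("run"))] = name
    return dict(by_locus)


def incomplete_loci(by_locus):
    """Return locus ids that do not have both run 1 and run 2."""
    return sorted(u for u, runs in by_locus.items() if set(runs) != {1, 2})


# Made-up file names for demonstration only.
example = ["uce-10.run1.t", "uce-10.run2.t", "uce-7.run1.t", "README"]
grouped = group_runs(example)
print(len(grouped), "loci; incomplete:", incomplete_loci(grouped))
```

On the extracted archive, passing `os.listdir(directory)` to `group_runs` should yield 1291 loci if every locus is present.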
Sharing/Access information
The rest of the data is available on
Code/Software
The simulated data were generated using SimPhy.
Methods
Data are simulated by us and provided here for reproducibility.