Data from: Improved robustness to gene tree incompleteness, estimation errors, and systematic homology errors with weighted TREE-QMC
Data files
Mar 14, 2025 version files 1.44 GB
- cloutier2019whole.tar.gz (8.23 MB)
- mirarab2015astral2-extsim.estimated-gene-trees-abayes.tar.gz (1.09 GB)
- mirarab2015astral2-extsim.species-trees.tar.gz (59.72 MB)
- morel2022asteroid-ilssim.gene-trees.tar.gz (186.76 MB)
- morel2022asteroid-ilssim.species-trees.tar.gz (10.68 MB)
- morel2022asteroid-plants.tar.gz (369.28 KB)
- README.md (5.24 KB)
- wu2024genomes.tar.gz (64.99 MB)
- zhang2022weighting-gteesim.species-trees.tar.gz (21.56 MB)
Abstract
Summary methods are widely used to reconstruct species trees from gene trees while accommodating discordance from incomplete lineage sorting; however, it is increasingly recognized that their accuracy can be negatively impacted by incomplete and/or error-ridden gene trees. To address the latter, Zhang and Mirarab (2022) updated the popular summary method ASTRAL so that it weights quartets based on gene tree branch lengths and support values. The implementation of these weighting schemes presented computational challenges, leading Zhang and Mirarab (2022) to replace ASTRAL's original algorithm (i.e., computing an exact solution within a constrained search space) with search heuristics based on phylogenetic placement. Here, we show that these weighting schemes can be effectively leveraged within the Quartet Max Cut framework of Snir and Rao (2010), introducing weighted TREE-QMC. The incorporation of weighting schemes into TREE-QMC required only a small increase in time complexity compared to the unweighted algorithm; fortunately, the increase in runtime was also small, behaving more like a constant factor in our simulation study. Moreover, weighted TREE-QMC was fast and highly competitive with weighted ASTRAL, even outperforming it in terms of species tree accuracy on some challenging simulation conditions, such as large numbers of taxa. In reanalyzing two avian data sets, we found that weighting quartets by gene tree branch lengths can improve robustness to systematic homology errors and can be as effective as removing the impacted taxa from individual gene trees or removing the impacted gene trees entirely. Lastly, our study revealed that TREE-QMC was robust to extreme rates of missing taxa, suggesting its utility as a supertree method.
https://doi.org/10.5061/dryad.hdr7sqvsx
Description of the data and file structure
This repository contains data sets (gene trees and species trees) used in the weighted TREE-QMC evaluation study. For results on simulated data sets, we maintained the directory structure of the original studies (see the original studies for a complete description of model conditions and simulation parameters).
Files and variables
File: cloutier2019whole.tar.gz
Description: Results on the avian data set from Cloutier et al. (2019); specifically, the best ML gene trees with abayes support that we estimated as well as the species trees that we estimated; see README.txt files in this directory for more information.
File: mirarab2015astral2-extsim.species-trees.tar.gz
Description: Results on the simulated data sets from Mirarab and Warnow (2015); specifically, the species trees that we estimated given the best ML gene trees either with sh support or abayes support. Species trees are in the directories labeled by <model condition>/<replicate number> and have the naming convention <method name and parameters>_<gene tree branch support>_<number of gene trees>gen.tre. If qsupp-wh is tagged at the end of the file name, it means branch support was estimated using hybrid weighted quartets.
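The naming convention above can be split into its fields programmatically. The following is a minimal sketch, not part of the original repository: it assumes that the method field may itself contain underscores, that the branch support label does not, and that the `qsupp-wh` tag can be detected as a plain substring (its exact position relative to the `.tre` extension is not specified here). The function name and the example file name are hypothetical.

```python
import re

# Assumed pattern: <method and parameters>_<support>_<N>gen, matched from the
# start of the name; anything after "gen" (extension, optional tag) is ignored.
_NAME = re.compile(r"(?P<method>.+)_(?P<support>[^_]+)_(?P<ngenes>\d+)gen")


def parse_species_tree_name(filename):
    """Split a species tree file name into its documented fields."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {
        "method": match["method"],
        "support": match["support"],
        "num_gene_trees": int(match["ngenes"]),
        # Assumption: the tag appears somewhere in the name as a substring.
        "hybrid_quartet_support": "qsupp-wh" in filename,
    }


# Hypothetical file name, for illustration only.
info = parse_species_tree_name("treeqmc-wh_abayes_1000gen.tre")
print(info)
```

Because the method field is matched greedily, the last two underscore-separated fields before `gen` are always read as the support label and the number of gene trees; check a few real file names from the archive before relying on this.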
File: mirarab2015astral2-extsim.estimated-gene-trees-abayes.tar.gz
Description: Results on the simulated data sets from Mirarab and Warnow (2015); specifically, the gene trees with abayes support that we estimated. Gene trees are in directories labeled by <model condition>/<replicate number>.
File: morel2022asteroid-plants.tar.gz
Description: Results on the plant data set from Morel et al. (2022); specifically, the gene trees file that we created by combining individual best ML gene trees into a single file, the species trees that we estimated, and the new reference tree that we created; see README.txt file in this directory for more information.
File: morel2022asteroid-ilssim.gene-trees.tar.gz
Description: Results on simulated data sets from Morel et al. (2022); specifically, the gene tree files that we created by combining individual gene trees into a single file. Gene trees are in the directories labeled by <model condition>_seed<replicate number>.
File: morel2022asteroid-ilssim.species-trees.tar.gz
Description: Results on simulated data sets from Morel et al. (2023); specifically, the species trees that we estimated from the (combined) best ML gene trees. Species trees are in the directories labeled by <model condition>_seed<replicate number> and have the naming convention <method name and parameters>.tre. If qsupp-wn is tagged at the end of the file name, it means branch support was estimated using unweighted quartets.
File: zhang2022weighting-gteesim.species-trees.tar.gz
Description: Results on simulated data sets from Zhang and Mirarab (2022); specifically, the species trees that we estimated from best ML gene trees with abayes and bootstrap support. Species trees are in the directories labeled by and have the naming convention <method name and parameters>_<gene tree branch support>_<sequence length>bps_<number of gene trees>gen.tre. If qsupp-wh is tagged at the end of the file name, it means branch support was estimated using hybrid weighted quartets.
File: wu2024genomes.tar.gz
Description: Results on the avian data set from Wu et al. (2024); specifically, the best ML gene trees with abayes support that we estimated as well as species trees that we estimated; see README.txt files in this directory for more information.
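Since several archives are large, it can help to inspect their directory layout before extracting anything. The sketch below uses Python's standard `tarfile` module to group tree files by the directory they sit in (for example, <model condition>/<replicate number>). The function name and the `.tre` marker are assumptions for illustration; adjust the marker after looking at the actual archive contents.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def trees_by_directory(archive_path, marker=".tre"):
    """Map each directory inside a .tar.gz archive to the tree files it holds.

    `marker` is an assumed substring identifying tree files.
    """
    groups = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:  # iterates over TarInfo entries without extracting
            path = PurePosixPath(member.name)
            if member.isfile() and marker in path.name:
                groups[str(path.parent)].append(path.name)
    return {directory: sorted(names) for directory, names in groups.items()}
```

For instance, `trees_by_directory("mirarab2015astral2-extsim.species-trees.tar.gz")` would return a dictionary keyed by directory path, which can then be filtered by model condition or replicate.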
Code/software
Source code for weighted TREE-QMC is available on GitHub: https://github.com/molloy-lab/TREE-QMC
Access information
Data was derived from the following sources:
- Cloutier et al. (2019)
- Mirarab and Warnow (2015)
- Morel et al. (2023)
- Wu et al. (2024)
- Zhang and Mirarab (2022)
