Data from: Improving quartet graph construction for scalable and accurate species tree estimation from gene trees
Data files
Jul 02, 2022 version files 6.14 MB
- csvs.zip
- README.txt
Dec 23, 2022 version files 3.16 GB
- jarvis2014whole.tar.gz
- mahbub2021wqfm-aviansim.csvs.tar.gz
- mahbub2021wqfm-aviansim.estimated-species-trees.tar.gz
- mahbub2021wqfm-aviansim.gene-trees.tar.gz
- mahbub2021wqfm-aviansim.refined-gene-trees.tar.gz
- mahbub2021wqfm-aviansim.true-species-tree.tre
- mahbub2021wqfm-mammaliansim.csvs.tar.gz
- mahbub2021wqfm-mammaliansim.estimated-species-trees.tar.gz
- mahbub2021wqfm-mammaliansim.gene-trees.tar.gz
- mahbub2021wqfm-mammaliansim.refined-gene-trees.tar.gz
- mahbub2021wqfm-mammaliansim.true-species-tree.tre
- mirarab2015astral2.csvs.tar.gz
- mirarab2015astral2.estimated-gene-trees.tar.gz
- mirarab2015astral2.estimated-species-trees.tar.gz
- mirarab2015astral2.refined-gene-trees.tar.gz
- mirarab2015astral2.true-gene-trees.tar.gz
- mirarab2015astral2.true-species-trees.tar.gz
- README.md
Abstract
Summary methods are among the dominant approaches for estimating species trees from genome-scale data. However, they can fail to produce accurate species trees when the input gene trees are highly discordant due to gene tree estimation error as well as biological processes such as incomplete lineage sorting. Here, we introduce a new summary method, TREE-QMC, that offers improved accuracy and scalability under these challenging scenarios. TREE-QMC builds upon the algorithmic framework of QMC (Snir and Rao 2010) and its weighted version wQMC (Avni et al. 2014). Their approach takes weighted quartets (four-leaf trees) as input and builds a species tree in a divide-and-conquer fashion, at each step constructing a graph and seeking its max cut. We improve upon this methodology in two ways. First, we address scalability by providing an algorithm to construct the graph directly from the input gene trees. By skipping the quartet weighting step, TREE-QMC has a time complexity of O(n^3 k) under some assumptions on subproblem sizes, where n is the number of species and k is the number of gene trees. Second, we address accuracy by normalizing the quartet weights to account for "artificial taxa," which are introduced during the divide phase so that solutions on subproblems can be combined during the conquer phase. Together, these contributions enable TREE-QMC to outperform the leading methods (ASTRAL-III, FASTRAL, wQFM) in an extensive simulation study. We also present the application of these methods to an avian phylogenomics data set.
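To make the "quartet weighting step" concrete, the following is a minimal Python sketch of explicit quartet weighting: for every set of four taxa, it counts how many gene trees induce each resolved quartet topology. This is the kind of weighted-quartet input the abstract attributes to wQMC, and the step TREE-QMC is described as skipping. It is not the authors' implementation. The nested-tuple tree representation and the function names (`clades`, `quartet_topology`, `quartet_weights`) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of clade leaf sets) for a nested-tuple tree.

    Assumption: a tree is either a leaf name (str) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_topology(tree_clades, four):
    """Return the quartet ab|cd induced on four taxa, or None if unresolved."""
    four = frozenset(four)
    for clade in tree_clades:
        inside = clade & four
        # A clade holding exactly two of the four taxa separates them
        # from the other two, which fixes the quartet topology.
        if len(inside) == 2:
            return frozenset([inside, four - inside])
    return None


def quartet_weights(gene_trees):
    """Count, for each resolved quartet, the gene trees that induce it."""
    weights = Counter()
    for tree in gene_trees:
        leaves, tree_clades = clades(tree)
        # Enumerating every 4-taxon subset is what makes this step costly.
        for four in combinations(sorted(leaves), 4):
            topology = quartet_topology(tree_clades, four)
            if topology is not None:
                weights[topology] += 1
    return weights


# Hypothetical toy input: three gene trees on taxa A..E.
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    ((("A", "B"), "C"), ("D", "E")),
]
weights = quartet_weights(gene_trees)
ab_cd = frozenset([frozenset("AB"), frozenset("CD")])
ac_bd = frozenset([frozenset("AC"), frozenset("BD")])
print(weights[ab_cd], weights[ac_bd])
```

Because this sketch visits every one of the C(n, 4) four-taxon subsets in each gene tree, its cost grows much faster with n than the O(n^3 k) bound quoted above, which illustrates why constructing the graph directly from the gene trees matters for scalability.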