Completing gene trees without species trees in sub-quadratic time

Published Jun 26, 2023 on Dryad. https://doi.org/10.6076/D1N30V

Data files

Jun 26, 2023 version files 5.02 GB

Abstract

Motivation: As genome-wide reconstruction of phylogenetic trees becomes more widespread, limitations of available data are being appreciated more than ever before. One issue is that phylogenomic datasets are riddled with missing data, and gene trees, in particular, almost always lack representatives from some species otherwise available in the dataset. Since many downstream applications of gene trees require or can benefit from access to complete gene trees, it will be beneficial to algorithmically complete gene trees. Also, gene trees are often unrooted, and rooting them is useful for downstream applications. While completing and rooting a gene tree with respect to a given species tree has been studied, those problems are not studied in depth when we lack such a reference species tree.

Results: We study completion of gene trees without a need for a reference species tree. We formulate an optimization problem to complete the gene trees while minimizing their quartet distance to the given set of gene trees. We extend a seminal algorithm by Brodal et al. to solve this problem in quasi-linear time. In simulated studies and on a large empirical dataset, we show that completion of gene trees using other gene trees is relatively accurate and, unlike the case where a species tree is available, is unbiased.

Availability and implementation: Our method, tripVote, is available at https://github.com/uym2/tripVote.

We test tripVote on published simulated datasets and a real plant dataset by OneKP Initiative. The simulated datasets were both created using Simphy to generate gene and species trees under the multi-species coalescent model and heterogeneous parameters, and Indelible to simulate nucleotide sequences on gene trees according to the GTR + Γ model with varying sequence lengths and sequence evolution parameters. FastTree2 was used to estimate gene trees based on the GTR + Γ model.

The 201-taxon dataset by Mirarab and Warnow (2015) includes incomplete lineage sorting (ILS). We use 3 model conditions with 200 taxa where tree length is 2M and speciation rate is 1e-6, and use the first 20 out of the 50 replicates of the original dataset (to save computational time). In each replicate, we use the first 500 (out of 1000) estimated gene trees that are fully resolved. The three model conditions have medium, high, and very high levels of ILS, resulting in 21, 33, and 69% RF distance between true gene trees and the species tree. They also have high levels of gene tree estimation error (15, 22, and 34% RF between estimated and true gene trees).

The 31-taxon dataset by Mai et al. (2017) is used to examine the effect of clock deviations. We use the three model conditions with the root-to-crown ratio equal to 1.0, but varying levels of deviation from the clock (low, medium, high). We only use the first 20 out of the 100 replicates of the original dataset because our experiments are computationally intensive. The average coefficient of variation of root-to-tip distances of low, medium, and high deviations are 0.04, 0.30, and 0.69, respectively. These replicates have moderately high level of ILS (with 24% mean RF distance between true gene trees and the species tree). The amount of gene tree estimation error increases with deviations from the clock (RF errors are 30, 41, and 52%).

The real OneKP biological dataset of 1178 plants by OneKP Initiative (2019) has 384 gene trees, all of which miss some of the species. The original study provide an ASTRAL species tree inferred from 384 gene trees, inferred using RAxML, each with at least $1178 / 2 = 589$ species.

We compare tripVote with two alternative tree completion algorithms: ASTRAL-completion, the method used in ASTRAL, and OCTAL, which minimizes RF distance of each gene tree to the species tree. ASTRAL-completion is run using the ASTRAL software, and OCTAL is run using the TRACTION-RF software. In addition, to guide visualization and interpretation, we add a lower-bound control by randomly completing the gene trees (repeated 100 times and averaged).

In simulated datasets, we randomly remove m leaves from each estimated gene tree to create incomplete gene trees; $m \in {0, 1, 2, 20, 50, 100}$ for the 201-taxon dataset and $m \in {0, 1, 2, 3, 8, 15}$ for the 31-taxon dataset. We use tripVote, OCTAL and ASTRAL-completion to complete each set. For the 201-taxon dataset with m = 1, we compare the accuracy of tripVote with and without the sampling.

We also test the ability of tripVote to improve species tree estimation. On the 201-taxon dataset, we compare five versions of ASTRAL for inferring species trees from incomplete gene trees. (i) The default ASTRAL uses ASTRAL-completion to construct the search space and original trees to score. (ii) We use tripVote in place of ASTRAL-completion but continue to score trees using incomplete trees. (iii) We use OCTAL in place of ASTRAL-completion. Since running OCTAL needs a species tree, we use the ASTRAL species tree inferred in (i) as input to OCTAL. Thus, in this setting, ASTRAL is run twice. (iv) We use the gene trees completed by ASTRAL-completion as input to ASTRAL, making them used both for search space creation and scoring. (v) Similarly, we use tripVote completed trees as input. We measure the error of these ASTRAL trees by computing their RF distances to the true species tree.

We also test tripVote and ASTRAL-completion on their ability to root an unrooted gene tree with respect to other rooted gene trees. On the simulated datasets, we remove the outgroup from a set of n − k gene trees (arbitrarily selected) and use the k remaining trees to infer the outgroup placement. We vary k in ${1, 10, 50, 100, 250, 500}$ We compare tripVote to other rooting methods: the outgroup rooting (root at the original placement of the outgroup before removing it), mid-point rooting and MinVar rooting and the random rooting (as a control).

For the OneKP dataset, we set up two versions: one where the original gene trees are used directly and one with extra missing data where we prune out an extra $p %$ of the taxa from each gene tree (for $p \in {5, 10, 15, 20}$ ⁠).

Completing gene trees without species trees in sub-quadratic time

Data files

Abstract

Methods

Works referencing this dataset