TREE-QMC: Improving quartet graph construction for scalable and accurate species tree estimation from gene trees
Han, Yunheng; Molloy, Erin K (2022), TREE-QMC: Improving quartet graph construction for scalable and accurate species tree estimation from gene trees, Dryad, Dataset, https://doi.org/10.5061/dryad.m0cfxpp6g
Genome-scale data is increasingly available for species tree estimation, bringing attention to the fact that different regions of the genome can have evolutionary histories that differ from each other and from the species tree due to incomplete lineage sorting. This has led to the development of many new methods, some of which estimate the species tree from a collection of (estimated) gene trees. ASTRAL, the dominant method in this class, executes an exact algorithm for the Maximum Quartet Support Species Tree problem within a constrained solution space, constructed from the input gene trees. The other popular heuristics for this NP-hard problem, wQMC and wQFM, take a set of weighted quartets as input and construct the species tree in a divide-and-conquer fashion, for example via graph cuts. This approach has failed to compete with ASTRAL because it is less scalable when considering the time required to explicitly weight quartets based on the gene trees. Moreover, in prior studies, wQMC has been consistently less accurate than either wQFM or ASTRAL. Here, we address that the poor accuracy of wQMC by showing that the quartet graph should be normalized to account for "artificial taxa," which are introduced during the divide phase so that solutions on subproblems can be combined during the conquer phase. This led us to design efficient techniques for uniform and non-uniform graph normalization, with the goal of the latter being improved robustness on data sets with large numbers of taxa and gene tree heterogeneity. We provide an algorithm to construct the normalized quartet graph directly from the input gene trees, enabling our new method TREE-QMC to run in O(n^3 k) time where n is the number of species and k is the number of gene trees (provided some assumptions on subproblem sizes). In an extensive simulation study, graph normalization enabled TREE-QMC to be competitive in terms of species tree accuracy with ASTRAL-III and FASTRAL, a recent method that runs ASTRAL-III using an aggressively constrained solution space to speedup analyses. Notably, when the number of taxa was large (>=500), TREE-QMC (with non-uniform graph normalization) produced more accurate species trees than either FASTRAL and ASTRAL-III, at speeds closer to FASTRAL. We also re-analyzed an avian data set of 3,679 ultraconserved elements, finding that the estimated trees differed only on the short branches suggestive of ILS, with the TREE-QMC tree being closer to the concatenation tree than the ASTRAL-III/FASTRAL tree. Overall, our study shows that wQMC's divide-and-conquer framework can be competitive with the dynamic programming approach of ASTRAL.
State of Maryland