ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization
Data files
Jul 05, 2023 version files 7.97 GB
Abstract
Motivation
Evolutionary histories can change from one part of the genome to another. The potential for discordance between the gene trees has motivated the development of summary methods that reconstruct a species tree from an input collection of gene trees. ASTRAL is a widely used summary method and has been able to scale to relatively large datasets. However, the size of genomic datasets is quickly growing. Despite its relative efficiency, the current single-threaded implementation of ASTRAL is falling behind the data growth trends and is not able to analyze the largest available datasets in a reasonable time.
Results
ASTRAL uses dynamic programming and is not trivially parallel. In this paper, we introduce ASTRAL-MP, the first version of ASTRAL that can exploit parallelism and also uses randomization techniques to speed up some of its steps. Importantly, ASTRAL-MP can take advantage of not just multiple CPU cores but also one or several graphics processing units (GPUs). The ASTRAL-MP code scales very well with increasing CPU cores, and its GPU version, implemented in OpenCL, can achieve speedups of up to 158× compared to ASTRAL-III. Using GPUs and multiple cores, ASTRAL-MP is able to analyze datasets with 10,000 species or datasets with more than 100,000 genes in <2 days.
Availability and implementation
ASTRAL-MP is available at https://github.com/smirarab/ASTRAL/tree/MP.
Methods
In testing the efficiency of ASTRAL-MP, we use several simulated and real datasets (see Table). The datasets range in the number of species (n) between 48 and 1,000 and have between 1,000 and 14,446 gene trees (k).
Name | Original publication | # Species (n) | # Genes (k) | Type | # Generations | Contraction threshold | # Reps. |
---|---|---|---|---|---|---|---|
SV | Mirarab and Warnow (2015) | 100, 200, 500, 1000 | 1000 | Simulated | 2×10⁶ | Fully resolved | 10 |
Avian | Mirarab et al. (2014a) | 48 | 14 446, 1000 | Real | Unknown (order: 10⁷) | Full, 0, 33, 50, 75% | 1, 10 |
Insects | Sayyari et al. (2017) | 144 | 1478 | Real | Unknown | Fully resolved | 1 |
Note: For SV, some outlier replicates have fewer than 1,000 genes because poorly resolved gene trees are removed. For avian, the full dataset is subsampled randomly to create 10 inputs with 1,000 gene trees.
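The random subsampling described in the note can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name `subsample_gene_trees`, the seed, the placeholder tree strings, and the choice to draw each of the 10 inputs independently and without replacement are all assumptions, since the note only says that the subsampling is random.

```python
import random

def subsample_gene_trees(trees, sample_size=1000, n_inputs=10, seed=0):
    """Draw n_inputs random subsets of sample_size trees.

    Each subset is sampled without replacement; different subsets are
    drawn independently and may therefore overlap (an assumption).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.sample(trees, sample_size) for _ in range(n_inputs)]

# Placeholder strings stand in for the 14,446 avian gene trees.
all_trees = [f"tree_{i}" for i in range(14446)]
inputs = subsample_gene_trees(all_trees)
print(len(inputs), len(inputs[0]))  # prints: 10 1000
```

In practice each element of `all_trees` would be one gene tree read from the full avian input file, and each subset would be written out as a separate input.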
In addition, to test limits of n, we used an existing simulated dataset (20 replicates) with 10⁴ species and 1,000 gene trees, generated similarly to the SV1000 dataset.
To test limits of k, we used an insect transcriptomic dataset (Misof et al., 2014; Sayyari et al., 2017) with 144 taxa and 1,478 genes, each with 100 bootstrapped gene trees.
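As a quick check on the size of this input, and assuming that the bootstrapped trees of all genes are pooled into a single collection (the text does not state this explicitly), the total number of gene trees is:

```latex
k = 1478 \times 100 = 147{,}800
```

This is consistent with the "more than 100,000 genes" figure quoted in the abstract.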