ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization

Mirarab, Siavash; Yin, John; Zhang, Chao

Published Jul 05, 2023 on Dryad. https://doi.org/10.6076/D16W2H

Data files

Jul 05, 2023 version files 7.97 GB

Abstract

Motivation

Evolutionary histories can change from one part of the genome to another. The potential for discordance between the gene trees has motivated the development of summary methods that reconstruct a species tree from an input collection of gene trees. ASTRAL is a widely used summary method and has been able to scale to relatively large datasets. However, the size of genomic datasets is quickly growing. Despite its relative efficiency, the current single-threaded implementation of ASTRAL is falling behind the data growth trends and is not able to analyze the largest available datasets in a reasonable time.

Results

ASTRAL uses dynamic programming and is not trivially parallel. In this paper, we introduce ASTRAL-MP, the first version of ASTRAL that can exploit parallelism and also uses randomization techniques to speed up some of its steps. Importantly, ASTRAL-MP can take advantage of not just multiple CPU cores but also one or several graphics processing units (GPUs). The ASTRAL-MP code scales very well with increasing CPU cores, and its GPU version, implemented in OpenCL, can achieve speedups of up to 158× compared to ASTRAL-III. Using GPUs and multiple cores, ASTRAL-MP is able to analyze datasets with 10,000 species or datasets with more than 100,000 genes in <2 days.

Availability and implementation

ASTRAL-MP is available at https://github.com/smirarab/ASTRAL/tree/MP.

In testing the efficiency of ASTRAL-MP, we use several simulated and real datasets (see Table). The datasets range in the number of species (n) between 48 and 1,000 and have between 1,000 and 14,446 gene trees (k).

Name	Original publication	# Species (n)	# Genes (k)	Type	# Generations	Contraction threshold	# Reps.
SV	Mirarab and Warnow (2015)	100, 200, 500, 1000	1000	Simulated	2 × 10⁶	Fully resolved	10
Avian	Mirarab et al. (2014a)	48	14 446, 1000	Real	Unknown (order: 10⁷)	Full, 0, 33, 50, 75%	1, 10
Insects	Sayyari et al. (2017)	144	1478	Real	Unknown	Fully resolved	1

Note: For SV, some outlier replicates have fewer than 1,000 genes because poorly resolved gene trees are removed. For Avian, the full dataset is subsampled randomly to create 10 inputs with 1,000 gene trees each.
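
The random subsampling described in the note can be illustrated with a minimal sketch. This is not the authors' script: the function name `subsample_gene_trees`, the seed, and the placeholder tree labels are hypothetical, and the sketch assumes each input is drawn without replacement within itself and independently of the other inputs, which the description does not state.

```python
import random


def subsample_gene_trees(trees, size=1000, replicates=10, seed=0):
    """Draw `replicates` random subsets of `size` trees each.

    Assumption: sampling is without replacement inside a subset,
    and subsets are drawn independently of one another.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [rng.sample(trees, size) for _ in range(replicates)]


# Placeholder labels stand in for the 14,446 avian gene trees
# (in practice these would be Newick strings, one per gene).
all_trees = [f"gene_tree_{i}" for i in range(14446)]
subsets = subsample_gene_trees(all_trees)
```

Each element of `subsets` could then be written out, one tree per line, as a separate input file for a species-tree run.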

In addition, to test limits of n, we used an existing simulated dataset (20 replicates) with 10⁴ species and 1000 gene trees, similar to the SV1000 dataset.

To test limits of k, we used an insect transcriptomic dataset (Misof et al., 2014; Sayyari et al., 2017) with 144 taxa and 1,478 genes, each with 100 bootstrapped gene trees.
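
As a rough check on how large k can get with this dataset, the short sketch below computes the number of gene trees available if every bootstrap replicate of every gene were pooled into a single input. Pooling all replicates is an assumption made here for illustration, and the helper name `pooled_tree_count` is hypothetical.

```python
def pooled_tree_count(num_genes, bootstraps_per_gene):
    """Total gene trees if all bootstrap replicates are pooled (assumed)."""
    return num_genes * bootstraps_per_gene


k_max = pooled_tree_count(1478, 100)
print(k_max)  # 147800
```

Under that assumption the pooled input would contain 147,800 gene trees, which is consistent with the abstract's mention of datasets with more than 100,000 genes.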
