Dryad

ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization

Cite this dataset

Mirarab, Siavash; Yin, John; Zhang, Chao (2023). ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization [Dataset]. Dryad. https://doi.org/10.6076/D16W2H

Abstract

Motivation

Evolutionary histories can change from one part of the genome to another. The potential for discordance between the gene trees has motivated the development of summary methods that reconstruct a species tree from an input collection of gene trees. ASTRAL is a widely used summary method and has been able to scale to relatively large datasets. However, the size of genomic datasets is quickly growing. Despite its relative efficiency, the current single-threaded implementation of ASTRAL is falling behind the data growth trends and is not able to analyze the largest available datasets in a reasonable time.  

Results

ASTRAL uses dynamic programming and is not trivially parallel. In this paper, we introduce ASTRAL-MP, the first version of ASTRAL that can exploit parallelism and also uses randomization techniques to speed up some of its steps. Importantly, ASTRAL-MP can take advantage of not just multiple CPU cores but also one or several graphics processing units (GPUs). The ASTRAL-MP code scales very well with increasing CPU cores, and its GPU version, implemented in OpenCL, can achieve speedups of up to 158× compared to ASTRAL-III. Using GPUs and multiple cores, ASTRAL-MP is able to analyze datasets with 10,000 species or datasets with more than 100,000 genes in <2 days.

Availability and implementation

ASTRAL-MP is available at https://github.com/smirarab/ASTRAL/tree/MP

Methods

In testing the efficiency of ASTRAL-MP, we use several simulated and real datasets (see Table). The datasets range in the number of species (n) between 48 and 1,000 and have between 1,000 and 14,446 gene trees (k). 

Name | Original publication | # Species (n) | # Genes (k) | Type | # Generations | Contraction threshold | # Reps.
SV | Mirarab and Warnow (2015) | 100, 200, 500, 1000 | 1000 | Simulated | 2×10^6 | Fully resolved | 10
Avian | Mirarab et al. (2014a) | 48 | 14,446; 1,000 | Real | Unknown (order: 10^7) | Full, 0, 33, 50, 75% | 1, 10
Insects | Sayyari et al. (2017) | 144 | 1478 | Real | Unknown | Fully resolved |

Note: For SV, some outlier replicates have fewer than 1,000 genes because poorly resolved gene trees are removed. For avian, the full dataset is subsampled randomly to create 10 inputs with 1,000 gene trees each.
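
The random subsampling described in the note can be sketched as follows. This is only an illustration, not the procedure actually used to build the dataset: the function name, the fixed seed, the placeholder Newick strings, and the choice to sample without replacement within each input (and independently across inputs) are all assumptions.

```python
import random


def subsample_gene_trees(trees, n_inputs=10, size=1000, seed=0):
    """Draw n_inputs random subsets of `size` gene trees each.

    Assumption: each subset is drawn without replacement, and subsets
    are drawn independently, so a tree may appear in several subsets.
    """
    rng = random.Random(seed)  # fixed seed (hypothetical) for reproducibility
    return [rng.sample(trees, size) for _ in range(n_inputs)]


# Placeholder Newick strings stand in for the 14,446 avian gene trees.
all_trees = [f"(A,(B,C{i}));" for i in range(14446)]
subsets = subsample_gene_trees(all_trees)
```

Each element of `subsets` could then be written to its own file, one Newick tree per line, to serve as one input.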

In addition, to test the limits of n, we used an existing simulated dataset (20 replicates) with 10,000 species and 1,000 gene trees, similar to the SV1000 dataset.

To test the limits of k, we used an insect transcriptomic dataset (Misof et al., 2014; Sayyari et al., 2017) with 144 taxa and 1,478 genes, each with 100 bootstrapped gene trees.

Funding

National Science Foundation, Award: 1565862

National Science Foundation, Award: ACI-1053575