Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES
Data files
Jul 20, 2024 version files 48.39 GB
-
birds_convergence_files.tar.gz
319.55 MB
-
birds_output_directory_1.tar.gz
2.49 GB
-
birds_output_directory_10.tar.gz
2.56 GB
-
birds_output_directory_11.tar.gz
2.57 GB
-
birds_output_directory_12.tar.gz
2.56 GB
-
birds_output_directory_13.tar.gz
2.55 GB
-
birds_output_directory_14.tar.gz
2.55 GB
-
birds_output_directory_15.tar.gz
2.55 GB
-
birds_output_directory_16.tar.gz
2.54 GB
-
birds_output_directory_17.tar.gz
2.58 GB
-
birds_output_directory_18.tar.gz
1.41 GB
-
birds_output_directory_19.tar.gz
2.52 GB
-
birds_output_directory_2.tar.gz
1.39 GB
-
birds_output_directory_20.tar.gz
1.39 GB
-
birds_output_directory_3.tar.gz
1.95 GB
-
birds_output_directory_4.tar.gz
1.96 GB
-
birds_output_directory_5.tar.gz
1.97 GB
-
birds_output_directory_6.tar.gz
1.39 GB
-
birds_output_directory_7.tar.gz
2.55 GB
-
birds_output_directory_8.tar.gz
1.42 GB
-
birds_output_directory_9.tar.gz
2.57 GB
-
flies_convergence_files.tar.gz
3.36 MB
-
flies_output_directory_1.tar.gz
333.75 MB
-
flies_output_directory_2.tar.gz
313.86 MB
-
mammals_convergence_files.tar.gz
53.88 MB
-
mammals_output_directory_1.tar.gz
862.28 MB
-
mammals_output_directory_2.tar.gz
609.60 MB
-
mammals_output_directory_4.tar.gz
573.21 MB
-
mammals_output_directory_5.tar.gz
585.03 MB
-
mammals_output_directory_6.tar.gz
595.41 MB
-
mammals_output_directory_8.tar.gz
648.43 MB
-
README.md
8.86 KB
Jan 09, 2025 version files 51.29 GB
-
birds_48_read2tree_3_reference.nwk
2.04 KB
-
birds_48_read2tree_7_reference.nwk
2.04 KB
-
birds_convergence_files.tar.gz
319.55 MB
-
birds_output_directory_1.tar.gz
2.49 GB
-
birds_output_directory_10.tar.gz
2.56 GB
-
birds_output_directory_11.tar.gz
2.57 GB
-
birds_output_directory_12.tar.gz
2.56 GB
-
birds_output_directory_13.tar.gz
2.55 GB
-
birds_output_directory_14.tar.gz
2.55 GB
-
birds_output_directory_15.tar.gz
2.55 GB
-
birds_output_directory_16.tar.gz
2.54 GB
-
birds_output_directory_17.tar.gz
2.58 GB
-
birds_output_directory_18.tar.gz
1.41 GB
-
birds_output_directory_19.tar.gz
2.52 GB
-
birds_output_directory_2.tar.gz
1.39 GB
-
birds_output_directory_20.tar.gz
1.39 GB
-
birds_output_directory_3.tar.gz
1.95 GB
-
birds_output_directory_4.tar.gz
1.96 GB
-
birds_output_directory_5.tar.gz
1.97 GB
-
birds_output_directory_6.tar.gz
1.39 GB
-
birds_output_directory_7.tar.gz
2.55 GB
-
birds_output_directory_8.tar.gz
1.42 GB
-
birds_output_directory_9.tar.gz
2.57 GB
-
BUSCO_pipeline.zip
814.01 MB
-
BUSCO_result_summary_and_logs.zip
20.92 MB
-
flies_convergence_files.tar.gz
3.36 MB
-
flies_output_directory_1.tar.gz
333.75 MB
-
flies_output_directory_2.tar.gz
313.86 MB
-
mammals_convergence_files.tar.gz
53.88 MB
-
mammals_output_directory_1.tar.gz
862.28 MB
-
mammals_output_directory_2.tar.gz
609.60 MB
-
mammals_output_directory_4.tar.gz
573.21 MB
-
mammals_output_directory_5.tar.gz
585.03 MB
-
mammals_output_directory_6.tar.gz
595.41 MB
-
mammals_output_directory_8.tar.gz
648.43 MB
-
README.md
12.68 KB
-
yeast_convergence_output_0.tar.gz
18.60 MB
-
yeast_output_directory_0.tar.gz
2.05 GB
Abstract
Current genome sequencing initiatives across a wide range of life forms offer significant potential to enhance our understanding of evolutionary relationships and support transformative biological and medical applications. Species trees play a central role in many of these applications; however, despite the widespread availability of genome assemblies, accurate inference of species trees remains challenging for many scientists due to the limited automation, significant domain expertise, and substantial computational resources required by conventional methods. To address this limitation, we present ROADIES, a fully-automated pipeline to infer species trees starting from raw genome assemblies (those lacking prior annotations). In contrast to the prominent approach, ROADIES randomly selects segments of the input genomes to generate gene trees. This eliminates the need to choose any single reference species or perform the cumbersome steps of gene annotations and whole genome alignments. ROADIES also leverages existing discordance-aware methods that allow multi-copy genes, eliminating the need to infer orthology. Using the genomic datasets from large-scale sequencing efforts across four diverse life forms (placental mammals, pomace flies, birds, and budding yeasts), we show that ROADIES infers species trees that are comparable in quality with the state-of-the-art studies that involved domain experts but in a fraction of the time and effort. With its speed, accuracy, and automation, ROADIES has the potential to vastly simplify species tree inference, making it accessible to a broader range of scientists and applications.
Usage Notes
https://doi.org/10.5061/dryad.tht76hf73
ROADIES is a novel pipeline that infers phylogenetic species trees directly from raw genome assemblies.
For further details related to how to run the tool ROADIES, please refer to our Wiki: https://turakhia.ucsd.edu/ROADIES/
This repository contains the output files generated by ROADIES (v0.1.0) (https://github.com/TurakhiaLab/ROADIES/releases/tag/v0.1.0) for estimating the species tree for the following datasets (in the accurate mode of operation):
- 240 mammalian species from the infraclass Placentalia (alternatively referred to as “placental mammals”)
- 100 fly species belonging to the subfamilies Drosophilinae and Steganinae
- 363 bird species from the class Aves
- 332 yeast species from the subphylum Saccharomycotina
Along with the above files, it also contains:
- BUSCO pipeline output files for the 48-bird dataset
- Read2Tree pipeline output files for the 48-bird dataset
Description of the input datasets
The details of all the genomic datasets, including their respective orders, scientific nomenclature, and NCBI ID, are provided in the Supplementary_Table.xlsx (provided with the manuscript’s preprint). We used reference trees from authoritative studies as a proxy for ground truth to compare the accuracy of the species tree estimated by ROADIES.
Placental Mammals
The 240 species of placental mammals are collected from the Zoonomia consortium (https://cglgenomics.ucsc.edu/november-2020-nature-mammalian-and-avian-alignments/). The Zoonomia consortium originally published 241 assemblies, two of which come from the same species, Canis lupus familiaris (domestic dog). We removed the duplicate dog assembly (with NCBI ID GCF_000002285.3) and kept the one with NCBI ID GCA_004027395.1.
The reference tree of 240 placental mammals is taken from the Zoonomia consortium: https://www.science.org/doi/10.1126/science.abn3943
Drosophila
For 100 species of Drosophila, we collected the dataset from NCBI BioProject ID PRJNA675888 (https://www.ncbi.nlm.nih.gov/bioproject/675888) used by Kim et al. (https://elifesciences.org/articles/66405), along with the reference tree.
Aves
The 363 avian species are collected from the Birds 10k Genome Project’s dataset (https://www.nature.com/articles/s41586-020-2873-9). The reference tree is collected from the topology described by Stiller et al. (https://www.nature.com/articles/s41586-024-07323-1).
Yeast
The 332 budding yeast species are collected from the dataset provided by Shen et al. 2018 (https://www.cell.com/cell/fulltext/S0092-8674(18)31332-1?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867418313321%3Fshowall%3Dtrue). The reference tree topology is also collected from the same work.
Description of the uploaded ROADIES output files
Each <dataset>_output_directory_<num>.tar.gz contains a set of directories including ROADIES output files for the mammals, flies, and birds datasets (dataset = mammals or flies or birds). For each of these datasets, ROADIES was run simultaneously with smaller gene counts (2000-4000 genes) in separate AWS instances, with
Directory Contents
- alignments: Contains the pairwise alignment output of all individual input genomes aligned with randomly sampled gene sequences (in the form of <species_name>.maf).
- benchmarks: Contains the runtime values of individual jobs for each stage in the pipeline. These files are used to estimate and compare the stage-wise runtime and are not used in the final tree estimation.
- genes: Contains output files from the multiple sequence alignment and tree-building stages of ROADIES (run by PASTA (https://github.com/smirarab/pasta), RAxML-NG (https://github.com/amkozlov/raxml-ng)).
- genetrees:
  - gene_tree_merged.nwk: Lists all gene trees generated by the gene tree estimation step of ROADIES. This list is then provided to ASTRAL-Pro (the last stage of ROADIES) to estimate the final species tree.
  - original_list.txt: Lists all gene trees together with their corresponding gene IDs. Some lines may contain only a gene ID if the gene was filtered out for having fewer than four species. This file aids in debugging.
- plots:
  - gene_dup.png: Histogram showing the count of gene duplicates (Y-axis) vs. the number of genes with duplication (X-axis).
  - homologs.png: Histogram showing the count of genes (Y-axis) vs. the number of homologous species (X-axis).
  - num_genes.png: Plot showing the number of genes aligned to each input genome after pairwise alignment. X-axis represents genomes, Y-axis represents gene counts.
  - sampling.png: Plot showing the number of genes sampled from each input genome after random sampling. X-axis represents genomes, Y-axis represents gene counts.
- samples: Contains the list of randomly sampled genes from individual input genomes.
  - <species_name>_temp.fa: Contains the list of genes sampled from the particular input genome.
  - out.fa: Contains all randomly sampled subsequences (genes) combined, used for the pairwise alignment step.
- statistics: Contains CSV data for the plots in the plots directory.
  - gene_to_species.csv: Provides information about which genes are aligned to which species after the pairwise alignment step, including the gene ID, score, line number in the .maf file, and position of all homologs.
- roadies_stats.nwk: The final estimated species tree with branch support values in Newick format.
- roadies.nwk: The final estimated species tree in Newick format.
- roadies_rerooted.nwk (optional): The final estimated species tree, re-rooted to the outgroup node from the reference tree.
- time_stamps.csv: Contains the start time, number of gene trees required for estimating the species tree, end time, and total runtime (in seconds).
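As a quick sanity check on genetrees/gene_tree_merged.nwk, the sketch below counts the leaves and the distinct leaf labels in each gene tree. It assumes one Newick tree per line and unquoted leaf labels, neither of which is stated above; leaf_labels and summarize_gene_trees are hypothetical helper names and are not part of ROADIES.

```python
import re


def leaf_labels(newick):
    """Return the leaf labels of one Newick string (assumes unquoted labels)."""
    # A leaf label starts right after '(' or ',' and ends at ':', ',', ')' or ';'.
    # Labels that follow ')' are internal-node labels and are not matched.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def summarize_gene_trees(lines):
    """For each non-empty line, return (number of leaves, number of distinct labels)."""
    summary = []
    for line in lines:
        tree = line.strip()
        if not tree:
            continue  # skip blank lines
        labels = leaf_labels(tree)
        summary.append((len(labels), len(set(labels))))
    return summary


if __name__ == "__main__":
    # Two toy trees; the first contains a repeated label, as a multi-copy gene might.
    demo = ["((A:0.1,B:0.2)0.9:0.3,C:0.1,A:0.2);", "(A,B,(C,D));"]
    for n_leaves, n_distinct in summarize_gene_trees(demo):
        print(n_leaves, n_distinct)
```

If leaf labels are species names (an assumption), a tree with more leaves than distinct labels contains a multi-copy gene, and the distinct-label count can be compared against the four-species filter mentioned for original_list.txt.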
Description of the uploaded convergence files
The sets of gene trees from each instance (gene_tree_merged.nwk) for mammals, flies, and birds are collected and concatenated. These are iterated to mimic the convergence algorithm of ROADIES. The convergence output files are uploaded here as <dataset>_convergence_files.tar.gz, which contains the following files and directories (where iter_num is the iteration number, and iter_gene_count is the number of genes at each iteration). Note that the gene count doubles at each iteration as per the convergence strategy of ROADIES.
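The doubling schedule can be sketched as follows. The starting gene count used here is a hypothetical value chosen for illustration (the actual starting count is not stated in this description), and iteration_gene_counts is an assumed helper name rather than a ROADIES function.

```python
def iteration_gene_counts(initial_genes, num_iterations):
    """Return the gene count at each iteration when the count doubles every time."""
    return [initial_genes * 2 ** i for i in range(num_iterations)]


if __name__ == "__main__":
    # Hypothetical starting count of 250 genes over 4 iterations.
    for iter_num, iter_gene_count in enumerate(iteration_gene_counts(250, 4), start=1):
        # File names follow the run_<iter_gene_count>.nwk pattern used in this dataset.
        print(iter_num, iter_gene_count, f"run_{iter_gene_count}.nwk")
```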
Convergence Output Directory Contents
- convergence_output:
  - freqQuad_<iter_num>.csv: Generated by ASTRAL-Pro (https://github.com/chaoszhang/A-pro) corresponding to the list of gene trees provided at each iteration.
  - run_<iter_gene_count>.nwk: The final unrooted Newick tree generated by ROADIES at the species level at each iteration.
  - run_<iter_gene_count>_rerooted.nwk: The final rooted Newick tree generated by ROADIES at the species level at each iteration (using reroot.py from the ROADIES repo).
  - pruned_run_<iter_gene_count>.nwk: The final unrooted Newick tree generated by ROADIES at the order level at each iteration.
  - pruned_run_<iter_gene_count>_rerooted.nwk: The final rooted Newick tree generated by ROADIES at the order level at each iteration (using reroot.py from the ROADIES repo).
  - ref_dist.csv: Each row provides the following information:
    - Iteration number (iter_num)
    - Number of gene trees
    - Species-level normalized Robinson-Foulds (normRF) distance between the final estimated species tree (run_<iter_gene_count>.nwk) and the reference tree.
    - Order-level normRF distance between the final estimated species tree (pruned_run_<iter_gene_count>.nwk) and the reference tree.
    - The percentage of highly supported nodes in the current iteration’s final species-level tree. Nodes with local posterior probability >= 0.95 are considered highly supported. This value increases with each iteration, indicating increased confidence in the tree with more gene trees.
- gene_tree_list: Contains the following files:
  - gene_trees_<iter_gene_count>.nwk: Lists the gene trees at each iteration.
  - generated_mapping_<dataset>.txt: The mapping file provided to ASTRAL-Pro for the final species tree estimation.
  - get_values_from_genetrees_<dataset>.py: A Python script that takes the list of gene trees at each iteration and the species and order level reference trees as input, and outputs the final trees, distance, and support values (ref_dist.csv).
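To make the two quality columns of ref_dist.csv concrete, here is a minimal, self-contained sketch of a normalized Robinson-Foulds distance and of the percentage of highly supported nodes. It is not the code in get_values_from_genetrees_<dataset>.py. It assumes that both trees share the same leaf set, that labels are unquoted, that the distance is normalized by the total number of non-trivial splits in the two trees (one common convention; the normalization actually used is not stated here), and that support appears as a plain decimal label right after each closing parenthesis. All function names are hypothetical.

```python
import re


def clades(newick):
    """Parse one Newick string; return (all leaf names, leaf set of each internal node)."""
    s = newick.strip().rstrip(";")
    pos = 0
    internal = []

    def skip_to_delim():
        # Skip internal labels, support values, and branch lengths.
        nonlocal pos
        while pos < len(s) and s[pos] not in ",)":
            pos += 1

    def parse():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume '('
            leaves = set()
            while True:
                leaves |= parse()
                if s[pos] == ",":
                    pos += 1  # another child follows
                else:
                    pos += 1  # consume ')'
                    break
            skip_to_delim()
            internal.append(frozenset(leaves))
            return leaves
        start = pos
        while pos < len(s) and s[pos] not in ":,)":
            pos += 1
        name = s[start:pos]
        skip_to_delim()
        return {name}

    return parse(), internal


def splits(newick):
    """Return the non-trivial bipartitions of a tree, treated as unrooted."""
    leaves, internal = clades(newick)
    anchor = min(leaves)  # canonical side of a split: the one without the anchor leaf
    out = set()
    for clade in internal:
        side = leaves - clade if anchor in clade else clade
        if len(side) >= 2 and len(leaves) - len(side) >= 2:
            out.add(frozenset(side))
    return out


def norm_rf(newick_a, newick_b):
    """Symmetric difference of splits divided by the total number of splits."""
    a, b = splits(newick_a), splits(newick_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


def high_support_percent(newick, threshold=0.95):
    """Percentage of labelled internal nodes whose support is >= threshold."""
    values = [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]
    if not values:
        return 0.0
    return 100.0 * sum(v >= threshold for v in values) / len(values)


if __name__ == "__main__":
    estimated = "((A:1,B:1)0.97:1,(C:1,E:1)0.50:1,D:1);"
    reference = "((A,B),(C,D),E);"
    print(norm_rf(estimated, reference), high_support_percent(estimated))
```

For production use, an established phylogenetics library is a safer choice than this hand-written parser, which ignores quoted labels, comments, and richer annotation formats.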
Description of the uploaded BUSCO output files
There are two zip files associated with BUSCO outputs. BUSCO_pipeline.zip contains the output files of all the intermediate stages of the BUSCO-based concatenation and coalescent pipelines, including MAFFT, TrimAl, RAxML, and Pargenes. The details of these pipelines are given in the Methods section. BUSCO_result_summary_and_logs.zip contains the log files of BUSCO runs for all 48 birds along with the final scores in summary form.
Directory contents of BUSCO_pipeline.zip
- BUSCO coalescent pipeline: This contains the MAFFT and TrimAl output in the form of <BUSCO Gene ID>.fa.aln and <BUSCO Gene ID>.fa.aln.trimmed, respectively. It also contains the Pargenes output in a separate subfolder; Pargenes performs gene tree and species tree estimation.
- BUSCO concatenation pipeline: This contains the MAFFT and TrimAl output in the form of <BUSCO Gene ID>.fa.aln and <BUSCO Gene ID>.fa.aln.trimmed, respectively, and the RAxML output in a separate folder. RAxML_bestTree.my_busco_phylo gives the final species tree.
- create_supermatrix.py: This script creates a supermatrix from the filtered and trimmed MSAs to be provided to RAxML.
- supermatrix.aln.fa: This is the supermatrix containing all the MSAs in concatenated form from all 48 species.
- download_data.sh: This script downloads the genomic assemblies based on the provided list of GCA IDs.
- get_busco_sequences.py: This Python script extracts the single-copy complete BUSCO sequences based on the IDs extracted by get_common_busco_ids.py.
- get_common_busco_ids.py: This Python script extracts the BUSCO IDs of all single-copy complete BUSCO genes present in all species (in the concatenation pipeline) or present in at least one species (in the coalescent pipeline).
- run_busco.sh: This script runs the BUSCO tool for all 48 bird genomic assemblies.
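The difference between the two gene selection rules (present in all species for concatenation, present in at least one species for the coalescent pipeline) amounts to a set intersection versus a set union. The sketch below illustrates that idea only; it is not the code in get_common_busco_ids.py, and the function name, the mode strings, and the gene IDs are hypothetical.

```python
def common_busco_ids(ids_by_species, mode):
    """Select BUSCO gene IDs across species.

    ids_by_species maps a species name to the IDs of its single-copy complete
    BUSCO genes. mode is "concatenation" (genes found in every species) or
    "coalescent" (genes found in at least one species).
    """
    id_sets = [set(ids) for ids in ids_by_species.values()]
    if not id_sets:
        return set()
    if mode == "concatenation":
        return set.intersection(*id_sets)
    if mode == "coalescent":
        return set.union(*id_sets)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Made-up species names and gene IDs for illustration.
    demo = {
        "species_a": ["gene_1", "gene_2", "gene_3"],
        "species_b": ["gene_2", "gene_3"],
        "species_c": ["gene_3", "gene_4"],
    }
    print(sorted(common_busco_ids(demo, "concatenation")))
    print(sorted(common_busco_ids(demo, "coalescent")))
```

The intersection keeps the supermatrix free of species that lack a gene entirely, while the union retains more genes for the coalescent pipeline at the cost of some gene trees lacking species.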
Description of the uploaded Read2Tree output files
There are two Read2Tree output files in Newick format from two experiments, one using 3 reference species and the other using 7 reference species from the OMA database. These species trees are generated from the 48-bird dataset.
- birds_48_read2tree_3_reference.nwk: The final tree generated by Read2Tree using marker genes from 3 reference species from the OMA database.
- birds_48_read2tree_7_reference.nwk: The final tree generated by Read2Tree using marker genes from 7 reference species from the OMA database.
The details of the reference species are provided in the manuscript (Methods section).
Version changes
Jul 20, 2024: Initial version
Jan 6, 2025: Added the following new files:
- BUSCO output files (BUSCO_pipeline.zip, BUSCO_result_summary_and_logs.zip)
- Read2Tree output files (birds_48_read2tree_3_reference.nwk, birds_48_read2tree_7_reference.nwk)
- ROADIES output and convergence files for 332 yeast species (yeast_convergence_output_0.tar.gz, yeast_output_directory_0.tar.gz)