Data from: Chronospaces: an R package for the statistical exploration of divergence times promotes the assessment of methodological sensitivity
Data files
Jul 24, 2024 version files (105.76 MB)
- Datasets.zip (8.44 MB)
- genesortR_for_chronospaces.R (23.11 KB)
- prepare_and_analize_datasets.R (22.35 KB)
- README.md (5.74 KB)
- Results.zip (97.27 MB)
Abstract
Much of our understanding of the history of life hinges upon time calibration, the process of assigning absolute times to cladogenetic events. Bayesian approaches to time scaling phylogenetic trees have dramatically grown in complexity, and depend today upon numerous methodological choices. Arriving at objective justifications for all of these is difficult and time-consuming. Thus, divergence times are routinely inferred under only one or a handful of parametric conditions, often chosen arbitrarily. Progress towards building robust biological timescales necessitates the development of better methods to visualize and quantify the sensitivity of results to these decisions.
Here, we present an R package that assists in this endeavor through the use of chronospaces, i.e., graphical representations summarizing variation in the node ages contained in time-calibrated trees. We further test this approach using three empirical datasets spanning widely differing timeframes.
Our results reveal large differences in the impact of many common methodological decisions, with the choice of clock (uncorrelated vs. autocorrelated) and loci having strong effects on inferred ages. Other decisions have comparatively minor consequences, including the use of the computationally intensive site-heterogeneous model CAT-GTR. Notably, these conclusions are as valid for Cenozoic divergences as they are for the deepest eukaryote nodes.
The package chronospace implements a range of graphical and analytical tools that assist in the exploration of sensitivity and the prioritization of computational resources in the inference of divergence times.
https://doi.org/10.5061/dryad.cfxpnvxdn
The data contained in this repository support the results presented in Mongiardino Koch & Milla Carmona (2024), which introduces the R package chronospace and explores its use to understand sources of uncertainty in divergence time estimation.
Description of the data and file structure
The repository contains two folders, which have been zipped for convenience.
The first of these, ‘Datasets’, includes three subfolders containing the data obtained from three publications dealing with the diversification of three clades. The name of each subfolder denotes the focal clade (i.e., ‘Curculionoidea’, ‘Decapoda’, and ‘Eukaryota’). Each of these folders contains the same set of files:
- ‘all_gene_trees.tre’: A tree file containing all gene trees, ordered as in the phylogenomic dataset (see below), and formatted in Newick style. These were inferred using ParGenes v.1.0.1 and can be imported into R using functions from the package ape (e.g., read.tree()), or otherwise visualized using FigTree.
- ‘calibrations.txt’: A text file including node calibrations in the format expected by PhyloBayes. For each constrained node, this includes the two terminal species that bracket it, as well as its minimum and maximum age bounds.
- ‘outgroups.txt’: A text file listing the terminal names of taxa considered outgroups to the clade of interest, one per line.
- ‘partitions.txt’: A RAxML-style partition file, specifying the type of data (in this case, amino acids), the partition name (numeric), and the interval of alignment columns that each locus covers.
- ‘supermatrix.fa’: A phylogenomic matrix in FASTA format, which can be read into R using functions from the package phangorn, as well as provided to various software for phylogenetic inference.
- ‘topology.tre’: A tree file in Newick format containing a single tree considered by the authors of each study as the best estimate of the phylogeny of the species studied, and used for divergence time estimation under a constrained topology in PhyloBayes.
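As an illustration of how ‘partitions.txt’ can be used, the following base R sketch parses RAxML-style partition lines into a table of loci and alignment intervals. It assumes that each line has the form "model, name = start-end" with a single contiguous interval. The toy lines and the model label "LG" are placeholders rather than the contents of the actual files, and parse_partitions() is a hypothetical helper, not a function of chronospace.

```r
# Toy partition lines in RAxML style. The model label ("LG") and the
# intervals are assumptions made for illustration only.
partition_lines <- c(
  "LG, 1 = 1-250",
  "LG, 2 = 251-610",
  "LG, 3 = 611-900"
)

# Hypothetical helper: turn lines of the form "model, name = start-end"
# into a data frame with one row per locus.
parse_partitions <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]  # drop empty lines
  pattern <- "^([^,]+),([^=]+)=[[:space:]]*([0-9]+)[[:space:]]*-[[:space:]]*([0-9]+)$"
  hits <- regmatches(lines, regexec(pattern, lines))
  bad <- lengths(hits) != 5      # full match plus four captured groups
  if (any(bad)) {
    stop("Unrecognized partition line: ", lines[which(bad)[1]])
  }
  fields <- do.call(rbind, hits)
  data.frame(
    model = trimws(fields[, 2]),
    locus = trimws(fields[, 3]),
    start = as.integer(fields[, 4]),
    end   = as.integer(fields[, 5]),
    stringsAsFactors = FALSE
  )
}

partitions <- parse_partitions(partition_lines)

# Number of alignment columns covered by each locus
partitions$length <- partitions$end - partitions$start + 1L
print(partitions)
```

For a real dataset, the toy vector could be replaced by the output of readLines() applied to the ‘partitions.txt’ file of the clade of interest, provided the lines follow the single-interval form assumed here.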
A second folder, named ‘Results’, includes the same three subfolders (one for each clade), in which the following files can be found:
- 40 ‘.tre’ files, each including 500 time-scaled phylogenies (chronograms) for the clade of interest, obtained using PhyloBayes under different parametric conditions (varying the type of clock, the model of molecular evolution, the subset of genes used, and the shape of the prior on constrained nodes). Each of these is the result of combining the posterior trees of two replicate chains and randomly subsampling 500 trees. The conditions under which each file was inferred are specified in its name (e.g., files whose names include ‘ln’ were inferred using a log-normal autocorrelated clock, while those that include ‘ugam’ were inferred using an uncorrelated gamma-distributed clock).
- A single R data object, stored as a ‘.Rda’ file, containing the node ages of all 40 ‘.tre’ files, ready to be imported into R using the function load(). These files were generated using the function extract_ages() from the package chronospace, and are provided here simply to ease the replication of results.
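The pooling and subsampling of posterior trees, and the reading of the clock model from file names, can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used to build the repository: the Newick strings are toy stand-ins for posterior trees, the file names are hypothetical, and clock_from_name() is a helper introduced here for illustration. Only the ‘ln’ and ‘ugam’ labels and the figure of 500 trees are taken from the description of the files.

```r
set.seed(1)  # fixed seed so the toy example is reproducible

# Toy stand-ins for the posterior samples of two replicate chains,
# stored as one Newick string per sampled tree (assumption).
chain1 <- sprintf("((A:%d,B:%d):1,C:%d);", 1:400, 1:400, 2:401)
chain2 <- sprintf("((A:%d,B:%d):2,C:%d);", 1:400, 1:400, 3:402)

# Pool both chains, then draw trees at random without replacement.
pooled <- c(chain1, chain2)
n_keep <- min(500L, length(pooled))
kept <- pooled[sample(length(pooled), n_keep)]

# Hypothetical helper: read the clock model from a file name, treating
# 'ln' and 'ugam' as tokens that are not embedded in longer words.
clock_from_name <- function(file_names) {
  is_ln   <- grepl("(^|[^A-Za-z])ln([^A-Za-z]|$)", file_names)
  is_ugam <- grepl("(^|[^A-Za-z])ugam([^A-Za-z]|$)", file_names)
  ifelse(is_ln, "autocorrelated log-normal",
         ifelse(is_ugam, "uncorrelated gamma", NA_character_))
}

# Hypothetical file names; the actual naming scheme may differ.
example_files <- c("example_ln_run.tre", "example_ugam_run.tre", "other.tre")
clock_labels <- clock_from_name(example_files)
print(data.frame(file = example_files, clock = clock_labels))
```

Matching ‘ln’ and ‘ugam’ as separate tokens rather than as plain substrings avoids mislabelling files whose names happen to contain those letters inside a longer word.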
Beyond these two folders, two R scripts are also provided. These are explained below, under Code/Software.
Sharing/Access information
Software used to generate these files includes:
- Lartillot, N., Rodrigue, N., Stubbs, D. & Richer, J. (2013) PhyloBayes MPI. Phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Systematic Biology, 62, 611-615.
- Mongiardino Koch, N. (2021) Phylogenomic subsampling and the search for phylogenetically reliable loci. Molecular Biology and Evolution, 38, 4025-4038.
- Morel, B., Kozlov, A.M. & Stamatakis, A. (2018) ParGenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes. Bioinformatics, 35, 1771-1773.
Genome-scale alignments, species trees, partition files, and fossil calibrations were obtained from:
- Shin, S., Clarke, D.J., Lemmon, A.R., Moriarty Lemmon, E., Aitken, A.L., Haddad, S., Farrell, B.D., Marvaldi, A.E., Oberprieler, R.G. & McKenna, D.D. (2018) Phylogenomic data yield new and robust insights into the phylogeny and evolution of weevils. Molecular Biology and Evolution, 35, 823-836.
- Strassert, J.F., Irisarri, I., Williams, T.A. & Burki, F. (2021) A molecular timescale for eukaryote evolution with implications for the origin of red algal-derived plastids. Nature Communications, 12, 1879.
- Wolfe, J.M., Breinholt, J.W., Crandall, K.A., Lemmon, A.R., Lemmon, E.M., Timm, L.E., Siddall, M.E. & Bracken-Grissom, H.D. (2019) A phylogenomic framework, evolutionary timeline and genomic resources for comparative studies of decapod crustaceans. Proceedings of the Royal Society B, 286, 20190079.
Code/Software
Script ‘prepare_and_analyze_datasets.R’ includes code that takes the files present within each subfolder of ‘Datasets’, subsamples the phylogenomic matrices using various approaches, and produces the files needed to run time-calibrated inference in PhyloBayes (including Slurm batch files). The code also takes the results present in the ‘Results’ folder and replicates the images included in the manuscript.
Script ‘genesortR_for_chronospaces.R’ is a version of the genesortR pipeline (Mongiardino Koch 2021), slightly modified to generate small subsampled molecular datasets for divergence time estimation. This second script is run automatically from within the first one.
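To give a sense of what subsampling a phylogenomic matrix involves, the base R sketch below draws loci at random from a toy supermatrix and concatenates their columns, updating the partition table to match. Random selection is used here only as the simplest case and is an assumption. It does not reproduce the criteria implemented in genesortR, and the sequences, partition table, and helper subsample_loci() are all hypothetical.

```r
set.seed(42)  # fixed seed so the toy example is reproducible

# Toy supermatrix: one concatenated amino acid sequence per taxon.
supermatrix <- c(
  taxon_A = "MKTAYIAKQRQISFV",
  taxon_B = "MKSAYLAKQRQLSFV",
  taxon_C = "MRTAYIAKERQISFI"
)

# Toy partition table: three loci of five columns each.
partitions <- data.frame(
  locus = c("1", "2", "3"),
  start = c(1L, 6L, 11L),
  end   = c(5L, 10L, 15L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep n_loci randomly chosen loci, in their
# original order, and rebuild both the alignment and the partitions.
subsample_loci <- function(seqs, partitions, n_loci) {
  stopifnot(n_loci >= 1, n_loci <= nrow(partitions))
  keep <- sort(sample(nrow(partitions), n_loci))
  # Extract the columns of each retained locus for all taxa
  pieces <- lapply(keep, function(i) {
    substring(seqs, partitions$start[i], partitions$end[i])
  })
  new_seqs <- do.call(paste0, pieces)
  names(new_seqs) <- names(seqs)
  # Recompute intervals so they refer to the reduced alignment
  locus_lengths <- partitions$end[keep] - partitions$start[keep] + 1L
  new_end <- cumsum(locus_lengths)
  list(
    sequences = new_seqs,
    partitions = data.frame(
      locus = partitions$locus[keep],
      start = new_end - locus_lengths + 1L,
      end   = new_end,
      stringsAsFactors = FALSE
    )
  )
}

subsampled <- subsample_loci(supermatrix, partitions, n_loci = 2)
print(subsampled$sequences)
print(subsampled$partitions)
```

Keeping the retained loci in their original order means the new partition intervals can be derived from a running sum of locus lengths, so the reduced alignment and its partition table stay consistent with each other.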
Data in this repository include the publicly available, genome-scale datasets originally gathered by Shin et al. 2018 (Curculionoidea), Wolfe et al. 2019 (Decapoda), and Strassert et al. 2021 (Eukaryota). These were subsampled into various subsets of genes using R code that is provided, and used to generate alternative reconstructions of the diversification of these clades using PhyloBayes v.4.1 (Lartillot et al. 2013). Posterior samples of time-scaled trees from each of these analyses are provided, along with R code to process the data and generate the results that are presented in the manuscript.