Data from: But the clock, tick-tock: The preeminence of relaxed clock models in total-evidence dated phylogenetics
Data files
Aug 12, 2025 version files (692.81 MB total)
- Final_matrices.zip (2.05 MB)
- Orthologs_from_GenBank.zip (293.88 MB)
- Other_supporting_files.zip (228.09 MB)
- README.md (16.72 KB)
- Scripts.zip (21.10 KB)
- Trees.zip (168.75 MB)
Abstract
Phylogenetic clock models translate inferred amounts of evolutionary change (calculated from either genotypes or phenotypes) into estimates of elapsed time, providing a mechanism for time scaling phylogenetic trees. Relaxed clock models, which accommodate variation in evolutionary rates across branches, are one of the main components of Bayesian dating, yet their consequences for total-evidence phylogenetics have not been thoroughly explored. Here, we combine morphological, molecular (both transcriptomic and Sanger-sequenced), and stratigraphic datasets for all major lineages of echinoids (sea urchins, heart urchins, sand dollars). We then perform total-evidence dated inference under the fossilized birth-death prior, varying two analytical conditions: the choice between autocorrelated and uncorrelated relaxed clocks, which enforce (or not) evolutionary rate inheritance; and the ability to recover ancestor-descendant relationships. Our results show that the latter has no impact on either topology or node ages and highlight a previously unnoticed interaction between the tree and clock models, with analyses implementing an autocorrelated clock precluding the recovery of direct ancestry. On the other hand, tree topology, fossil placement, divergence times, and downstream macroevolutionary inferences (e.g., ancestral state reconstructions) in sea urchins are all strongly affected by the type of relaxed clock implemented. In regions of the tree where molecular rate variation is pervasive and morphological signal relatively uninformative, fossil tips seem to play little to no role in informing divergence times, and instead passively move in and out of clades depending on the ages imposed upon them by molecular data. 
Our results highlight the extent to which the phylogenetic and macroevolutionary conclusions of total-evidence dated analyses are contingent on the choice of relaxed clock model, highlighting the need for either careful methodological validation or a thorough assessment of sensitivity. Our efforts continue to illuminate the echinoid tree of life, supporting the erection of the order-level clade Apatopygoida to include three living species last sharing a common ancestor with other extant lineages in the Jurassic. Furthermore, they also illustrate how the phylogenetic placement of extinct clades hinges upon the modelling of molecular data, evidencing the extent to which the fossil record remains subservient to phylogenomics.
https://doi.org/10.5061/dryad.xgxd254s5
Description of the data and file structure
The data contained in this repository support the results presented in Mongiardino Koch et al. (2025), which presents the latest total-evidence dated phylogeny of all major living and extinct clades of echinoids (sea urchins, sand dollars, heart urchins, and allies). Inference is built upon state-of-the-art morphological, molecular (both transcriptomic and Sanger-sequenced), and stratigraphic datasets, and provides new approaches for automatically extracting and combining these different sources of phylogenetic information. It furthermore explores the phylogenetic and macroevolutionary consequences of two methodological decisions generally determined arbitrarily by researchers: 1) the choice of clock model, and 2) the implementation of direct ancestry among fossil terminals. Results reveal the extreme dependence of tree topology, node ages, fossil placement, and morphological macroevolution on the former of these two, while the decision to allow for sampled ancestors does not carry major consequences. These insights are of major relevance for the entire phylogenetic community.
Files and variables
File: Final_matrices.zip
Description: This zip file contains phylogenetic matrices in various formats (NEXUS, phylip, and fasta) and accompanying files. Datasets are of 4 major types, depending on the information contained: morphological data, transcriptomic molecular data coded as amino acids, Sanger-sequenced molecular data coded as nucleotides, and combined datasets incorporating all of the former. Each of these data types is identified by file names starting with letters a through d, respectively. The following files are included:
- 'a_morphology.nex': A morphological dataset in NEXUS format, including 303 characters coded for 169 terminals. Terminal names include both the species used for coding, as well as the higher-level clade it represents (family/subfamily/tribe). These two are separated by a pipe symbol. The dataset can be visualized in either Mesquite or a text editor, run in phylogenetic software such as PAUP*, or imported into R for other purposes using package Claddis.
- 'b1_transcriptomic_385genes.fa': 385 high-occupancy loci subsampled from the larger phylogenomic dataset of Mongiardino Koch et al. (2022) and coded as amino acids. Loci retained have an occupancy > 80%, and were subsequently sorted from high to low phylogenetic usefulness using R pipeline genesortR (Mongiardino Koch 2021). The final alignment includes 86,074 positions coded for 35 echinoids and 1 holothuroid outgroup. Terminal names include both the species used for coding, as well as the higher-level clade it represents (family/subfamily/tribe). These two are separated by a pipe symbol. The dataset can be visualized in a text editor, run in phylogenetic software such as IQ-TREE, or imported into R for other purposes using package ape. The first 25 loci included are extracted by R code included in 'Scripts.zip' (file 'merge_datasets.R') and combined with the remaining sources of information to assemble the combined datasets labelled with letter d within this same zip file.
- 'b2_transcriptomic_385genes.txt': A text file containing RAxML-style partitioning information (loci start and end positions) for the 385 transcriptomic loci contained within file 'b1_transcriptomic_385genes.fa'.
- 'c1_nuclear_18S_RNA.phy': Phylogenetic matrix of nuclear 18S ribosomal RNA sequences in phylip format (25 taxa, 1767 nucleotide positions). The data were assembled from a de novo inference of orthologous sequences of all echinoid nuclear data deposited in NCBI. The orthologous clusters identified through this approach were subsampled to one sequence per major echinoid clade using R code included in 'Scripts.zip' (file 'match_seq_to_taxonomy.R'), and aligned using MAFFT. Terminal names include both the species used for coding with its corresponding GenBank accession number, as well as the higher-level clade it represents (family/subfamily/tribe). These two are separated by a pipe symbol.
- 'c2_nuclear_28S_RNA.phy': Phylogenetic matrix of nuclear 28S ribosomal RNA sequences in phylip format (40 taxa, 1063 nucleotide positions). The dataset was generated and formatted as described for file 'c1_nuclear_18S_RNA.phy'.
- 'c3_mitochondrial_12S_RNA.phy': Phylogenetic matrix of mitochondrial 12S ribosomal RNA sequences in phylip format (20 taxa, 531 nucleotide positions). The dataset was generated and formatted as described for file 'c1_nuclear_18S_RNA.phy'.
- 'c4_mitochondrial_16S_RNA.phy': Phylogenetic matrix of mitochondrial 16S ribosomal RNA sequences in phylip format (42 taxa, 649 nucleotide positions). The dataset was generated and formatted as described for file 'c1_nuclear_18S_RNA.phy'.
- 'c5_mitochondrial_COI.phy': Phylogenetic matrix of mitochondrial cytochrome oxidase I (COI) sequences in phylip format (62 taxa, 1252 nucleotide positions). The dataset was generated and formatted as described for file 'c1_nuclear_18S_RNA.phy'.
- 'd1_total_evidence_IGR_SA.nex': NEXUS formatted phylogenetic matrix combining the information contained in files 1 through 8. The skeleton of this file was automatically generated using R code included in 'Scripts.zip' (file 'merge_datasets.R'), which was subsequently modified using a text editor to add instructions to perform total-evidence dated phylogenetic inference in MrBayes using an uncorrelated IGR clock model and allowing for sampled ancestry.
- 'd2_total_evidence_IGR_noSA.nex': NEXUS formatted phylogenetic matrix combining the information contained in files 1 through 8. The skeleton of this file was automatically generated using R code included in 'Scripts.zip' (file 'merge_datasets.R'), which was subsequently modified using a text editor to add instructions to perform total-evidence dated phylogenetic inference in MrBayes using an uncorrelated IGR clock model and not allowing for sampled ancestry.
- 'd3_total_evidence_TK02_SA.nex': NEXUS formatted phylogenetic matrix combining the information contained in files 1 through 8. The skeleton of this file was automatically generated using R code included in 'Scripts.zip' (file 'merge_datasets.R'), which was subsequently modified using a text editor to add instructions to perform total-evidence dated phylogenetic inference in MrBayes using an autocorrelated TK02 clock model and allowing for sampled ancestry.
- 'd4_total_evidence_TK02_noSA.nex': NEXUS formatted phylogenetic matrix combining the information contained in files 1 through 8. The skeleton of this file was automatically generated using R code included in 'Scripts.zip' (file 'merge_datasets.R'), which was subsequently modified using a text editor to add instructions to perform total-evidence dated phylogenetic inference in MrBayes using an autocorrelated TK02 clock model and not allowing for sampled ancestry.
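As a rough illustration of how the partition file ('b2_transcriptomic_385genes.txt') relates to the alignment ('b1_transcriptomic_385genes.fa'), the sketch below parses RAxML-style partition lines and keeps the columns of the first few loci. It is a minimal Python sketch, not the authors' R code: the model label (LG), locus names, toy sequences, and the species-before-clade order of terminal names are all assumptions made for the example.

```python
import re

# Hypothetical RAxML-style partition lines ("MODEL, name = start-end").
# The real file lists 385 loci; its model labels and locus names may differ.
PARTITIONS = """\
LG, locus_1 = 1-4
LG, locus_2 = 5-9
LG, locus_3 = 10-12
"""

# Toy alignment keyed by terminal name (the "species|clade" order is assumed).
ALIGNMENT = {
    "Species_a|Clade_A": "MKTAYIAKQRQI",
    "Species_b|Clade_B": "MKTAHIAK-RQL",
}

LINE = re.compile(r"^\s*(\S+)\s*,\s*(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_partitions(text):
    """Return a list of (name, start, end) using 1-based inclusive coordinates."""
    parts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognised partition line: {line!r}")
        _model, name, start, end = match.groups()
        parts.append((name, int(start), int(end)))
    return parts


def first_loci(alignment, partitions, n):
    """Concatenate the columns of the first n partitions for every terminal."""
    chosen = partitions[:n]
    return {
        taxon: "".join(seq[start - 1:end] for _name, start, end in chosen)
        for taxon, seq in alignment.items()
    }


partitions = parse_partitions(PARTITIONS)
subset = first_loci(ALIGNMENT, partitions, 2)  # the repository uses the first 25
```

With the real files, the same slicing with n = 25 would correspond to the transcriptomic block that enters the combined (d) matrices, provided the partition file follows the layout assumed here.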
File: Orthologs_from_GenBank.zip
Description: This zip file contains 118 fasta files each containing orthologous nucleotide sequences for echinoids. Sequences come from a batch download of all non-redundant echinoid sequences deposited in GenBank, which were clustered into orthologs using the methods described in Yang and Smith (2014) as implemented by Picciani et al. (2018). These files are filtered and processed using R code included in 'Scripts.zip' (file 'match_seq_to_taxonomy.R') to produce the five nucleotide phylip datasets found in zip file 'Final_matrices.zip' (these would have to be aligned with MAFFT to be identical).
File: Scripts.zip
Description: Three R scripts are provided to replicate most of the analyses performed. All code dependencies are made clear and necessary packages are loaded (if previously installed) before running.
- 'match_seq_to_taxonomy.R': This R script loads sets of ortholog clusters (i.e., those available within zip file 'Orthologs_from_GenBank.zip'), assigns each sequence to a higher-level (suprageneric) clade using a taxonomic hierarchy (available in zip file 'Other_supporting_files.zip' as RData file 'echinoid_taxonomy_13Jul21.Rda'), and selects only one sequence for each of the target clades (in this case, defined in file 'OTUs.csv', also included within zip file 'Other_supporting_files.zip'). The taxonomic hierarchy used was web scraped from the World Register of Marine Species using the deWoRMR R script (available from https://github.com/mongiardino/deWoRMR). The filtered datasets are available as five nucleotide phylip datasets found in zip file 'Final_matrices.zip' (these would have to be aligned with MAFFT to be identical).
- 'merge_datasets.R': This R script loads all individual phylogenetic datasets available in zip file 'Final_matrices.zip' (i.e., files 1 through 8) and uses the pipe symbol in their terminal names to combine the information for the same terminal spread across multiple files. It also incorporates missing entries for all terminals not included within any individual file. The resulting combined matrix is written out and served as the backbone upon which files 9 through 12 of zip file 'Final_matrices.zip' were generated.
- 'analyses_echinoid_TE2.R': This R script includes all necessary code to replicate downstream macroevolutionary analyses and topological comparisons of the results of different total-evidence dated runs. To do so, it loads and processes files saved within zip files 'Other_supporting_files.zip' and 'Trees.zip'.
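The merging step described for 'merge_datasets.R' can be pictured with the following minimal Python sketch. It is not the authors' R implementation: it assumes terminal names of the form 'species|clade', matches terminals across matrices by the clade label, and pads terminals absent from a matrix with '?'. The toy matrices and names are hypothetical.

```python
def clade_of(terminal):
    """Return the clade label of a 'species|clade' terminal name (order assumed)."""
    return terminal.split("|", 1)[1]


def merge_matrices(matrices, missing="?"):
    """Concatenate matrices by clade, padding absent terminals with `missing`.

    `matrices` is a list of dicts mapping terminal name -> aligned character string.
    """
    clades = sorted({clade_of(t) for m in matrices for t in m})
    merged = {clade: "" for clade in clades}
    for matrix in matrices:
        widths = {len(seq) for seq in matrix.values()}
        if len(widths) != 1:
            raise ValueError("all rows of a matrix must have the same length")
        width = widths.pop()
        by_clade = {clade_of(t): seq for t, seq in matrix.items()}
        for clade in clades:
            # Terminals not sampled in this matrix receive missing entries.
            merged[clade] += by_clade.get(clade, missing * width)
    return merged


# Hypothetical toy data: a morphological block and a nucleotide block.
morphology = {"Sp_a|Clade_A": "0101", "Sp_f|Clade_F": "1?00"}
rna = {"Sp_a2|Clade_A": "ACGT-A", "Sp_b|Clade_B": "ACGTTA"}

combined = merge_matrices([morphology, rna])
```

Here Clade_A receives data from both blocks even though different species were coded in each, while Clade_B and Clade_F are filled with missing entries for the block in which they were not sampled.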
File: Other_supporting_files.zip
Description: This zip file contains five heterogeneous files that either contain information necessary to run the R routines included in zip file 'Scripts.zip', or are files that these routines would generate only after a considerable amount of time, and are thus provided to speed up replication.
- 'ages_ltt.Rda': This file contains ages for all nodes of the posterior samples of trees found in 'Trees.zip'. These ages are extracted from said trees by R code found in file 'analyses_echinoid_TE2.R' (within 'Scripts.zip'). Nonetheless, given that it takes a considerable amount of time to generate these data, they are provided to aid in replication and plotting of results. The data can be loaded directly into R.
- 'echinoid_taxonomy_13Jul21.Rda': This file contains the taxonomic hierarchy of echinoids as contained in the World Register of Marine Species (WoRMS) in July 2021. Contained are all the taxonomic hierarchies assigned to each valid species of echinoid. The data were web scraped using deWoRMR (https://github.com/mongiardino/deWoRMR) and are used by script 'match_seq_to_taxonomy.R' (within 'Scripts.zip') to assign molecular sequences to higher-level clades that are used as terminals in the phylogenetic analysis. The data can be loaded directly into R.
- 'OTUs.csv': A comma-separated value table of Operational Taxonomic Units (OTUs), i.e., the terminals used for the phylogenetic analysis, defined as clades at the level of genus or above. The table contains information regarding the name of the targeted clade (column OTU), its taxonomic rank (column Rank), the species within said clades that were used to obtain morphological and transcriptomic data (columns Morphological and Transcriptomic, respectively), a boolean indicating whether the clade is contained within another OTU due to lack of taxonomic resolution, and if so, inside which other OTU (columns Nested and Inside, respectively; the latter containing entries only if Nested is TRUE), and finally, a boolean indicating whether the clade contains any extant members (column Extant?). The table is used by script 'match_seq_to_taxonomy.R' (within 'Scripts.zip') to assign molecular sequences to higher-level clades that are used as terminals in the phylogenetic analysis. The data can be loaded directly into R.
- 'quartet_dists.Rda': This file contains all pairwise symmetric quartet distances between the sets of posterior trees found in 'Trees.zip'. These values are calculated from said trees by R code found in file 'analyses_echinoid_TE2.R' (within 'Scripts.zip') using functions from package Quartet. Nonetheless, given that it takes a considerable amount of time to generate these data, they are provided to aid in replication and plotting of results. The data can be loaded directly into R.
- 'Stratigraphic_data.xlsx': A spreadsheet (.xlsx) table of stratigraphic information for every extinct clade present in the analysis. The data were gathered from the primary paleontological literature and include, for every clade (column OTU), the extinct species for which the data were gathered (generally the type species of the OTU, also the one for which the morphological data were coded; column Species), some general taxonomic information considered pertinent (column Notes), the stratigraphic units containing said fossil (columns Period/Epoch and Age/StageSubdivision), the stratigraphic range of the fossil (separated into Oldest Bound (MA) and Youngest Bound (MA) columns, both numeric and in millions of years), a source from which the information was extracted (column Publication), further details on the geological formation containing the fossil taxon (if available; columns Formation and Precise Biostratigraphic Range), and a column with Stratigraphic References. The dates contained were incorporated into the combined NEXUS files used for phylogenetic inference (files 9-12 of 'Final_matrices.zip') as tip dates.
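To make the link between the stratigraphic table and tip dates concrete, the following Python sketch turns oldest and youngest bounds into MrBayes `calibrate` statements. Several things here are assumptions rather than facts about the repository: the table is taken to have been exported to CSV, the ages and OTU names are invented for illustration, and the use of a uniform prior across each range (or a fixed age when both bounds coincide) is one common choice that may not match the settings in the deposited NEXUS files.

```python
import csv
import io

# Hypothetical rows using three of the column names described above.
# Ages are illustrative only and are not taken from the real spreadsheet.
TABLE = """\
OTU,Oldest Bound (MA),Youngest Bound (MA)
Clade_X,252.2,247.2
Clade_Y,170.3,168.3
Clade_Z,66,66
"""


def calibration_lines(handle):
    """Build one MrBayes 'calibrate' statement per fossil terminal."""
    lines = []
    for row in csv.DictReader(handle):
        old = float(row["Oldest Bound (MA)"])
        young = float(row["Youngest Bound (MA)"])
        if young > old:
            raise ValueError(f"bounds reversed for {row['OTU']}")
        if young == old:
            # A zero-width range is expressed as a fixed age.
            prior = f"fixed({old:g})"
        else:
            # Assumed prior: uniform between youngest and oldest bounds.
            prior = f"uniform({young:g},{old:g})"
        lines.append(f"calibrate {row['OTU']}={prior};")
    return lines


statements = calibration_lines(io.StringIO(TABLE))
for statement in statements:
    print(statement)
```

The tip-age settings actually used are those written in the MrBayes blocks of files 'd1' through 'd4', which remain the authoritative source.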
File: Trees.zip
Description: Contained in this folder are numerous files containing phylogenetic trees of various kinds obtained from each of the four phylogenetic analyses performed (i.e., IGR-SA, IGR-noSA, TK02-SA, and TK02-noSA). For each of these analyses, the base folder contains:
- A maximum clade credibility tree in NEXUS format (files 'IGR_SA_MCC.nex', 'IGR_noSA_MCC.nex', 'TK02_SA_MCC.nex', and 'TK02_noSA_MCC.nex'). These were obtained using TreeAnnotator, part of the BEAST software.
- A majority rule consensus tree in Newick format (files 'IGR_SA_MRC.tre', 'IGR_noSA_MRC.tre', 'TK02_SA_MRC.tre', and 'TK02_noSA_MRC.tre'). These were generated by MrBayes at the end of each phylogenetic run.
- A random subsample of 15,000 posterior trees of each analysis in NEXUS format (files 'IGR_SA_big_posterior_sample.nex', 'IGR_noSA_big_posterior_sample.nex', 'TK02_SA_big_posterior_sample.nex', and 'TK02_noSA_big_posterior_sample.nex'). These were subsampled from the full list of trees output by MrBayes using R code.
- A random subsample of 1,500 posterior trees of each analysis in NEXUS format (files 'IGR_SA_small_posterior_sample.nex', 'IGR_noSA_small_posterior_sample.nex', 'TK02_SA_small_posterior_sample.nex', and 'TK02_noSA_small_posterior_sample.nex'). These were subsampled from the full list of trees output by MrBayes using R code.
The zip file also contains 3 folders ('clades_100_percent', 'clades_95_percent', and 'clades_90_percent'). Within each are four files (one per phylogenetic analysis) in which the trees found in the files described under the fourth item above were further subsampled and their nodes collapsed to retain only clades found with high probability (i.e., either 100%, 95%, or 90%). These folders and files are generated by R code found in file 'analyses_echinoid_TE2.R' (within 'Scripts.zip'). Nonetheless, given that it takes a considerable amount of time to generate these data, they are provided to aid in replication and plotting of results.
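The idea of retaining only clades found with high probability can be sketched as follows. This Python example is a simplified stand-in for the R code in 'analyses_echinoid_TE2.R', not a reproduction of it: trees are written as nested tuples instead of being read from NEXUS files, the four toy trees and tip names are hypothetical, and the function only identifies which clades meet a frequency threshold (collapsing a tree would then amount to dropping every other node).

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (frozensets of tip names) in a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])  # a tip
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)  # an internal node defines a clade
        return members

    all_tips = tips(tree)
    found.discard(all_tips)  # the root clade is trivial
    return found


def supported_clades(trees, threshold):
    """Map each clade to its frequency, keeping those at or above `threshold`."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    total = len(trees)
    return {c: n / total for c, n in counts.items() if n / total >= threshold}


# Hypothetical posterior sample of four trees on tips A-D.
balanced = (("A", "B"), ("C", "D"))
ladder = ((("A", "B"), "C"), "D")
sample = [balanced, balanced, balanced, ladder]

strict = supported_clades(sample, 0.90)
relaxed = supported_clades(sample, 0.75)
```

In this toy sample the clade A+B occurs in every tree and survives the 90% threshold, whereas C+D occurs in three of four trees and is retained only at the lower threshold.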
Code/software
Phylogenetic matrices in NEXUS, fasta or phylip formats can be viewed and run using a variety of software (BEAST, MrBayes, IQ-TREE, PAUP*, RAxML, etc.). The final NEXUS files were run in MrBayes v.3.2.7a.
Sequence data in fasta format can be viewed using text editors or dedicated software such as SeaView. They can also be loaded into R using functions of package ape, and aligned using software such as MAFFT.
.Rda files and .csv files can be loaded and inspected in R using functions load() and read.csv(), respectively.
.R files contain R scripts that replicate all aspects of this study. These have numerous package dependencies, which are listed at the beginning of each file.
Tree files in Newick (.tre) and NEXUS formats (.nex) can be visualized using FigTree. They can also be loaded into R using functions read.tree() and read.nexus(), both from package ape.
