Nucleotide substitutions during speciation may explain substitution rate variation

Cite this dataset

Janzen, Thijs (2021). Nucleotide substitutions during speciation may explain substitution rate variation [Dataset]. Dryad.


Although molecular mechanisms associated with the generation of mutations are highly conserved across taxa, there is widespread variation in mutation rates between evolutionary lineages. When phylogenies are reconstructed based on nucleotide sequences, such variation is typically accounted for by the assumption of a relaxed molecular clock, which is just a statistical distribution of mutation rates without much underlying biological mechanism. Here, we propose that variation in accumulated mutations may be partly explained by an elevated mutation rate during speciation. Using simulations, we show how shifting mutations from branches to speciation events impacts inference of branching times in phylogenetic reconstruction. Furthermore, the resulting nucleotide alignments are better described by a relaxed than by a strict molecular clock. Thus, elevated mutation rates during speciation potentially explain part of the variation in substitution rates that is observed across the tree of life.


The file "Supplementary information.pdf" contains all supplementary results, as indicated in the main text. 

The associated R package 'nodeSub' is attached as a tar.gz file, from which it can be installed. Alternatively, the package can be installed from GitHub using devtools::install_github("thijsjanzen/nodeSub"), or obtained from CRAN.
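
As a convenience, the three installation routes mentioned above might look as follows in an R session. This is a sketch rather than part of the original description: the tarball path is a placeholder (use the name of the attached file), and the CRAN line is simply the standard install.packages() call.

```r
# Option 1: install the released version from CRAN
# (standard call; assumed here, not quoted from the dataset description)
install.packages("nodeSub")

# Option 2: install from GitHub, as given in the description
# (requires the devtools package: install.packages("devtools"))
devtools::install_github("thijsjanzen/nodeSub")

# Option 3: install from the attached tarball
# ("path/to/nodeSub.tar.gz" is a placeholder for the downloaded file)
install.packages("path/to/nodeSub.tar.gz", repos = NULL, type = "source")
```

Only one of the three options is needed. Installing from the attached tarball is the route most likely to reproduce the package version used for the published results.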

A tarball of the package phangorn as used for Figure 5 has been included for reference.

Furthermore, two example XML files as used by BEAST2 are included for inference of a full tree ('tree_inference_example.xml') and for obtaining the marginal likelihoods of the different substitution models ('example_marginal.xml').

The files "" and "" contain code and data used for each figure in the main text and the supplement. After unzipping and combining, each folder contains 4 files:

- simulate.R  - R code used to simulate alignments, perform inference and calculate summary statistics
- plot_figure.R - R code used to plot the figure
- simulated_data.txt - summary information of the simulated data. Because the BEAST chains are too large to share and process in full, we have summarized the posterior distributions of the summary statistics and collected the results in this file. We have included an example collection script (called collect.R) that demonstrates how we obtained these data files.
- Figure_*.pdf  - the resulting figure in high resolution
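
The included collect.R is the authoritative record of how the chains were summarized. Purely as an illustration of the general idea, the sketch below reduces one BEAST2 trace log to a few posterior summaries per parameter using base R. The function name summarise_log, the 10% burn-in, the chosen statistics and the toy column names are all assumptions made for this example and are not taken from collect.R.

```r
# Hypothetical helper (not from collect.R): summarise one BEAST2 trace log.
# BEAST2 trace logs are tab-separated, with '#' comment lines and a header row.
summarise_log <- function(log_file, burnin_fraction = 0.1) {
  trace <- read.table(log_file, header = TRUE, sep = "\t", comment.char = "#")

  # Discard the first part of the chain as burn-in (the fraction is an assumption).
  n_burn <- floor(nrow(trace) * burnin_fraction)
  if (n_burn > 0) {
    trace <- trace[-seq_len(n_burn), , drop = FALSE]
  }

  # Drop the sample counter so that only logged quantities remain.
  trace <- trace[, names(trace) != "Sample", drop = FALSE]

  # One row per logged quantity: mean, median and a 95% interval.
  data.frame(
    parameter = names(trace),
    mean      = vapply(trace, mean, numeric(1)),
    median    = vapply(trace, median, numeric(1)),
    lower95   = vapply(trace, quantile, numeric(1), probs = 0.025, names = FALSE),
    upper95   = vapply(trace, quantile, numeric(1), probs = 0.975, names = FALSE),
    row.names = NULL
  )
}

# Toy log written to a temporary file, so the sketch runs without real BEAST2 output.
toy_log <- tempfile(fileext = ".log")
writeLines(
  c("# toy trace, for illustration only",
    "Sample\tposterior\tTreeHeight",
    paste(seq(0, 9000, by = 1000), seq(-110, -101), seq(1, 10), sep = "\t")),
  toy_log
)

summary_stats <- summarise_log(toy_log)
print(summary_stats)
```

In practice such a function would be applied to the log of every replicate and the resulting rows bound together into a single table. That is the kind of compact summary that simulated_data.txt provides in place of the full chains.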