Quantifying the error of secondary vs. distant primary calibrations in a simulated environment

Powell, Christopher1; Waskin, Sydney1; Battistuzzi, Fabia Ursula1

Published Feb 17, 2020 on Dryad. https://doi.org/10.5061/dryad.1zcrjdfp5

Data files

Feb 17, 2020 version files (21.06 MB):

SimulatedSequences.7z (21.06 MB)

Abstract

Using calibrations to obtain absolute divergence times is standard practice in molecular clock studies. While primary (e.g., fossil) calibrations are preferred, their rarity can be limiting in fast-growing datasets. Alternatives therefore need to be explored, such as secondary (molecularly derived) calibrations, which can anchor a timetree at a larger number of nodes. However, the use of secondary calibrations has been discouraged in the past because of concerns about the error rates of the node estimates they produce with apparently high precision. Here, we quantify the amount of error in estimates produced with secondary calibrations relative to true times and to primary calibrations placed on distant nodes. We find that, overall, the inaccuracies in estimates based on secondary calibrations are predictable and mirror the errors associated with primary calibrations and their confidence intervals. Additionally, we find comparable error rates in times estimated from secondary calibrations and from distant primary calibrations, although the precision of estimates derived from distant primary calibrations is roughly twice as good as that of estimates derived from secondary calibrations. This suggests that increasing dataset size to include primary calibrations may produce divergence times that are about as accurate as those from secondary calibrations, albeit with higher precision. Overall, our results suggest that secondary calibrations may be useful for exploring the parameter space of plausible evolutionary scenarios when compared to time estimates obtained with distant primary calibrations.

Methods

We started from a main tree of 248 species represented in a tree of life. This main tree was split into two subtrees, tree A (173 species) and tree B (71 species), that represent two clades and maximize the size of the dataset in each tree. We then added to these clades two arbitrarily chosen shared lineages and an outgroup. This setup created two nested phylogenies that were used to test hypotheses on the performance of calibrations.

To simulate multiple genes, we used a set of 446 empirical parameters (e.g., length, GC content, initial evolutionary rate) and altered the main timetree according to an autocorrelated model (ν = 1) that resulted in estimated rates within ±25% of the mean rate. This effectively created 446 phylogenies with different branch lengths but the same topology. These parameters were given to SeqGen to simulate genes under a Hasegawa-Kishino-Yano (HKY) model. Ten random sets of individual genes were then concatenated to reach a length of at least 30,000 sites (30,029–30,725). In addition, we created one concatenation of all genes (approximately 604,000 sites) and two concatenations of half the genes each (223 genes per concatenation), with lengths of 273,812 and 330,187 sites. Each of these concatenations was used independently in downstream analyses. Patterns among the 30k, half, and full concatenations were similar. We therefore discuss results from the 30k concatenations because they allow us to evaluate the variance of estimates among datasets.

For primary calibrations, three nodes from tree A were chosen: a relatively shallow node at 63.9 million years ago (mya) and two deeper nodes in two different clades (209.4 mya and 220.2 mya). The node shared between trees A and B has an intermediate depth (167 mya) within tree A and is centrally placed within the topology of tree B. These primary and secondary calibrations were chosen to minimize the effect that biased location (e.g., all within one clade) and biased divergence times (e.g., all young nodes) may have on the accuracy of the estimates.
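The concatenation step described above can be illustrated with a minimal Python sketch. It is not the authors' script: the description does not state how genes were drawn or when drawing stopped, so the sketch assumes genes are sampled without replacement within a set and that sampling stops as soon as the running total reaches 30,000 sites. The function name `draw_concatenation`, the gene names, and the gene lengths are all hypothetical placeholders, not values from the dataset.

```python
import random


def draw_concatenation(gene_lengths, min_sites=30_000, rng=None):
    """Pick genes at random (without replacement) until min_sites is reached.

    gene_lengths maps a gene name to its alignment length in sites.
    Returns the chosen gene names and the total concatenated length.
    """
    rng = rng or random.Random()
    order = list(gene_lengths)
    rng.shuffle(order)  # random order = random draw without replacement

    chosen, total = [], 0
    for gene in order:
        if total >= min_sites:
            break  # assumed stopping rule: stop once the threshold is met
        chosen.append(gene)
        total += gene_lengths[gene]

    if total < min_sites:
        raise ValueError("gene pool is too small to reach min_sites")
    return chosen, total


# Hypothetical gene lengths standing in for the 446 empirical parameter sets.
length_rng = random.Random(0)
gene_lengths = {
    f"gene_{i:03d}": length_rng.randint(300, 2500) for i in range(446)
}

# Ten independent random concatenations, one seed per set for reproducibility.
concatenations = [
    draw_concatenation(gene_lengths, rng=random.Random(seed))
    for seed in range(10)
]

for index, (genes, total) in enumerate(concatenations, start=1):
    print(f"set {index}: {len(genes)} genes, {total} sites")
```

Under this stopping rule each set overshoots 30,000 sites by less than the length of its last gene, which is one plausible reason the reported lengths fall in a narrow band just above the threshold.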
