Dryad

Online Appendix and Cetacean Datasets for: The Occurrence Birth-Death Process for combined-evidence analysis in macroevolution and epidemiology

Cite this dataset

Andréoletti, Jérémy et al. (2022). Online Appendix and Cetacean Datasets for: The Occurrence Birth-Death Process for combined-evidence analysis in macroevolution and epidemiology [Dataset]. Dryad. https://doi.org/10.5061/dryad.p8cz8w9rq

Abstract

Phylodynamic models generally aim at jointly inferring phylogenetic relationships, model parameters, and more recently, the number of lineages through time, based on molecular sequence data. In the fields of epidemiology and macroevolution these models can be used to estimate, respectively, the past number of infected individuals (prevalence) or the past number of species (paleodiversity) through time. Recent years have seen the development of “total-evidence” analyses, which combine molecular and morphological data from extant and past sampled individuals in a unified Bayesian inference framework. Even sampled individuals characterized only by their sampling time, i.e. lacking morphological and molecular data, which we call occurrences, provide invaluable information to reconstruct the past number of lineages.

Here, we present new methodological developments around the Fossilized Birth-Death Process enabling us to (i) incorporate occurrence data in the likelihood function; (ii) consider piecewise-constant birth, death and sampling rates; and (iii) reconstruct the past number of lineages, with or without knowledge of the underlying tree. We implement our method in the RevBayes software environment, enabling its use along with a large set of models of molecular and morphological evolution, and validate the inference workflow using simulations under a wide range of conditions.

We finally illustrate our new implementation using two empirical datasets stemming from the fields of epidemiology and macroevolution. In epidemiology, we infer the prevalence of the COVID-19 outbreak on the Diamond Princess ship, by jointly taking into account the case count record (occurrences) and viral sequences for a fraction of infected individuals. In macroevolution, we infer the diversity trajectory of cetaceans using molecular and morphological data from extant taxa, morphological data from fossils, as well as numerous fossil occurrences. The joint modeling of occurrences and trees holds the promise to further bridge the gap between traditional epidemiology and pathogen genomics, as well as between paleontology and molecular phylogenetics.

Methods

Online Appendix : available in the `Related Works` section

Cetacean Datasets : [copied from the subsection *Material and methods* > *Cetacean data analysis* > *Molecular, morphological and occurrence datasets* of the main paper]

The data can be subdivided into three parts: molecular, morphological, and occurrences. Datasets were collected and analysed separately and are stored on the Open Science Framework (https://osf.io) ([dataset] Aguirre-Fernández et al., 2020). Molecular data comes from Steeman et al. (2009), and comprises 6 mitochondrial and 9 nuclear genes, for 87 of the 89 accepted extant cetacean species. Morphological data was obtained from Churchill et al. (2018), the most recent version of a widely-used dataset first produced by Geisler and Sanders (2003). After merging 2 taxa that are now considered synonyms on the Paleobiology Database (PBDB) and removing 3 outgroups that would have violated our model’s assumptions, it now contains 327 variable morphological characters for 27 extant and 90 fossil taxa (mostly identified at the species level but 21 remain undescribed). In order to speed up the analysis we further excluded the undescribed specimens and reduced this dataset to the generic level by selecting the most complete specimen in each genus. Indeed, the computing cost increases quadratically with the maximum number of hidden lineages N, to the point of becoming the bottleneck in our MCMC when N > 100. Given that a mid-Miocene peak diversity between 100 and 220 species is expected (Quental and Marshall, 2010), with fewer than 100 observed lineages in our inferred tree at that time, N should therefore be about 150. Inferring instead the tree of cetacean genera allows us to reduce N to 70 hidden lineages. The final dataset thus contains 41 extant and 62 extinct genera.
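As a rough illustration of the quadratic scaling mentioned above, and assuming the cost is exactly proportional to N squared (the text only states that it increases quadratically, so constant and lower-order terms are ignored here), moving from N of about 150 to N = 70 would reduce that cost by a factor of roughly 4.6:

```latex
\[
\frac{\mathrm{cost}(N = 150)}{\mathrm{cost}(N = 70)}
\approx \left(\frac{150}{70}\right)^{2}
= \frac{22500}{4900}
\approx 4.6
\]
```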

Occurrences were downloaded from the PBDB (data archive 9, M. D. Uhen) on May 11, 2020. The dataset initially consisted of all 4678 cetacean occurrences, but the cetacean fossil record is known to be subject to several biases (Uhen and Pyenson, 2007; Marx et al., 2016; Dominici et al., 2020). A detailed exploration (see Online Appendix E) of this occurrence dataset revealed several notable biases. First, an artefactual cluster of occurrences in very recent times, combined with other expected Pleistocene biases (Dominici et al., 2020), led us to remove all Late Pleistocene and Holocene occurrences. Second, we detected substantial variations in fossil recovery per time unit across lineages (see Fig. S10) resulting from oversampling of some species and localities, possibly due to greater abundance or spatio-temporal biases (Dominici et al., 2020). This observation violates our assumption of identical fossil sampling rates among taxa during a given interval. In order to reduce this bias, we retained occurrences identified at the genus level and further aggregated all occurrences belonging to an identical genus found at the same geological formation. In the case of occurrences for which the geological formation was not specified, we used geoplate data combined with stratigraphic interval as a proxy for geological formation. This resulted in a total of 968 occurrences retained for the analysis.
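The aggregation step described above can be sketched as follows. This is a minimal illustration and not the authors' script: the field names (`genus`, `formation`, `geoplate`, `interval`) and the placeholder records are hypothetical, and "identified at the genus level" is assumed here to mean that the genus field is non-empty.

```python
from collections import OrderedDict


def aggregate_occurrences(records):
    """Keep one occurrence per genus and geological formation.

    When the formation is not specified, the pair (geoplate, stratigraphic
    interval) is used as a proxy for the formation.
    """
    kept = OrderedDict()
    for rec in records:
        genus = rec.get("genus")
        if not genus:
            continue  # assumption: no genus-level identification, so drop it
        formation = rec.get("formation")
        if formation:
            key = (genus, "formation", formation)
        else:
            key = (genus, "proxy", rec.get("geoplate"), rec.get("interval"))
        kept.setdefault(key, rec)  # the first record seen for a key is kept
    return list(kept.values())


# Placeholder records, not real PBDB data.
records = [
    {"genus": "GenusA", "formation": "Formation X", "geoplate": "101", "interval": "Interval 1"},
    {"genus": "GenusA", "formation": "Formation X", "geoplate": "101", "interval": "Interval 1"},
    {"genus": "GenusA", "formation": "", "geoplate": "202", "interval": "Interval 2"},
    {"genus": "GenusA", "formation": "", "geoplate": "202", "interval": "Interval 2"},
    {"genus": "GenusB", "formation": "Formation X", "geoplate": "101", "interval": "Interval 1"},
    {"genus": "", "formation": "Formation X", "geoplate": "101", "interval": "Interval 1"},
]

aggregated = aggregate_occurrences(records)
print(len(aggregated))
```

Tagging each key with "formation" or "proxy" keeps a named formation from colliding with a geoplate and interval pair that happens to share the same label.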

Usage notes

Online Appendix :

This online Appendix presents the detailed derivation of the model used by Andréoletti, Zwaans et al., as well as supplementary results and figures. We extend results of Gupta et al. (2020) and Manceau et al. (2021) to piecewise-constant parameters, describe our implementation in the RevBayes software, and give detailed information on all priors used for simulation or inference in our analyses.

Cetacean molecular, morphological and occurrence datasets :

The initial raw files are included, but the modified files that were actually used in the analysis are the following :
 - Taxa : Cetacea_genera.csv
 - Nuclear sequences : M4358_nuclear_simplified_newNames_genera_removeOutgroups.nex
 - Mitochondrial sequences : M4376_mt_simplified_newNames_genera_removeOutgroups.nex
 - Morphological characters : morpho_simplified_newNames_genera_removeOutgroupsUndescribedInvariants.nex
 - Fossil occurrences : Cetacea_occurrences_min_max_age_species_corrected.csv

All modifications are described in the methods section and/or below.

Molecular dataset :
 - "newNames" = updated names from the PBDB in May 2020 (e.g. Physeter catodon -> Physeter macrocephalus)
 - "genera" = only the most complete specimen in each genus has been kept (present in the morphological dataset, then longest nuclear sequence) for genus-level analyses
   // Removed species : Balaenoptera acutorostrata, Balaenoptera bonaerensis, Balaenoptera borealis, Balaenoptera brydei, Balaenoptera edeni, Balaenoptera musculus, Balaenoptera omurai, Berardius arnuxii, Cephalorhynchus commersonii, Cephalorhynchus eutropia, Cephalorhynchus hectori, Delphinus capensis, Delphinus tropicalis, Eubalaena australis, Eubalaena japonica, Globicephala melas, Hyperoodon planifrons, Kogia simus, Lagenorhynchus acutus, Lagenorhynchus obliquidens, Lagenorhynchus australis, Lagenorhynchus cruciger, Lagenorhynchus obscurus, Lissodelphis peronii, Mesoplodon bidens, Mesoplodon bowdoini, Mesoplodon carlhubbsi, Mesoplodon densirostris, Mesoplodon stejnegeri, Mesoplodon ginkgodens, Mesoplodon grayi, Mesoplodon hectori, Mesoplodon layardii, Mesoplodon mirus, Mesoplodon perrini, Mesoplodon peruvianus, Mesoplodon traversii, Phocoena dioptrica, Phocoena sinus, Phocoena spinipinnis, Platanista minor, Sotalia guianensis, Stenella attenuata, Stenella clymene, Stenella frontalis, Stenella longirostris, Tursiops aduncus
 - remove outgroups (Bos taurus, Sus scrofa, Hippopotamus amphibius)

Morphological dataset :
 - morpho_conservative.nex : initial dataset
 - "newNames" = updated names from the PBDB in May 2020
 - "simplified" = simpler NEXUS files for RevBayes
 - "genera" = only the most complete specimen in each genus has been kept (lowest missing proportion, then higher number of unambiguous states) for genus-level analyses
   // Removed species : Atocetus nasalis, Brachydelphis jahuayensis, Haborophocoena minutus, Lophocetus repenningi, Odobenocetops peruvianus, Otekaikea huata, Parapontoporia wilsoni
 - remove outgroups (Bos taurus, Sus scrofa, Hippopotamus amphibius)
 - remove undescribed taxa (CCNHM 1078, CCNHM 208, CCNHM 210, CCNHM 567, CCNHM Schizodelphis, ChM PV2758, ChM PV2761, ChM PV2764, ChM PV4178, ChM PV4745, ChM PV4746, ChM PV4755, ChM PV4802, ChM PV4834, ChM PV4961, ChM PV5711, ChM PV5720, ChM PV5852, ChM PV7679, Schizodelphis morckhoviensis, Xenorophus sp.)
 - remove invariant characters
 - remove uncertainty-polymorphism (viewed as missing)
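
The genus-level selection rule listed above (lowest missing proportion, then higher number of unambiguous states) can be sketched as follows. This is an illustrative sketch and not the script used to build the files: the taxon names and character states are placeholders, and it assumes that "?" marks a missing state, "-" a gap, and any multi-character code such as "{01}" an uncertain or polymorphic state.

```python
def completeness_key(states):
    """Sort key: lower missing proportion first, then more unambiguous states."""
    missing = sum(1 for s in states if s == "?")
    # Assumption: a single-character code other than "?" or "-" is unambiguous.
    unambiguous = sum(1 for s in states if len(s) == 1 and s not in "?-")
    return (missing / len(states), -unambiguous)


def most_complete_per_genus(matrix):
    """Map each genus (first word of the taxon name) to its most complete taxon."""
    best = {}
    for taxon, states in matrix.items():
        genus = taxon.split()[0]
        current = best.get(genus)
        if current is None or completeness_key(states) < completeness_key(matrix[current]):
            best[genus] = taxon
    return best


# Placeholder matrix, not taken from the cetacean dataset.
matrix = {
    "GenusA beta": ["0", "1", "{01}", "?"],   # 25% missing, 2 unambiguous
    "GenusA alpha": ["0", "1", "?", "1"],     # 25% missing, 3 unambiguous
    "GenusA gamma": ["?", "?", "0", "1"],     # 50% missing
    "GenusB delta": ["1", "0", "0", "?"],
}

best = most_complete_per_genus(matrix)
print(best)
```

In this toy matrix "GenusA alpha" and "GenusA beta" tie on missing proportion, so the second criterion decides between them.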

Occurrence dataset :
 - downloaded from the Paleobiology Database (PBDB) on May 11th 2020

Scripts :

Validation 2 : simulation-based calibration
- Validation_2_simulations.py : Python script to simulate 1000 trees from prior distributions
- OBDP_validation_2.Rev : RevLanguage script to infer posterior distributions of the OBDP parameters for each of the simulated trees

Cetacean diversification analysis :
- mcmc_OBDP_Cetaceans_genera_constrained.Rev : RevLanguage script to estimate the cetacean tree and posterior distributions of the OBDP parameters
- inferKt_Cetaceans.Rev : RevLanguage script to infer the probability distribution of the number of cetacean genera through time

Diamond Princess SARS-CoV-2 transmission analysis :
- mcmc_DiamondPrincess.Rev : RevLanguage script to estimate the transmission tree and posterior distributions of the OBDP parameters
- inferKt_DiamondPrincess.Rev : RevLanguage script to infer the probability distribution of the number of infected individuals through time

- Plot_Kt_density.R : R script using new RevGadgets functions to plot the results from the inferKt scripts

Funding

ETH Zürich Postdoctoral Fellowship