Dryad

Data and supplementary information from: Sequential bayesian phylogenetic inference

Cite this dataset

Hoehna, Sebastian; Hsiang, Allison (2024). Data and supplementary information from: Sequential bayesian phylogenetic inference [Dataset]. Dryad. https://doi.org/10.5061/dryad.qrfj6q5nq

Abstract

The ideal approach to Bayesian phylogenetic inference is to estimate all parameters of interest jointly in a single hierarchical model. However, this is often not feasible in practice due to the high computational cost. Instead, phylogenetic pipelines generally consist of sequential analyses, whereby a single point estimate from a given analysis is used as input for the next analysis (e.g., a single multiple sequence alignment is used to estimate a gene tree). In this framework, uncertainty is not propagated from step to step, which can lead to inaccurate or spuriously certain results. Here, we formally develop and test the sequential approach for Bayesian phylogenetic inference, which uses importance sampling to generate observations for the next step of an analysis pipeline from the posterior produced in the previous step. The sequential approach presented here not only accounts for uncertainty between analysis steps, but also allows for greater flexibility in software choice (and hence model availability) and can be more efficient computationally than the traditional joint approach when multiple models are being tested. We show that the sequential approach is identical in practice to the joint approach only if the data contain sufficient information (yielding a narrow posterior distribution) and/or sufficiently many importance samples are used. Conversely, we show that the common practice of using a single point estimate can be biased, e.g., using a single phylogeny estimate to transform an unrooted phylogeny into a time-calibrated phylogeny. We demonstrate the theory of sequential Bayesian inference using both a toy example and an empirical case study of insect divergence time estimation using a relaxed clock model from transcriptome data.
In the empirical example, we estimate three posterior distributions of branch lengths from the same data (a DNA character matrix with a GTR+Gamma+I substitution model, an amino acid data matrix with empirical substitution models, and an amino acid data matrix with the PhyloBayes CAT-GTR model). Finally, we apply three different node-calibration strategies and show that both the data source and underlying substitution process used to estimate branch lengths and the node-calibration strategies impact divergence time estimates. Thus, our new sequential Bayesian phylogenetic inference provides the opportunity to efficiently test different approaches for divergence time estimation, including branch length estimation from other software.

README: Sequential Bayesian Phylogenetic Inference

Empirical Data Analysis (Divergence Times of Insects)

The data are retrieved from Misof, Bernhard et al. (2015). Data from: Phylogenomics resolves the timing and pattern of insect evolution [Dataset]. Dryad. https://doi.org/10.5061/dryad.3c0f1

This repository contains five files from that study; we only converted the data into a different file format. The files are:

  1. BLOSUM62.nex, JTT.nex and LG.nex, which contain the amino acid dataset for the analyses with empirical amino acid substitution models.
  2. supermatrix_C.nuc.all.fas, which contains the DNA supermatrix.
  3. supermatrix_D.phy, which contains the amino acid supermatrix in PHYLIP format for PhyloBayes.
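
As a quick sanity check after downloading, the DNA supermatrix can be inspected with a few lines of Python. The following is only a minimal sketch: it assumes that supermatrix_C.nuc.all.fas is a standard FASTA file (as the .fas extension suggests) and that it sits in the current working directory. The helper names read_fasta and alignment_shape are hypothetical and are not part of this dataset or of any library.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a dict mapping header text -> sequence string.

    Assumes standard FASTA: header lines start with '>', and a sequence
    may be wrapped over several lines.
    """
    chunks = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                name = line[1:].strip()
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    # Join wrapped lines into one string per taxon.
    return {taxon: "".join(parts) for taxon, parts in chunks.items()}


def alignment_shape(sequences):
    """Return (number of taxa, number of sites); fail if lengths differ."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences have unequal lengths: {sorted(lengths)}")
    return len(sequences), (lengths.pop() if lengths else 0)


if __name__ == "__main__":
    # Assumed location of the downloaded file; adjust as needed.
    fasta_path = Path("supermatrix_C.nuc.all.fas")
    if fasta_path.exists():
        n_taxa, n_sites = alignment_shape(read_fasta(fasta_path))
        print(f"{n_taxa} taxa, {n_sites} aligned sites")
```

The length check is there because a supermatrix is an alignment, so every taxon should have the same number of sites; a mismatch would usually point to a truncated download or a conversion problem. The NEXUS and PHYLIP files are not covered by this sketch.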

Funding

Deutsche Forschungsgemeinschaft