Data from: Phylogeny inference under time-decaying migration and varying information content

Cite this dataset

Shaik, Zaynab; Bergh, Nicola; Verboom, George; Oxelman, Bengt (2023). Data from: Phylogeny inference under time-decaying migration and varying information content [Dataset]. Dryad. https://doi.org/10.5061/dryad.cc2fqz6d1

Abstract

Postspeciation gene flow is widespread across the Tree of Life but is ignored as a cause of gene tree discordance under the standard multispecies coalescent. Where interspecific migration has occurred but is not modelled explicitly, effective population sizes, divergence times and topology can be seriously misestimated. Isolation-with-migration and multispecies coalescent-with-introgression models explicitly model migration but include additional parameters that limit their computational viability with even moderately sized molecular data sets. Here we simulate the evolution of sequences which vary in molecular information content under the coalescent while allowing continuous, tree-wide gene flow/migration between contemporaneous branches, the rate of which decreases with time since divergence. Using simulated sequences, we evaluate the performance of DENIM under rapidly to gradually time-decaying migration and benchmark its performance against the standard MSC method StarBeast3. DENIM consistently outperforms StarBeast3, both in phylogenetic accuracy and computational performance per core. Rapidly decaying migration is associated with improved topology and divergence time estimates under both DENIM and StarBeast3. While species tree estimation accuracy is not improved by increasing the number of loci from 30 to 60 under either method, model convergence is slowed considerably. By contrast, increasing sequence length to 10,000 bp has no clear effect on convergence rates, but shows a tendency towards increased accuracy in DENIM. We apply DENIM and StarBeast3 with a 36-locus empirical bat data set and recover species trees identical in topology to those obtained with 12,931 loci. Our work demonstrates that DENIM can deliver accurate phylogenetic estimates in the presence of both deep coalescence and empirically realistic migration patterns using as few as 30 loci with single-core runtimes of 2-3 days.

README: Phylogeny inference under time-decaying migration and varying information content

https://doi.org/10.5061/dryad.cc2fqz6d1

Input and output files for the simulated and empirical data for DENIM and StarBeast3 with supporting Figure S1 and Tables S1 and S2.

Description of the data and file structure

Empirical data: the folder "Empirical-data" contains the empirical results and a Read-me.txt file explaining the file contents. The folder "Bat-primate-sequence-data" contains the bat and primate sequence data from Jebb et al. (2020), provided separately for licensing reasons.

Simulated data: folders beginning with "Simdat" contain the files for the simulated data. A Read-me.txt file in "Simdat_9_Readme-and-SimulatedMigs" explains the file contents.

Folders "8_Locus-length-number-combs-not-included_filepart*" contain the input and output data files for the simulated combinations of locus length and locus number that failed to converge after several weeks of computation.
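The folder naming conventions described above can be used to sort an extracted copy of the dataset into its parts. The following is a minimal Python sketch, not part of the deposited scripts: the function name `classify_folders` and the group labels are hypothetical, and it assumes all folders sit directly under a single download directory.

```python
from pathlib import Path

# Folder names and prefixes taken from the README text above.
EMPIRICAL = ("Empirical-data", "Bat-primate-sequence-data")
SIMULATED_PREFIX = "Simdat"
NOT_CONVERGED_PREFIX = "8_Locus-length-number-combs-not-included_filepart"


def classify_folders(root):
    """Group the top-level folders under `root` by the README's naming conventions.

    Assumption: `root` is the directory into which the Dryad files were extracted.
    """
    groups = {"empirical": [], "simulated": [], "not_converged": [], "other": []}
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue  # Figure S1 and Tables S1 and S2 are separate files, not folders
        name = entry.name
        if name in EMPIRICAL:
            groups["empirical"].append(name)
        elif name.startswith(NOT_CONVERGED_PREFIX):
            groups["not_converged"].append(name)
        elif name.startswith(SIMULATED_PREFIX):
            groups["simulated"].append(name)
        else:
            groups["other"].append(name)
    return groups
```

Calling `classify_folders("path/to/download")` (a placeholder path) returns a dictionary of folder names per group, which makes it easy to exclude the non-converged runs before any downstream processing.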

Figure S1 and Tables S1 and S2 are included as separate files.

Sharing/Access information

All supporting files are provided on Dryad.

Code/Software

Annotated R scripts for the empirical and simulated data are in the folders "Empirical-data" and "Simdat_7_R-scripts" respectively.