Data and code from: Spatially heterogeneous gene flow may hinder linking phylogeographic data to macroevolutionary patterns

Published Mar 05, 2026 on Dryad. https://doi.org/10.5061/dryad.9w0vt4bsb

Data files

Mar 05, 2026 version files 1.68 MB

01_simulations.zip

1.63 MB
02_processing.zip

14.79 KB
03_mapping.zip

3.62 KB
04_divergence_dates.zip

31.13 KB
README.md

8.57 KB

Abstract

Biogeography relies on estimates of phylogeny and gene flow to test hypotheses. Established theory posits (1) gene flow varies non-randomly with respect to geography and (2) gene flow can mislead phylogenetic inference. These findings, often viewed in isolation, together suggest that spatial impacts on gene flow could bias biogeographic inference. For example, whenever a “peripheral” population is isolated and inferred as sister to a “core” clade of adjacent populations that exchange migrants, it might reflect elevated gene flow of “core” populations rather than macroevolutionary processes (e.g., direction of colonization). Given this ubiquitous but often unacknowledged biogeographic setting, it remains unknown how spatial influences on gene flow impact metrics used to assess biogeographic hypotheses (e.g., branch length or monophyly). We simulate gene flow that varies nonrandomly regarding geography and found that relatively low levels of gene flow lead to monophyly of adjacent populations, with longer branches for peripheral populations, regardless of true divergence history. We highlight an empirical example wherein antbirds (Thamnophilidae) are young in Amazonia, but older at Amazonia’s periphery. Although it is unclear if the simulated bias applies to antbirds, our results suggest a distinguishability problem for any nodes stemming from radiations in geographically heterogeneous environments.

Last updated 4 March 2026

Repository for scripts and input files used in a study that investigates the potential role of spatially heterogeneous gene flow in shaping phylogenetic inference, with an eye towards a biogeographic pattern (the core-periphery effect).

Sections 1 through 3 are written by Ethan Gyllenhaal; Section 4 is written by Lukas Musher.

This README describes the scripts and data files used for the aforementioned project. Each section corresponds to a zipped directory in the Dryad repository.

Coalescent Simulations (01_simulations.zip)

core_phylo_noanc.py - Python script used to run msprime simulations, taking in a set of parameters defined in the batch script and outputting fasta format files for phylogenetic inference. Takes in parameters for output path, replicate number, number of migrants per generation (generally <1), and per population effective pop. size, time to run after colonization completes, interval of colonization events, and a string split order. Useage like python $dir/core_trial_noanc.py -r [int replicate] -m [float Nm] -n [int/float effective pop size] -e [int/float post-split evolution time] -i [int/float split interval in gens] -s [string split interval: CAB, ACB, or ABC] -o [output stem, ideally includes all variables used].

variant_fasta.sh - Custom script for reducing output fasta files to variant sites only, for space efficiency. Usage: sh variant_fasta.sh in.fasta out.fasta.

full_coresim_run.slurm - SLURM bash script used for running replicates. It uses GNU parallel to iterate over replicates. Note that this did not finish in the 48-hour walltime, and had to be resumed several times. Uploaded script reflects the original run.

sim_tree_output.zip - Zipped directory of all output treefiles from simulations. Used as input for the next section.

Simulation processing (02_processing.zip)

summarize_branch.py - Python script for summarizing output phylogenies run in a batch script with output from the msprime Python script (both in section 1). This uses ETE3 to process trees and output topologies and branch lengths. Takes a big list of tree files as input. Usage example: python summarize_branch.py -i 'sim_tree_output/*treefile' -o coresim_output_branch.csv

coresim_output_branch.csv - Output from summarize_branch.py. Columns are as follows, (NOTE, after the first use of Anest, Bnest, ABnest, Cmono, Other, Non-monophyletic, and Poor-resolution, the full set are refered to with a *):

First column is the stem name for the parameter combination
Anest - Number of topologies with population A but not B nested in the core (C) populations
Bnest - Number of topologies with population B but not A nested in the core (C) populations
ABnest - Number of topologies with population A and B nested in the core (C) populations
Cmono - Number of topologies with a monophyletic core group (all 4 populations)
Other - number of topologies not matching the prior set, never found here, not sure it can happen but a catch for if it does
Non-monophyletic - Number of topologies where A or B were non-monophyletic (none in this set)
Poor-resolution - Number of poorly resolved topologies without a well-supported monophyletic core
*total - The SUMMED tree heights for each category, notably not a mean, needs to be divided by the count per category
*_core - The summed mean branch lengths of core populations per category
*_a - The summed branch lengths of population a per category
*_b - The summed branch lengths of population b per category
total - The summed tree heights across categories
core - The summed mean core branch lengths across categories
a - The summed population a branch lengths across categories
b - The summed population b branch lengths across categories
name - Same as the first column, stem name of files
disp - Simulated core gene flow level in number of migrants WITHOUT THE DECIMAL (IN THIS CASE DIVIDED BY 10) for easier parsing
ne - Simulated per population effective population size
time - Simulated post-divergence time in generations
int - Simulated split interval
history - Simulated split history, named based on what orders divergences occurred in (A = population a, B = population b, C = core radiation)

plot_topo_branch_heatmap.R - R script for plotting summarized phylogenetic information, primarily heat maps. Also includes code for making input used in 04_divergence_dates (Fig 3A)

Richness Map (03_mapping.zip)

plot_antbird_richness.R - R script that uses BirdLife data to plot species richness map. Note that you'll have to get your own BirdLife data: https://datazone.birdlife.org/contact-us/request-our-data. Use the species list (below) to subset the massive input dataframe.

species.txt - List of antbird species used here, based on BirdLife taxonomy.

Divergence date estimation (04_divergence_dates.zip)

area.thamno.txt - Output from R code categorizing biogeographic scores into categories for plot in Figure 3B.

violin_full.csv - Input used for violin plots, generated by plot_topo_branch_heatmap.R (step 2). Most values from coresim_output_branch.csv, with some additional math from R. Columns are below:

Combination - Arbitrary number for each combination of parameters for each population's branch length
disp - Simulated core gene flow level in number of migrants WITHOUT THE DECIMAL (IN THIS CASE DIVIDED BY 10) for easier parsing
ne - Simulated per population effective population size
time - Simulated post-divergence time in generations
int - Simulated split interval
a - The summed population a branch lengths across categories
b - The summed population b branch lengths across categories
total - The summed tree heights across categories
core - The summed mean core branch lengths across categories
core_per - The summed mean core branch length per parameter combination divided by the smaller of summed population a or b branch lengths (if doing per category, would have to divide by the category count first, not relevant here)
branch.length - The mean branch length for the given population (core, A, or B)
name - Stem name of the file the data was calculated from
history - Simulated split history, named based on what orders divergences occurred in (A = population a, B = population b, C = core radiation)
count - Sum of replicates per parameter combo (20 for all here)
pop - The population these statistics were computed for, either all core populations combined or one individual peripheral population.
migration - Binary variable describing if any gene flow was included (does not separate by migration rates)

T400F_Thamnophilidae.tre - Phylogeny used for divergence dates and plotting.

Thamnophil_biogeographic_scoring.csv - CSV used as input for biogeographic assignment code. Columns are below:

Taxon_name_code - Name of taxon in phylogeny
Family - Family of taxon
genus - Genus of taxon
species - Species of taxon
subspecies_if_mult_sampled - Subspecies of taxon if multiple were sampled
specimen locality - Locality of specimen (only for one)
real_elevational_range - Elevational range of taxon if known (not used here)
n_america to islands - Biogeographic range of taxon (reduced to summary in area.thamno.txt). All of these variables are scored based on LJM's assessment of ranges based on available literature. In order, these refer to: North America, West Mexico, East Mexico, Central America, Northwest South America, Northern Andes, Southern Andes, Western Amazonia, Northeast Amazonia, Southeast Amazonia, Atlantic Forest of South America, Cerrado savanna ecoregion, Chaco region of South America, Caatinga biome of Brazil, Beni savanna ecoregion, Llanos grasslands, Tepui highlands, Parana river basin, Amazonian white-sand forest habitat, Temperate regions of South America, Atacama desert, Caribbean region, general lowland habitats under 250m elevation, general upland habitat from 250 to 1500m, general montane habitat from 1500m to 3500m, high montant habitat over 3500m, Amazonian flooded forest, Amazonian terra firme forest, general arid habitats, general humid habitats, and general oceanic islands.
Notes - additional notes if relevant

thamnophilidae_ages.R - R code for generating the phylogeny and box plots in Figure 3 (A and B). Uses the biogeographic scoring CSV and tree, outputs area.thamno.txt