Gene flow complicates phylogenetic inference in an archipelago radiation
Data files
Dec 11, 2025 version files 1.92 GB
-
01_stacks.zip
5.97 KB
-
02_adegenet.zip
21.55 MB
-
03_admixture.zip
422.58 KB
-
04_confirm_intergrade.zip
17.34 MB
-
05_pairwise_Fst.zip
17.34 MB
-
06_iqtree_svdq.zip
26.67 MB
-
07_SNAPP.zip
14.85 MB
-
08_UCEs.zip
428.34 MB
-
09_DSuite.zip
7.49 KB
-
10_genetrees.zip
15.53 MB
-
11_photos.zip
5.35 MB
-
12_colonization_sims.zip
8.90 KB
-
13_phylo_sims.zip
1.37 GB
-
14_bonus_sim_stats.zip
2.32 MB
-
README.md
19.10 KB
-
TableS1_Sampling_table_SolSym.xlsx
16.63 KB
-
TableS2_UCE_sampling.xlsx
11.42 KB
Abstract
Allopatric divergence is a fundamental component of most traditional models of biogeography and community assembly. Gene flow between allopatric populations should be influenced by the nature of geographic barriers and can have a profound impact on adaptation, the speciation process, and phylogenetic inference. Superspecies—monophyletic groups of taxa with species-level differences in phenotype or genotype that are found exclusively in allopatry or parapatry—present an opportunity to characterize the effects of gene flow on the divergence process. Here we investigate patterns of gene flow, population structure, and inferred phylogenetic relationships for members of an avian superspecies, the Solomons Monarchs (Aves: Symposiachrus barbatus complex) occupying the Solomon Islands. We found that gene flow among allopatric species matches predictions based on geography, but phylogenetic relationships were not concordant with the most likely colonization history based on a stepping-stone colonization model. Notably, the most isolated island, Makira, has a species that was inferred to be sister to the taxa on all other islands in concatenated phylogenetic analyses, despite Makira being farthest from the presumed original source of immigrants. We use population genetic simulations to demonstrate that such a result could be driven by bias resulting from low levels of gene flow, reflecting a challenge in phylogeographic inference that results when one population is differentially isolated. These simulated findings demonstrate a distinguishability issue in phylogeographic inference, where gene flow and colonization history can be difficult to disentangle.
README by: Ethan Gyllenhaal
Last updated: 10 Sept 2025
Repository for scripts and input files used in a study investigating gene flow and its impacts on phylogeography in a radiation of Symposiachrus monarch flycatchers in the Solomon islands. Everything is in subfolders within a zipped file. If you use something here, please cite us, or contact EFG on who is best to cite (and remind him if he forgets to update the citation):
This README describes the scripts and data files used for the afforementioned project. The data were collected using Illumina sequencing for RAD-seq libraries, largely following the Stacks reference-based pipeline. Scripts and input data file are in the same folder for a given analysis.
Useage
The majority of the data included here is output from a Stacks bioinformatics pipeline, which was run as a Torque script on a high performancce computing center. Individual VCFs were then processed with command line tools (e.g., vcftools) or R scripts for preparing figures, tables, etc for the paper. Data file formats (e.g. vcf, phylip) are standard except when noted. Additionally, there are two sets of simulation scripts and output (SLiM and msprime). Access reccomendations per data type are below. All data types except .xlsx and compressed files can be viewed in any plain text editor, but certain files are large enough that may not be advisable depending on the editor.
Supplemental Tables out of appendix
TableS1_Sampling_table_SolSym.xlsx: Table of samples used for RAD-seq dataset. From left to right, columns are species name, subspecies (when relevant), island, country, voucher number, tissue number, sample name used in scripts, sample name used for SRA accession, number of reads, which datasets the sample was included in, and Arctos database links. The numbers in the dataset column correspond to: 1) full dataset, 2) all Solomons samples, 3) all New Georgia samples, 4) all Bukida samples, 5) samples used in DSuite tests for gene flow, and 6) samples used in SNAPP.
TableS2_UCE_sampling.xlsx: Table of samples used for UCE dataset. From left to right, columns are species name, subspecies (when relevant), island, country, voucher number, tissue number (when relevant), sample name used in scripts, sample name used for SRA accession, and number of recovered loci.
Data and other files
- .xlsx - Excel files, in our case manipulated with microsoft excel, but LibreOffice Calc and similar open source software can be used as well (and formulas transfers for at least LibreOffice).
- .vcf - Variant Call Format files of SNP data. These contain a standard set of information including position on the reference, genotype info, and depth. Full specifications can be found here: https://samtools.github.io/hts-specs/VCFv4.2.pdf. They can be viewed in any text editor, and I tend to use VCFtools to manipulate/filter them.
- .tre and .treefile - Newick format phylogeny. Best viewed with FigTree: https://tree.bio.ed.ac.uk/software/figtree/
- .stat - File output by IQTree detailing concordance factor statistics, viewed in text editors.
- .nex - Nexus format phylogeny. Best viewed with FigTree: https://tree.bio.ed.ac.uk/software/figtree/. Note that this is technically the same format as .nexus file, but they contain different fields.
- .phylip and .nexus - Two allignment file formats used for IQ-TREE and SVDQUarters. These both contain concatenated sequence information, and although they are just text files, they are best used as input for other programs.
- .fa - Allignment file format used only for storing per-locus gene trees used for locus-level IQ-TREE analyses. Can be viewed in any text editor, note that only first allele of each set was used.
- .nwk - Newick format phylogeny generated in a text editor. Can be viewed in FigTree, but note that it isn't an empirical phylogeny per say (i.e., just made based off of topology from empirical trees).
- .XML - Markdown file generated by BEAUti (https://beast.community/beauti) for use with BEAST (https://beast.community/index.html). Contains UCE allignments and details for BEAST to run analyses on the file. Can be viewed in a text editor raw or in BEAUti for user-friendly viewing/editing.
- .ped and .map - Two plink format files for genotypes and site information, respectively. Formats are outlined here: https://www.cog-genomics.org/plink/1.9/formats. In short, .ped includes population ID, individual ID, four unused columns (value of 0 or -9), and two columns of SNP information per genotype. The other, .map, includes columns for chromosome code, variant ID, an unused column, and coordinator in basepairs. Used as input for admixture.
- .trees - Nexus file of trees from the postior of SNAPP runs, used as input for DensiTree and TreeAnnotator.
- .conf - File used for configuration of the PHYLUCE pipeline for UCE data.
- .tar.gz - Compressed tar archive, can be decompressed with "tar -zxvf".
- .gz - Directory compressed with gzip, can be decompressed with "gzip".
- .jpeg or .png - Image files, used for photos of specimens, viewable in any image viewing software.
- unspecified (no extension) - These are plain text files, visible with text editor of viewer of choice.
Scripts
- .pbs and .slurm - Batch scripts submitted from command line on UNM's cluster.
- .R - R script designed to be run interactively in RStudio.
- .sh - Shell scripts used for various tasks. Run on command line locally or on the cluster.
- .py - Python scripts, run either interactively or as part of a pipeline.
- .slim - Eidos scripts for running SLiM simulations, run on the cluster, but with slight modifications can be viewed and run in SLiM's GUI as well.
Supplemental data
These are excel files that act as supplemental tables for the paper either explicitly or to make explicit how I went about correcting tests of gene flow per-category.
Zipped directories
The numbered directories each contain files used for the bioinformatics pipeline and other analyses
Stacks pipeline (01_stacks.zip)
run_stacks_pipeline.pbs - Torque submission script used to run the pipeline. It was very quick, using only a single 8-core node for <10 hours. Made to run on UNM's Center for Advanced Research Computing's Wheeler cluster.
popmaps directory - Directory of population maps used to make subset input files. Individual files are below, with the first column denoting sample names and second population assignment:
*popmap_NG - New Georgia Group samples
*popmap_bukida - Bukida Group samples
*popmap_full - All samples
*popmap_full_indiv - All samples, but with a dummy population used to output per-individual phylip files
*popmap_solo - All Solomon Island samples
*snapp_even - SNAPP input with three samples per island group and two samples from an outgroup
*snapp_five_island - SNAPP input with up to five islands per island group (two per island if group have more than one) and two samples from an outgroup
*snapp_two_island - SNAPP input with up to two islands per island group (two per island if group have more than one) and two samples from an outgroup
Adegenet (02_adegenet.zip)
SoloSympos_PCA.R - R script for the 3 PCAs in the paper (resulting in 4 figures), with the first one walking through the process in detail. The full and Bukida PCAs are in Figure 2. The New Georgia PCAs are in Figure S1.
solsym_*_75.vcf - 75% complete VCF files for subsets of the data (all Solomons birds, all Bukida birds, and all New Georgia birds, respectively).
Admixture (03_admixture.zip)
solsym_admixture_plot.R - R script for plotting Admixture results. Output used in Figure 2.
run_admixture_NG.sh - Shell script for running Admixture. Assumes a conda environment with Plink and Admixture is loaded (or similar ways of making "plink" and "admixture" call their respective programs).
solsym_NG_75.* - Plink input files from Stacks, with 75% completeness for New Georgia Samples.
solsym_NG_75_12.* - Plink files from prior line converted to 12 format for use in Admixture.
Confirming intergrade (04_confirm_intergrade.zip)
keep_interspecific - Sample list of individuals to keep for fixed SNP analysis (only one is used, rest are there to confirm we do indeed have fixed SNPs)
kolo_etc_samples - Sample list of Kolombangara and connected islands.
rano_vella_samples - Sample list of Vella Lavella and Ranongga.
solsym_solomons_75.vcf - 75% complete VCF file for Solomons subset of data.
run_interspecific_het.sh - Main script, first uses VCFtools and unix commands to find fixed SNPs (between the combined Kolombangara etc and Vella/Ranongga populations), then limits input VCF to those SNPs, then does some unix math so calculate proportion of heterozygotes for the focal individual at those sites. Comparable to what the program introgression calculates.
Pairwise Fst (05_pairwise_Fst.zip)
pairwise_Fst_vcftools.sh - Shell script used for calculating pairwise Fst between all Solomons population using VCFtools. Output used in Table 1.
pops directory - Files used to specify populations.
solsym_solomons_75.vcf - 75% complete VCF file for Solomons subset of data.
IQTree and SVDQuartet (06_iqtree_svdq.zip)
NOTE: SVDQuartets analyses were run with the PAUP GUI. Output in Figure 3 and Figure S2.
iqtree.slurm - SLURM batch script for running IQTree for inference on RAD and UCE data, as well as site concordance factor calculation for RAD data.
full_rad_75.phylip - Input for IQTree, generated with Stacks with slight manual modifications.
full_rad_75.phylip.treefile - Output from IQtree, without concordance factors.
solsym_cf_full_75.cf.tree - Phylogeny for RAD dataset with concordance factors on nodes.
solsym_cf_full_75.cf.stat - Full concordance factor data for RAD dataset.
solomons_triv_SVDQ_75.tre - SVDQuartets phylogeny for ingroup and S. trvirgatus. Tips are grouped together.
solomons_triv_SVDQ_75.nex - SVDQuartets input modified from IQTree phylip without a script. Removed all outgroups other than trvirgatus.
SNAPP (07_SNAPP.zip)
NOTE: Some steps were run with the Beauti GUI. Output in Figure 4.
snapp_input_maker.sh - Driver script for converting a VCF to diploid SNAPP input (i.e. Nexus input for processing in Beauti). Converts from VCF to haploid SNAPP using https://github.com/BEAST2-Dev/SNAPP/tree/master/script, then to diploid SNAPP using a custom python script (SNAPP_haploid_to_diploid.py), and finally adding lines for a new nexus file using sed commands. Now takes command line arguments (relative to original version).
generate_snapp_input.sh - Simple script with commands for each SNAPP input file.
SNAPP_haploid_to_diploid.py - Script for converting a tab-delimited set of haploid nexus SNP data to diploid data. Takes and outputs "middle" lines, header and footer done by "snapp_input_maker.sh". Note that it's "dumb" and needs the input to be named "snapp_dip.txt" and outputs "snapp_out.txt".
snapp100_*.vcf - 100% complete VCF for SNAPP samples of a given set: even = 3 samples from 1 island per group; group = up to 4 samples per island group, with 2 islands (2 samples each) for Bukida and New Georgia groups; island = up to 10 samples per island group, with 5 islands (2 samples each) for Bukida and New Georgia groups.
snapp100_*.nex - Nexus file used to make Beast XML for SNAPP.
snapp100_*.xml - Beast XML input file. Simply run with the line: beast -threads 32 solsym_limit100.xml. (adding --resume if needed)
snapp*.trees - The first N trees from a given SNAPP run (resuming continues past limit, to if limit was 10 million we pruned extra trees past that).
treeanno_*.tre - TreeAnnotator summary trees.
UCEs Phylogenies (08_UCEs.zip)
NOTE: SVDQuartets analyses were run with the PAUP GUI. Output in Figure 3. IQTREE script is in section 06.
run_phyluce_solo_triv.sh - Shell script for running PHYLUCE pipeline. Starts with cleaned reads and outputs Nexus and Phyllip files.
solsym_uce.conf - Taxon set .conf file. Used in several phyluce calls.
solsym_spades.conf - Spades .conf file, note paths are changed to generic ones up until clean reads folder. Used in phyluce_assembly_assemblo_spades.
solsym_uce_contigs.tar.gz - All of the contigs.fa outputs from spades, tarballed and gzipped.
sympos_uce_nexus_90.nexus - 90% complete nexus file used for SVDQuartets analysis.
sympos_uce_SVDQ_90.tre - Tree produced by SVDQ, combining samples by population.
sympos_uce_raxml_90.phylip - 90% complete phylip file used for RAxML analysis.
sympos_uce_raxml.tre - Tree produced by RAxML, combining samples by population.
DSuite (09_DSuite.zip)
NOTE: The topologies and popmaps have 6 consistent name corresponding to subfigures in Figure S3. B = main, C = noMal, D = conservative, E = conservative_noMal, F = malbar G = malbar_conservative.
run_D.sh - Shell script used for running Dsuite. The first one is used for the main ABBA/BABA results (Figure 5). The first and the rest are used for fbranch (Figure S3).
fbranch_topologies/* - Topologies used for fbranch, outlined below:
*topo_main.nwk - The main topology.
*topo_noMal.nwk - Malaita removed.
*topo_conservative.nwk - Certain populations removed to avoid uncertainty within island groups.
*topo_conservative_noMal.nwk - As above, but also with Malaita removed.
*topo_malbar.nwk - New Georgia samples removed.
*topo_malbar_conservative.nwk - New Georgia samples and one uncertain Bukida sample removed.
pop_maps/* - Population maps used for the runs. All but two downsampled ones match the above topologies. These downsampled maps match the main topology:
*map_downsample3 - Population map with a maximum of 3 samples per population.
*map_downsample2 - Population map with a maximum of 2 samples per population.
Genetrees (10_genetrees.zip)
all_genetrees.slurm - Slurm batch script for making locus-specific fastas and running gene trees with given numbers of parsimony informative sites.
check_monophyly_all.py - Python script for outputting proportion and branch length stats for gene trees.
locus_fastas.fa - Fasta of locus-specific fastas.
popmap_genetrees - Population map for gene tree samples (one per island group).
*.zip - Zipped directories of locus fastas and IQ-TREE output for fastas with given number of parsimony informative sites, as listed below:
*all.zip - At least one PIS
*all_PIS2.zip - At least two PIS
*all_PIS3.zip - At least three PIS
Photos of intergrade (11_photos.zip)
All of these are photos of the intergrade (center), Ranongga samples (left two), and Kolombangara samples (right two). They show the view over the underparts (Belly_View_Full.jpeg), sides (Side_View_Full.jpeg), and the main one used in Figure 2 (Main_hybrid_comp.png).
Colonization simulations (12_colonization_sims.zip)
solsym_nonWF_carc_loui_v2.slim - Slim script used to run colonization simulations.
run_loui_colonize.slurm- Slurm script used to run simulations in parallel. Requires a conda environment with SLiM.
drive_summarize.sh - Script used to modify output then run python script.
summarize_loui.py - Script used to summarize the output from slim.
summarized_*.txt - Text files output by summarize script. Each row is for a given dispersal value. Three files correspond to the proportion of simulations where a given island group met specific conditions (columns 2, 3, 4, and 5 correspond to New Georgia, Malaita, Makira, and Bukida respectively): "first" means which was colonized first, "last" means which was colonized last, and "success" means which islands were successfully colonized of the simulations where the Solomons were colonzied. The path file is colonization paths, with columns corresponding to NLKB, NLBK, NKLB, NKBL, NBLK, NBKL, LNKB, LNBK, LKNB, LKBN, LBNK, LBKN, KNLB, KNBL, KLNB, KLBN, KBNL, KBLN, BNLK, BNKL, BLNK, BLKN, BKNL, then BKLN (B=Bukida, N=New Georgia, L=Malaita, K=Makira).
slim_sankey.R - R script for plotting colonization paths in a Sankey diagram. Note that paths are converted from proportions to numbers and summed across all simulations, then assigned manually in R (couldn't figure out a smart way to do it).
Phylogenetic simulations (13_phylo_sims.zip)
new_solo_phylo_msp.py - Script for running msprime simulations, made to be run in parallel.
run_solomons_msp.slurm - Slurm batch script for running replicates of msprime followed by IQTREE. Requires a conda environment with msprime and IQTREE (plus other python packages).
variant_fasta.sh - Script for removing invariant sites to keep alignment files smaller for storage purposes.
combined_sumstats.sh - Simple script for combining summary stat files in parallel.
summarize_*.txt - Tables summarizing topologies from simulations that started in Bukida (_buk) or Makira (_mak). Columns are the row name (stem name), 15 possible topologies (values are counts), count of non-monophyletic trees, count of poorly resolved trees, the stem name of the simulation set, dispersal distance, effective population size, post-split time, and split interval.
summarize_trees_bs.py - Script for parsing and classifying topologies from set of trees.
summarize_trees.slurm - Slurm script for running summarize_trees_bs.py.
plot_msp_heatmap.R - Script for plotting heatmap of phylogenetic simulation results.
output_trees.zip - Zipped directory of trees output by IQTREE. The "d" corresponds to dispersal distance (0.0001 acts as 0), n to total population time, t for post-split time, i for split interval, and r is the replicate number.
output_sim.zip - Zipped directory of fastas and summary stats output by msprime. The "d" corresponds to dispersal distance (0.0001 acts as 0), n to total population time, t for post-split time, i for split interval, and r is the replicate number.
Additional summary simulation files (14_bonus_sim_stats.zip)
plot_sumstats.R - Script for plotting a lot of data exploration plots not used in the paper.
source_d* - Source population data from SLiM simulations for a range of dispersal values. From left to right, columns are the source population values for the New Georgia Group, Malaita, Makira, and the Bukida Group.
sumstat_d* - Summary stat (pi and private alleles) data for SLiM simulations for a range of dispersal values. From left to right, the first four columns are for pi and second four for private allele count, each in the order of: New Georgia Group, Malaita, Makira, Bukida Group.
sumstat_combined - Msprime output for all simulations, first with parameters then with population-specific summary stats. The first 6 columns are: dispersal distance, total Ne, post-split divergence time, split interval, simulation start point, and replicate number. Next four are pi (for Bukida, New Georgia, Malaita, Makira) and then number of private alleles (in the same order).
The dataset is mostly a RAD-seq dataset, generating 1000s of loci across the genome, which were then sequenced on an Illumina sequencer. These were then processed with Stacks, with input generated for downstream analyses using several different scripts outlined in the document. Additionally, we ran two sets of simulations to demonstrate the theorhetical basis of two of our major claims (colonization and the impact of geographically mediated gene flow on phylogenetic inference).
There is a README.txt file included which describes the script and datasets included. Datasets are divided by mode of inference, with scripts associated with given input files (e.g., vcf, phylip, or nexus files) in the same .zip files. I did my best to rename input files in a meaningful way, and change my scripts accordingly, but most scripts aren't designed to run everything with the same directory structure/naming format as the zipped directory (i.e., some alteration needed). Scripts are also uploaded individually to Zenodo. Scripts, accessory files, and trees are also located on my GitHub: https://github.com/ethangyllenhaal/SolomonsSymposRad.
Most figures were edited in Adobe Illustrator to change, e.g., text and line format, but no data were fundamentally altered.
- Gyllenhaal, Ethan F; Klicka, Lukas B; DeCicco, Lucas H et al. (2025). Gene flow complicates phylogenetic inference in an archipelago radiation. Systematic Biology. https://doi.org/10.1093/sysbio/syaf081
