Repeated waves of colonization of a great speciator
Data files
Apr 21, 2025 version files 1.40 GB
-
01_stacks.zip
1.47 KB
-
02_adegenet.zip
14.85 MB
-
03_sNMF.zip
5.35 MB
-
04_pairwise_Fst.zip
8.43 MB
-
05_phylo.zip
9.27 MB
-
06_treemix.zip
55.54 KB
-
07_DSuite.zip
7.51 MB
-
08_manhattan.zip
29.81 MB
-
09_UCEs.zip
1.86 MB
-
10_annotation.zip
1.32 GB
-
11_splitstree.zip
7.36 MB
-
Pachy_final_d_table.xlsx
24.25 KB
-
README.md
15.50 KB
-
Table_S1_Sampling_table_PachyRAD.xlsx
17.59 KB
-
Table_S2_Sampling_table_UCE.xlsx
14.72 KB
Abstract
Secondary contact between previously allopatric lineages offers a test of reproductive isolating mechanisms that may have accrued in isolation. Such instances of contact can produce stable hybrid zones—where reproductive isolation can further develop via reinforcement or phenotypic displacement—or result in the lineages merging. Ongoing secondary contact is most visible in continental systems, where steady input from parental taxa can occur readily. In oceanic island systems, however, secondary contact between closely related species of birds is relatively rare. When observed on sufficiently small islands, relative to population size, secondary contact likely represents a recent phenomenon. Here, we examine the dynamics of a group of birds whose apparent widespread hybridization influenced Ernst Mayr’s foundational work on allopatric speciation: the whistlers of Fiji (Aves: Pachycephala). We demonstrate two clear instances of secondary contact within the Fijian archipelago, one resulting in a hybrid zone on a larger island, and the other resulting in a wholly admixed population on a smaller, adjacent island. We leveraged low genome-wide divergence in the hybrid zone to pinpoint a single genomic region associated with observed phenotypic differences. We use genomic data to present a new hypothesis that emphasizes rapid plumage evolution and post-divergence gene flow.
README by: Ethan Gyllenhaal
Last updated: 14 April 2025
Repository for scripts and input files used in a study investigating the phylogeographic history of Pachycephala whistlers in Fiji. Everything is in subfolders within a zipped file. If you use something here, please cite us.
This README describes the scripts and data files used for the aforementioned project. The data were collected using Illumina sequencing for RAD-seq libraries, largely following the Stacks reference-based pipeline. Data uploaded to Dryad is in zipped directories outlined below, code relevant to analyses is under the relevant section but stored on Zenodo (https://doi.org/10.5281/zenodo.12774334) and github (https://github.com/ethangyllenhaal/FijiPachyRad). Code stored on Github and Zenodo is denoted by ***.
Usage
The majority of the data included here is output from a Stacks bioinformatics pipeline, which was run as a Torque script on a high performancce computing center. Individual VCFs were then processed with command line tools (e.g., vcftools) or R scripts for preparing figures, tables, etc for the paper. Data file formats (e.g. vcf, phylip) are standard except when noted. Access reccomendations per data type are below. All data types except .xlsx and .treemix.gz can be viewed in any plain text editor, but certain files are large enough that may not be advisable depending on the editor.
Data and other files
- .xlsx - Excel files, in our case manipulated with microsoft excel, but LibreOffice Calc and similar open source software can be used as well (and formulas transfers for at least LibreOffice).
- .vcf - Variant Call Format files of SNP data. These contain a standard set of information including position on the reference, genotype info, and depth. Full specifications can be found here: https://samtools.github.io/hts-specs/VCFv4.2.pdf. They can be viewed in any text editor, and I tend to use VCFtools to manipulate/filter them.
- .tre - Newick format phylogeny. Best viewed with FigTree: https://tree.bio.ed.ac.uk/software/figtree/
- .nex - Nexus format phylogeny. Best viewed with FigTree: https://tree.bio.ed.ac.uk/software/figtree/. Note that this is technically the same format as .nexus file, but they contain different fields.
- .phylip and .nexus - Two allignment file formats used for IQ-TREE and SVDQUarters. These both contain concatenated sequence information, and although they are just text files, they are best used as input for other programs.
- .treemix.gz - Compressed TreeMix input, used as input for TreeMix. If viewing is needed, it can be viewed with zless or uncompressed with gunzip and viewed in text editor of choice.
- .nwk - Newick format phylogeny generated in a text editor. Can be viewed in FigTree, but note that it isn't an empirical phylogeny per say (i.e., just made based off of topology from empirical trees).
- .list - Plain text file, can be viewed in any text editor.
- .XML - Markdown file generated by BEAUti (https://beast.community/beauti) for use with BEAST (https://beast.community/index.html). Contains UCE allignments and details for BEAST to run analyses on the file. Can be viewed in a text editor raw or in BEAUti for user-friendly viewing/editing.
- .fasta - FASTA format reference genome (and intermediates). This can be viewed in text editors that are good at handling large files, but more often used in pipelines.
- .phy - Phylip format distance matrix. Can be viewed with any text editor/viewer, but generally used with SplitsTree4 (not the new app, the older program in the 3rd download link here: https://uni-tuebingen.de/en/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/splitstree/).
- unspecified (no extension) - These are plain text files, visible with text editor of viewer of choice.
Scripts
- .pbs and .slurm - In Zenodo upload and github, batch scripts submitted from command line on UNM's cluster.
- .R - In Zenodo upload and github, R script designed to be run interactively in RStudio.
- .sh - In Zenodo upload and github, shell scripts used for various tasks. Run on command line locally or on the cluster.
Supplemental data
These are excel files that act as supplemental tables for the paper either explicitly or to make explicit how I went about correcting tests of gene flow per-category.
Pachy_final_d_table.xlsx - Table of complete ABBA/BABA results to fully illustrate procedure for Benjamini-Hochberg corrections and calculation of harmonic p values. Note that this excel sheet was used for calculations, so if it is used for other analyses, ensure that raw numbers are correctly output. Columns are:
- P1 - P3 - The taxa used for ABBA/BABA tests, with gene flow tested only between P2 and P3 (i.e., program doesn't allow negative D values)
- Dstatistic - The value of the D statistic
- p-value - P value based on Z score for the jackknifing
- f4-ratio - estimate of admixture fraction for the comparisons
- Category - The edge category used for multiple test corrections
- Rank - The rank within the category
- B-H value - Critical value for BH correction
- BH Significant? - If the p value is less than the critical value
- p - BH - The p value minus the B-H value, with values less than zero representing significant comparisons
- Harmonic mean - Harmonic mean p value for the category, only shown for first entry of the category
- Harmonic calc - Row for calculating the harmonic mean p value
Table_S1_Sampling_table_PachyRAD.xlsx - Table of samples used for RAD-seq dataset. The columns are:
- Species - Genus and species for the sample
- Subspecies - Subspecies of the sample (largely following Mayr 1932, but S Viti Levu birds may have been assigned to optata there)
- Sex - Sample sex if known, generally by plumage and gonad data
- Male phenotype score - Band and Lore score used for admixture mapping. Male phenotype was only scored for adult with study skins at the University of Kansas Biodiversity Institute (KU).
- Country - Country sample is from
- Island - Sample's island, with locality in parentheses for larger islands
- Tissue number - The institution and tissue number for the sample
- Voucher number - The institution and voucher number for the sample
- Script Sample Name - The sample name used in scripts (note older taxonomy)
- SRA Sample Name - The name used for the sample in the SRI upload
- NCBI Accession - NCBI sample number
- Number of reads - Number of cleaned reads for RAD dataset
- Datasets - Which dataset(s) the sample was used in. Dataset numbers correspond to: 1) yellow-throated taxa only, 2) all taxa, 3) all taxa, phylogenetically confounding individuals removed individuals removed, and 4) no Ovalau, phylogenetically confounding individuals removed.
Table_S2_Sampling_table_UCE.xlsx - Table of samples used for UCE phylogeny and dating analysis. The columns are:
- Species - Genus, species, and sometimes subspecies for the sample
- Country - Country sample is from
- Island - Sample's island (if relevant), with locality in parentheses for larger islands
- Institution - The institution the sample is from. LSUMNS = Lousiana State University, KUNHM = University of Kansas, CAS = California Academy of Sciences, AMNH = American Museum of Natural History, DMNH = Delaware Museum of Natural History, ANWC = Australian National Wildlife Collection, UWBM = Burke Museum
- Identifier - Identifier used for sample, varying based on tissue type
- Source - Sample type, either frozen tissue or toepad
- Sample Name - Name used for sample in tree
- NCBI Accession - NCBI sample assignment, either from this paper's upload or Brady et al.'s upload
- Platform - Sequencing platform used
- Number of clean reads - Number of clean reads for the sample, blank value corresponds to sample where only contigs were found on our server (but raw reads are uploaded to SRA)
- Number of recovered loci - Number of UCE loci for the sample
- Voucher number - Voucher number for sample, for toepads and certain tissues, this matches the Identifier
Zipped directories
The numbered directories each contain files used for the bioinformatics pipeline and other analyses
Stacks pipeline (01_stacks)
***run_pachy_rad.pbs - Torque submission script used to run the pipeline. It was very quick, using only a single 8-core node for <10 hours. Made to run on UNM's Center for Advanced Research Computing's Wheeler cluster. Note that this used the original reference without abbreviated names, annotation was done afterwards with the same pseudochromosome backbone, but without re-reunning the pipeline.
popmaps directory - Directory of population maps used to make subset input files. Individual files are below, with the first column denoting sample names and second population assignment
- popmap_full - population map of all samples
- popmap_graeffii - population map of all samples of the main ingroup (i.e., yellow-throated taxa)
- popmap_indiv - population map of all samples, but with a dummy population used to output per-individual phylip files
Adegenet (02_adegenet)
***PachyHybrid_PCA.R - R script for the 2 PCAs in the paper, with the first one walking through the process in detail. The graeffii (yellow-throated) PCA is in Figure 2. The other (all population) PCA is in Figure S1.
pachy_*_75.vcf - 75% complete VCF files for subsets of the data (all yellow-throated and all RAD-seq samples).
sNMF (03_sNMF)
***pachy_snmf.R - R script for the sNMF plot in the paper (Figure 2).
pachy_nosingle_graef75.vcf - 75% complete VCF file without singletons for all yellow-throated.
Pairwise Fst (04_pairwise_Fst)
***pairwise_Fst_vcftools.sh - Shell script used for calculating pairwise Fst between all population using VCFtools. Output used in Table S3. Note that it was run including only the pseudo-Z chromosome or excluding only the pseudo-Z chromosome (the 3rd command line argument). This is "--chr PseudoZ_dnac_ctaeGut3.2.4Z1728613511_REF" or "--not-chr PseudoZ_dnac_ctaeGut3.2.4Z1728613511_REF", respectively.
pops directory - Files used to specify populations, with each file populated by sample names. Each population is named after the island and (at times) sampling locality. Population files inside are: kadavu, kioa, ogea, ovalau, rabi, santacruz, taveuni, vanuaC, vanuaE, vanuaW, vitiN, vitiS.
pachy_full75.vcf - 75% complete VCF file for the full dataset.
RAD Phylogenetic (05_phylo)
IQ-TREE Output in Figure 2. SVDQuartets output in Supplement.
pachy_rad77_90.tre - IQ-TREE phylogeny for all samples.
pachy_rad77_90.phylip - Input for IQ-TREE, a 75% complete phylip matrix, generated with Stacks and slight manual modifications (removing comment at the end).
pachy_*_90.nexus - Nexus alignments with taxset info used for SVDQuartets for all taxa and excluding Ovalau.
pachy_*_90_100kquart.nex - Nexus trees output by SVDQuartets (note that "branch times" are support values), generated with above files and 100k evaluated quartets.
Treemix (06_treemix)
***run_treemix.sh - Script of TreeMix commands to run, for 6 different migration values, with jackknifing. This can be easily parallelized with GNU parallel, but wasn't needed with this small dataset.
***treemix_plot.R - Simple script for plotting TreeMix output, requires plotting functions provided by the developers.
pachy_full90.treemix.gz - Input for treemix, produced by Stacks then gzipped (TreeMix likes gz input).
DSuite (07_DSuite)
***run_D.sh - Shell script used for running Dsuite. The first one is used for the main ABBA/BABA results (Fig 3, Table S4-5), and the second excludes Ovalau to avoid over-estimating gene flow and allow more accurate multiple comparisons. Both had fbranch run, but only the first is presented in Figure S4).
pachy_full90.vcf - Input VCF for the analysis.
topo{subset}.nwk and pachy[full/subset] - Topologies (.nwk) and population maps (no extension) used for main and subset (i.e., no Ovalau) analysis. Note that the popmaps list the outgroup as Outgroup as opposed to a geographic/taxonomic name. The format is otherwise standard (one column sample name, one column population name).
FST Manhattan Plots (08_manhattan)
***windowed_fst_vcftools.sh - Wrapper shell script for using vcftools to generate windowed FST between a set of comma separated population pairs. Uses an unfiltered VCF and list of population pairs to be tested, plus a population directory full of population lists.
pachy_full_fst.vcf - Unfiltered VCF used in this analysis. Filtering is done in VCFtools instead of stacks.
***rename_fst.sh - Script for converting chromosome names to numbers in fst output for qqman. Numbers are simple integers, so chromosomes 1, 1A, and 1B and 1, 2, and 3.
fst_pairs - Text file of comma separated pairs tested here.
intervals.list - List of intervals in the reference, used by renaming script.
***fst_plot.R - R script for making Manhattan plots with qqman. Uses a function written for qqman users and makes 12 plots (3 main text only, 3 main and supp, 6 supp only; Fig 4 and S5).
pop directory - Directory of population lists used to calculate FST. Each file is named after the phenotypic subset (band* and index*) or geographic subset of individuals. The full list is: band01, band23, index01, index345, kioa, lores0, lores12, ovalau, rabi, taveuni, vanuaC, vanuaE, vanuaW, viti.
UCEs Phylogeny and Dating (09_UCEs)
NOTE: Phylogenetic analyses were run by MJA (other scripts by EFG). Output in Figure 1.
beast95per_10runs_logcombined.tre - Phylogeny presented in Fig. 1, with PP values.
pachy_fiji36_95per_*_HKYG-mrca.XML - XML files for individual BEAST runs, representing both the run parameters and the alignments used for the analysis.
References and annotation (10_annotation)
***mask_repeats.slurm - Slurm script for running RepeatMasker. Takes in the pseudochromosome level genome as input.
***gemoma.slurm - Slurm script for running gemoma with NCBI annotations as reference (downloaded and renamed prior to script). This used a node with 1TB memory, but it didn't need all of it. Input (pachy_pseudochrom_zfRM.fasta) is from mask_repeats.slurm.
pachy_original_ref.fasta - Reference output by Supernova, input into Satsuma for making pseudochromosomes.
pachy_pseudochrom_ref.fasta - Base pseudochromosome reference, used for running Stacks.
pachy_pseudochrom_ref_shortname.fasta - Reference used as input for repeat masking, removed metadata in chromosome names from Satsuma.
pachy_pseudochrom_zfRM.fasta - Reference used for annotation.
Phylogenetic Network (11_splitstree)
***11_stampp_splitstree_input.R - R script for making distance matrix with VCF input.
full_90_dist.phy - Distance matrix used for network.
pachy_full90.vcf - 90% complete VCF file for the full dataset.
Other repositories
This data is also available at https://github.com/ethangyllenhaal/FijiPachyRad.
This manuscript is primarily based on a RAD-seq dataset, with supplementary ultraconserved element target capture data. The RAD-seq data was processed with Stacks, and UCE data with phyluce. The RAD data was used for a variety of phylogeographic analysis, including ones investigating population structure, phylogeny, divergence landscape, and gene flow. This upload includes a README that details scripts used to produce data, some intermediate files, and final outputs.
- Gyllenhaal, Ethan; Brady, Serina; DeCicco, Lucas et al. (2025). Repeated waves of colonization of a great speciator. Zenodo. https://doi.org/10.5281/zenodo.12774333
- Gyllenhaal, Ethan; Brady, Serina; DeCicco, Lucas et al. (2025). Repeated waves of colonization of a great speciator. Zenodo. https://doi.org/10.5281/zenodo.12774334
- Gyllenhaal, Ethan F.; Brady, Serina S.; DeCicco, Lucas H. et al. (2024). Waves of Colonization and Gene Flow in a Great Speciator [Preprint]. Cold Spring Harbor Laboratory. https://doi.org/10.1101/2024.07.18.603796
