Data and code from: Diversification and divergence in Myioborus warblers: insights into evolutionary relationships and plumage genetics
Data files
May 27, 2026 version files 18.68 MB
-
code-data.zip
18.68 MB
-
README.md
8.05 KB
Abstract
To understand the history of diversification in the genus Myioborus and uncover genes underlying a plumage difference between two young hybridizing taxa, we analyzed whole-genome sequencing data from 11 of the 12 species in the family, comprising 18 subspecies. We estimated phylogenies from UCE and mitochondrial data, calculated metrics of genetic variation among a young clade in the tropical Andes, and examined signals of differentiation between hybridizing taxa. The code and data presented here will permit reproduction of our results.
The code and data files can be used to reproduce the analysis from Cespedes-Arias & Bennett et al. 2026.
For questions/concerns about the lostruct analysis, check with lncespedesarias [at] gmail [dot] com.
For the rest of the analyses, check with kevinfpbennett [at] gmail [dot] com.
Folder: data-processing
Code file:
rim-align-sort-mark-index.sh
This file processes raw reads and makes bams. It requires the S. coronata reference and raw reads, both of which are available from NCBI. You may need to alter the naming convention based on the form you get the reads.
Folder: variant-calling
Code files:
call-variants_mito.sh
Calls variants from mitochondrial DNA. Produces an all-sites VCF, so not just variant sites are output.
filter_mito.sh
Filters sites from the above VCF. This uses minimal filtering (GQ > 10 and DP > 10) because coverage is universally high on the mtDNA. The resulting VCF file is provided.
call-variants_example.sh
Variant calling was performed separately by chromosome, so this running this script and changing the region will replicate the full variant calling method.
filter_example.sh
This is the filtering script, again, for one chromosome. The depth cutoffs were the same for each chromosome.
concat.bcfs.sh
After filtering, concatenate the bcfs to make a gzipped vcf.
Data files:
pops.list
A file to use in the call-variants script
ploidy.list
A file to use in the call-variants script
samples.list
A file to use in the call-variants script
myiob.filt1.mito.vcf.gz
Output from the mitochondrial variant calling. The nuclear DNA VCF is too large to include here (80 GB).
Folder: phylogenies
Code files:
make-fastas_mito.sh
Makes a whole-mitochondrion fasta file for each individual in the dataset.
iqtree_mito.sh
Makes the mtDNA tree
make-fastas_uces.sh
Loops through a bed file and list of samples to make a fasta file for each individual containing sequences from the bed file locations. In this case, they are UCE loci plus flanking regions with 2kb total lengths.
phyluce.sh
Aligns UCE loci, trims with GBlocks, keeps only loci represented in all taxa, and concatenates the alignments. It also produces non-concatenated fasta alignments for individual gene trees.
iqtree_concat.sh
Runs IQ-TREE to make the concatenated phylogeny with branch support. Includes model selection to pick the best-support sequence evolution model. This is then used in iqtree_gene-trees.sh.
iqtree_gene-trees.sh
Runs IQ-TREE on each UCE "gene" tree separately using the model selected by the concatenated run.
treeqmc.sh
Runs TREE-QMC on the concatenated treefiles from the gene-trees step. Produces a tree with quartet support values but no branch lengths. If you want branch lengths, run iqtree_treeqmc-bls.sh. But that tree will not have the quartet supports.
iqtree_treeqmc-bls.sh
Runs IQ-TREE with the topology set as the TREE-QMC topology and the alignment as the concatenated UCE alignment. This gives branch lengths for the TREE-QMC topology, but loses quartet supports.
Data files:
uce-regions-2k.bed
This gives the location of the UCE loci and flanking sequences previously identified in the S. coronata reference genome. For use in make-fastas_uces.sh.
myiob.ind.list
This file is a sample list to iterate through when making the fastas.
mito.tree
The mitochondrial phylogeny
concat.tre
The concatenated UCE phylogeny
genes.tre
The concatenated UCE "gene" trees. This is the input to TREE-QMC.
species-nobl.tre
The TREE-QMC phylogeny with no terminal branch lengths
Folder: genomewide-fst
Code files:
plink_make-pgen.sh
Makes the pgen file needed to run the FST script
plink_fst.sh
Calculates pairwise FST values between taxa set in the pops.pheno file.
Data files
pops.pheno
Input to the fst script that tells plink which samples are in each category (in this case, taxa).
myiob.fst.summary
The FST output.
Folder: pcas
Code files:
GLs_andes.sh, GLs_andes-noalb.sh, GLs_chrysops-bairdi.sh
These three scripts make the beagle files used as input to pcangsd. They just provide different numbers of samples as inputs.
pcangsd_andes.sh, pcangsd_andes-noalb.sh
These are the pcangsd scripts that correspond to the above genotype likelihood scripts.
pcangsd_bairdi-chrysops_zregs.sh
This script runs pcangsd four times, one for each of the PCAs shown in Figure 5. The naming (ctl1, ctl2, ctl3, and focal) corresponds to panels C, D, E, and F, respectively.
Data files:
andes.cov, andes-noalb.cov
The covariance matrix outputs from pcangsd. Get eigenvectors/values with eigen() in R.
chrysops-bairdi.zctl1.beagle.gz, chrysops-bairdi.zctl2.beagle.gz, chrysops-bairdi.zctl3.beagle.gz, and chrysops-bairdi.zfocal.beagle.gz
Genotype likelihood files produces by GLs_chrysops-bairdi.sh and then manually subset to the regions of interest. See Figure 5 for locations on the Z.
chrysops-bairdi.zctl1.cov, chrysops-bairdi.zctl2.cov, chrysops-bairdi.zctl3.cov, and chrysops-bairdi.zfocal.cov
The covariance matric outputs from pcangsd in the Z regions (see Figure 5).
Folder: windowed-fst
Code files:
windowed-fst_bairdi-chrysops.sh
This script calculates FST in 10kb non-overlapping windows between M. m. bairdi and M. o. chrysops.
angsd.source.R
A script that loads the correct libraries and creates functions needed for running the plotting scripts for windowed FST in R.
windowed.fst.and.mds.R
An R script that takes in the angsd windowed FST output plus several other files (provided here) and performs the outlier analysis and plotting. It also plots the MDS results, so come back to it once you have run lostruct.
Data files:
bairdi-chrysops.popgen
Output from angsd that gives the windowed FST values
clustering_windows.csv
A file that gives the 1 Mb windows used to evaluate the Setophaga pairs for outlier overrepresentation. Columns: chromosome - chromosome name, bin_start - start of window, bin_end - end of window
setophaga_1mb_windows.csv
A file that gives the 1 Mb windows that showed an overrepresentation of outliers among any of nine pairs of species in Setophaga. Calculated for Baiz et al. 2021. Columns: chromosome - chromosome name, bin_start - start of window, bin_end - end of window, two sets of 9 columns each listing a pair of species - the first instance of each pair is the number of outlier 10kb windows per 1 Mb bin while the second instance is the either a 1 or 0 depending on whether there are >=8 outliers in the corresponding species pair, sum - summed counts from the second instance of each species pair set.
BBWA-BLPW.fst.csv
More results from Baiz et al. 2021. FST for the bay-breasted vs. blackpoll warbler comparison. Used in this paper just for plotting. Columns: chr - chromosome name, midPos - the center position of each chromosomal window, Nsites - number of sites used in the FST calculation, fst - FST.
mds_allchroms.csv
Final output from lostruct analysis. Used by the plotting script above but located in the lostruct directory. See that directory for details.
Folder: lostruct
Code files:
lostruct_10kb.job
With the main VCF split into chromosomes with bcftools, this sets up and runs the lostruct script in R.
lostruct_10kb.R
R script for running lostruct itself in 10kb windows. This runs separately per chromosome-VCF.
lostruct_plot_all-chroms.R
Outlier and plotting code for MDS results
Data files:
mds_allchroms.csv
The final result of the lostruct analysis, produced by combining each chromosome result. Columns: chrom - chromosome name, start - start of window, end - end of window, V1 - MDS factor 1, V2 - MDS factor 2.
