Data and code from: Diversification and divergence in Myioborus warblers: insights into evolutionary relationships and plumage genetics

Data files

May 27, 2026 version files 18.68 MB

code-data.zip

18.68 MB
README.md

8.05 KB

Abstract

To understand the history of diversification in the genus Myioborus and uncover genes underlying a plumage difference between two young hybridizing taxa, we analyzed whole-genome sequencing data from 11 of the 12 species in the family, comprising 18 subspecies. We estimated phylogenies from UCE and mitochondrial data, calculated metrics of genetic variation among a young clade in the tropical Andes, and examined signals of differentiation between hybridizing taxa. The code and data presented here will permit reproduction of our results.

The code and data files can be used to reproduce the analysis from Cespedes-Arias & Bennett et al. 2026.

For questions/concerns about the lostruct analysis, check with lncespedesarias [at] gmail [dot] com.

For the rest of the analyses, check with kevinfpbennett [at] gmail [dot] com.

Folder: data-processing

Code file:

rim-align-sort-mark-index.sh

This file processes raw reads and makes bams. It requires the S. coronata reference and raw reads, both of which are available from NCBI. You may need to alter the naming convention based on the form you get the reads.

Folder: variant-calling

Code files:

call-variants_mito.sh

Calls variants from mitochondrial DNA. Produces an all-sites VCF, so not just variant sites are output.

filter_mito.sh

Filters sites from the above VCF. This uses minimal filtering (GQ > 10 and DP > 10) because coverage is universally high on the mtDNA. The resulting VCF file is provided.

call-variants_example.sh

Variant calling was performed separately by chromosome, so this running this script and changing the region will replicate the full variant calling method.

filter_example.sh

This is the filtering script, again, for one chromosome. The depth cutoffs were the same for each chromosome.

concat.bcfs.sh

After filtering, concatenate the bcfs to make a gzipped vcf.

Data files:

pops.list

A file to use in the call-variants script

ploidy.list

A file to use in the call-variants script

samples.list

A file to use in the call-variants script

myiob.filt1.mito.vcf.gz

Output from the mitochondrial variant calling. The nuclear DNA VCF is too large to include here (80 GB).

Folder: phylogenies

Code files:

make-fastas_mito.sh

Makes a whole-mitochondrion fasta file for each individual in the dataset.

iqtree_mito.sh

Makes the mtDNA tree

make-fastas_uces.sh

Loops through a bed file and list of samples to make a fasta file for each individual containing sequences from the bed file locations. In this case, they are UCE loci plus flanking regions with 2kb total lengths.

phyluce.sh

Aligns UCE loci, trims with GBlocks, keeps only loci represented in all taxa, and concatenates the alignments. It also produces non-concatenated fasta alignments for individual gene trees.

iqtree_concat.sh

Runs IQ-TREE to make the concatenated phylogeny with branch support. Includes model selection to pick the best-support sequence evolution model. This is then used in iqtree_gene-trees.sh.

iqtree_gene-trees.sh

Runs IQ-TREE on each UCE "gene" tree separately using the model selected by the concatenated run.

treeqmc.sh

Runs TREE-QMC on the concatenated treefiles from the gene-trees step. Produces a tree with quartet support values but no branch lengths. If you want branch lengths, run iqtree_treeqmc-bls.sh. But that tree will not have the quartet supports.

iqtree_treeqmc-bls.sh

Runs IQ-TREE with the topology set as the TREE-QMC topology and the alignment as the concatenated UCE alignment. This gives branch lengths for the TREE-QMC topology, but loses quartet supports.

Data files:

uce-regions-2k.bed

This gives the location of the UCE loci and flanking sequences previously identified in the S. coronata reference genome. For use in make-fastas_uces.sh.

myiob.ind.list

This file is a sample list to iterate through when making the fastas.

mito.tree

The mitochondrial phylogeny

concat.tre

The concatenated UCE phylogeny

genes.tre

The concatenated UCE "gene" trees. This is the input to TREE-QMC.

species-nobl.tre

The TREE-QMC phylogeny with no terminal branch lengths

Folder: genomewide-fst

Code files:

plink_make-pgen.sh

Makes the pgen file needed to run the FST script

plink_fst.sh

Calculates pairwise FST values between taxa set in the pops.pheno file.

Data files

pops.pheno

Input to the fst script that tells plink which samples are in each category (in this case, taxa).

myiob.fst.summary

The FST output.

Folder: pcas

Code files:

GLs_andes.sh, GLs_andes-noalb.sh, GLs_chrysops-bairdi.sh

These three scripts make the beagle files used as input to pcangsd. They just provide different numbers of samples as inputs.

pcangsd_andes.sh, pcangsd_andes-noalb.sh

These are the pcangsd scripts that correspond to the above genotype likelihood scripts.

pcangsd_bairdi-chrysops_zregs.sh

This script runs pcangsd four times, one for each of the PCAs shown in Figure 5. The naming (ctl1, ctl2, ctl3, and focal) corresponds to panels C, D, E, and F, respectively.

Data files:

andes.cov, andes-noalb.cov

The covariance matrix outputs from pcangsd. Get eigenvectors/values with eigen() in R.

chrysops-bairdi.zctl1.beagle.gz, chrysops-bairdi.zctl2.beagle.gz, chrysops-bairdi.zctl3.beagle.gz, and chrysops-bairdi.zfocal.beagle.gz

Genotype likelihood files produces by GLs_chrysops-bairdi.sh and then manually subset to the regions of interest. See Figure 5 for locations on the Z.

chrysops-bairdi.zctl1.cov, chrysops-bairdi.zctl2.cov, chrysops-bairdi.zctl3.cov, and chrysops-bairdi.zfocal.cov

The covariance matric outputs from pcangsd in the Z regions (see Figure 5).

Folder: windowed-fst

Code files:

windowed-fst_bairdi-chrysops.sh

This script calculates FST in 10kb non-overlapping windows between M. m. bairdi and M. o. chrysops.

angsd.source.R

A script that loads the correct libraries and creates functions needed for running the plotting scripts for windowed FST in R.

windowed.fst.and.mds.R

An R script that takes in the angsd windowed FST output plus several other files (provided here) and performs the outlier analysis and plotting. It also plots the MDS results, so come back to it once you have run lostruct.

Data files:

bairdi-chrysops.popgen

Output from angsd that gives the windowed FST values

clustering_windows.csv

A file that gives the 1 Mb windows used to evaluate the Setophaga pairs for outlier overrepresentation. Columns: chromosome - chromosome name, bin_start - start of window, bin_end - end of window

setophaga_1mb_windows.csv

A file that gives the 1 Mb windows that showed an overrepresentation of outliers among any of nine pairs of species in Setophaga. Calculated for Baiz et al. 2021. Columns: chromosome - chromosome name, bin_start - start of window, bin_end - end of window, two sets of 9 columns each listing a pair of species - the first instance of each pair is the number of outlier 10kb windows per 1 Mb bin while the second instance is the either a 1 or 0 depending on whether there are >=8 outliers in the corresponding species pair, sum - summed counts from the second instance of each species pair set.

BBWA-BLPW.fst.csv

More results from Baiz et al. 2021. FST for the bay-breasted vs. blackpoll warbler comparison. Used in this paper just for plotting. Columns: chr - chromosome name, midPos - the center position of each chromosomal window, Nsites - number of sites used in the FST calculation, fst - FST.

mds_allchroms.csv

Final output from lostruct analysis. Used by the plotting script above but located in the lostruct directory. See that directory for details.

Folder: lostruct

Code files:

lostruct_10kb.job

With the main VCF split into chromosomes with bcftools, this sets up and runs the lostruct script in R.

lostruct_10kb.R

R script for running lostruct itself in 10kb windows. This runs separately per chromosome-VCF.

lostruct_plot_all-chroms.R

Outlier and plotting code for MDS results

Data files:

mds_allchroms.csv

The final result of the lostruct analysis, produced by combining each chromosome result. Columns: chrom - chromosome name, start - start of window, end - end of window, V1 - MDS factor 1, V2 - MDS factor 2.

Data and code from: Diversification and divergence in Myioborus warblers: insights into evolutionary relationships and plumage genetics

Data files

Abstract

README: Data and code from: Diversification and divergence in Myioborus warblers: insights into evolutionary relationships and plumage genetics

Folder: data-processing

Code file:

rim-align-sort-mark-index.sh

Folder: variant-calling

Code files:

call-variants_mito.sh

filter_mito.sh

call-variants_example.sh

filter_example.sh

concat.bcfs.sh

Data files:

pops.list

ploidy.list

samples.list

myiob.filt1.mito.vcf.gz

Folder: phylogenies

Code files:

make-fastas_mito.sh

iqtree_mito.sh

make-fastas_uces.sh

phyluce.sh

iqtree_concat.sh

iqtree_gene-trees.sh

treeqmc.sh

iqtree_treeqmc-bls.sh

Data files:

uce-regions-2k.bed

myiob.ind.list

mito.tree

concat.tre

genes.tre

species-nobl.tre

Folder: genomewide-fst

Code files:

plink_make-pgen.sh

plink_fst.sh

Data files

pops.pheno

myiob.fst.summary

Folder: pcas

Code files:

GLs_andes.sh, GLs_andes-noalb.sh, GLs_chrysops-bairdi.sh

pcangsd_andes.sh, pcangsd_andes-noalb.sh

pcangsd_bairdi-chrysops_zregs.sh

Data files:

andes.cov, andes-noalb.cov

chrysops-bairdi.zctl1.beagle.gz, chrysops-bairdi.zctl2.beagle.gz, chrysops-bairdi.zctl3.beagle.gz, and chrysops-bairdi.zfocal.beagle.gz

chrysops-bairdi.zctl1.cov, chrysops-bairdi.zctl2.cov, chrysops-bairdi.zctl3.cov, and chrysops-bairdi.zfocal.cov

Folder: windowed-fst

Code files:

windowed-fst_bairdi-chrysops.sh

angsd.source.R

windowed.fst.and.mds.R

Data files:

bairdi-chrysops.popgen

clustering_windows.csv

setophaga_1mb_windows.csv

BBWA-BLPW.fst.csv

mds_allchroms.csv

Folder: lostruct

Code files:

lostruct_10kb.job

lostruct_10kb.R

lostruct_plot_all-chroms.R

Data files:

mds_allchroms.csv