Data and code from: A colorful legacy of hybridization in wood-warblers includes frequent sharing of carotenoid genes among species and genera
Data files
Nov 14, 2025 version files (318.11 MB)
- data-and-code.zip (318.10 MB)
- README.md (8.96 KB)
Abstract
Introgression between species has the potential to shape evolutionary trajectories in important ways, but uncovering complex introgression dynamics has only recently been made possible by advances in genomics. Warblers of the avian family Parulidae exemplify rapid diversification and sexual trait divergence, and we endeavored to study historical introgression in the family. We sequenced multiple genomes of nearly every species, constructed a phylogeny for the family, and investigated gene flow across the genome and at genes known for controlling feather color. The dataset and code presented here and in GenBank are intended for reproduction of all analyses, including processing reads, making the phylogeny, and several methods for investigating gene flow at various genes and in 10-kb windows across the genome.
Dataset DOI: 10.5061/dryad.1zcrjdg3v
Description of the data and file structure
The dataset consists of a main folder, data-and-code.zip, containing the five subfolders described below. Each subfolder is divided into Data and Code.
Process-data
Data
- bam-locations.example.txt - Example of a file to feed into merge-species-bams.sh to merge individual sample bam files into species bam files.
- align-rate.txt - Read alignment rate output from bowtie log files compiled in one place. Used to make Fig. S8.
Code
- trim-align-mark-index.sh - Code that takes demultiplexed raw reads and a reference genome, then trims adapters, aligns to the reference, marks duplicates, and indexes the resulting bam file.
- merge-species-bams.sh - Code that merges individual sample bam files into species bam files.
- call-variants.sh - Code that makes a (very large) unfiltered VCF file of SNPs, indels, and invariant sites from the species bams.
- filter.sh - Filtering code that applies basic QUAL, GQ, and DP (low and high) filters. Sites or genotypes that fail a filter are recoded as uncalled, so all sites are retained. The output is the final master VCF file used in multiple other analyses.
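The recode-instead-of-drop behavior described for filter.sh can be pictured with a minimal Python sketch. This is not the authors' script: the function names, the tuple layout, and every threshold value (min_qual, min_gq, min_dp, max_dp) are assumptions chosen only for illustration.

```python
UNCALLED = "./."  # conventional VCF notation for a missing diploid genotype


def recode_genotype(gt, gq, dp, min_gq=20, min_dp=3, max_dp=100):
    """Return the genotype unchanged if it passes, otherwise mark it uncalled.

    All thresholds are hypothetical placeholders, not the values used in filter.sh.
    """
    if gq is None or dp is None:
        return UNCALLED
    if gq < min_gq or dp < min_dp or dp > max_dp:
        return UNCALLED
    return gt


def filter_site(qual, genotypes, min_qual=30, **thresholds):
    """Filter one site; the site itself is always kept.

    genotypes is a list of (GT, GQ, DP) tuples, one per species.
    A site failing the QUAL filter has every genotype recoded as uncalled.
    """
    if qual is None or qual < min_qual:
        return [UNCALLED] * len(genotypes)
    return [recode_genotype(gt, gq, dp, **thresholds) for gt, gq, dp in genotypes]


# Hypothetical site with four species: one passes, three fail (low GQ, high DP, low DP).
example_site = [("0/1", 45, 12), ("1/1", 10, 12), ("0/0", 60, 250), ("0/0", 60, 1)]
filtered = filter_site(50.0, example_site)
print(filtered)
```

Because the output list always has the same length as the input, downstream tools see every site, including invariant and fully uncalled ones.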
Phylogeny
Data
- allspecies.list - A list of species names. Used in several scripts.
- uce-regions.bed - A bed file containing the starting and ending point of each 2-kilobase UCE locus.
- subset-taxa.list - A list of species used to make the subset tree, described in Code below.
- allspecies.fa.gz - A gzipped fasta file containing UCE loci from each species. This is the input file for running phyluce-alltaxa-concat.sh and the output from make-uce-fastas.sh.
- concat.tre - The final concatenated UCE tree, used in discordance analyses. The manuscript text describes the procedure for making this tree ultrametric for plotting in Fig. 1.
- uce-trees.tre - A file containing each UCE locus tree. This is used as the input for TREE-QMC.
- treeqmc.no-bl.tre - The output from TREE-QMC with no branch lengths.
- treeqmc.tre - The output from iqtree-treeqmc-bl.sh, which uses the concatenated alignment to estimate branch lengths constrained on the TREE-QMC topology.
- subset-inds.list - List of samples used in the separate-individual vs. combined-individual comparison (Fig. S9).
- subset-inds.topology.tre - Topology to use for separate-inds analysis, based on Fig. 1 topology to allow for comparison.
- subset-inds.concat.treefile - Output from iqtree_subset-inds.sh, which makes the separate-individual tree to compare against the Fig. 1 tree (see Fig. S9).
- divergence-times.txt - Parulidae crown age estimates from three previous studies and relative node ages of particular nodes from this study. Used in Fig. S3.
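As a quick sanity check on a file like uce-regions.bed, the sketch below parses BED intervals and reports locus lengths. It assumes the standard BED convention (0-based start, exclusive end, so length is end minus start) and that every locus spans exactly 2,000 bp; the coordinates shown are made up and are not taken from the dataset.

```python
def parse_bed(lines):
    """Parse the first three BED columns (chrom, start, end) from an iterable of lines."""
    loci = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        loci.append((chrom, int(start), int(end)))
    return loci


def locus_lengths(loci):
    """BED intervals are half-open, so the length is simply end - start."""
    return [end - start for _, start, end in loci]


# Hypothetical intervals for illustration only.
example_lines = ["chr1\t10000\t12000", "chr2\t500\t2500"]
loci = parse_bed(example_lines)
lengths = locus_lengths(loci)
print(lengths)
```

To run it on the real file, pass an open file handle in place of example_lines.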
Code
- make-uce-fastas.sh - Code that takes in the master filtered VCF and uce-regions.bed and makes fasta files containing the UCE loci for each species.
- phyluce-alltaxa-concat.sh - Phyluce scripts that align UCE loci, do internal trimming with Gblocks, evaluate missingness, and concatenate alignments.
- iqtree-subsettaxa-concat.sh - IQ-Tree script that uses the model finder option to evaluate the best fitting model of sequence evolution for a subset of taxa (for computational feasibility). The code to generate the concatenated subset alignment is the same as the alltaxa code above. The taxa used in this analysis can be found in subset-taxa.list.
- iqtree-alltaxa-separate.sh - This code runs IQ-Tree on each UCE locus separately, generating individual locus trees found in uce-trees.tre.
- iqtree-alltaxa-concat.sh - This is the IQ-Tree code that generates the final concatenated phylogeny from the output of phyluce-alltaxa-concat.sh.
- treeqmc.sh - This code runs the program TREE-QMC, a program that estimates a species tree from multiple gene trees. The output is treeqmc.no-bl.tre, which has no branch lengths.
- iqtree-treeqmc-bl.sh - This code runs IQ-Tree using the concatenated UCE alignment but constrained to the TREE-QMC topology found in treeqmc.no-bl.tre. The output is treeqmc.tre.
- call-filter_subset-inds.sh - This is the first step of the separate-individuals vs. combined-individuals comparison (results shown in Fig. S9). It calls variants, then filters and indexes the VCF file. Individuals used are listed in subset-inds.list.
- make-uce-fastas_subset-inds.sh - Uses VCF output of call-filter_subset-inds.sh to make UCE fastas, which will be the Phyluce input. Output fastas just need to be concatenated.
- phyluce_subset-inds_concat.sh - Runs Phyluce to make the concatenated alignment for IQ-Tree. Again, this is just for the Fig. S9 analysis.
- iqtree_subset-inds.sh - Makes the separate-individuals tree to compare against the combined-individuals tree from Fig. 1. Takes in the output from Phyluce and subset-inds.topology.tre. Output is subset-inds.concat.treefile.
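The concatenation and missingness steps that the Phyluce scripts perform can be illustrated conceptually as follows. This is a simplified sketch, not Phyluce itself: the function names are hypothetical, each locus alignment is represented as a plain dictionary, and padding absent taxa with "?" is an assumption made here for illustration.

```python
def concatenate(alignments, taxa, missing="?"):
    """Join per-locus alignments into one supermatrix.

    alignments: non-empty list of dicts mapping taxon name -> aligned sequence.
    A taxon absent from a locus is padded with the missing character.
    """
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))  # all sequences in a locus share a length
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(seqs) for taxon, seqs in parts.items()}


def missingness(seq, missing_chars="?-Nn"):
    """Fraction of positions that are gaps, Ns, or padding."""
    return sum(base in missing_chars for base in seq) / len(seq)


# Two toy loci; sp3 is absent from the first one.
locus_a = {"sp1": "ACGT", "sp2": "ACGA"}
locus_b = {"sp1": "GG--", "sp2": "GGTT", "sp3": "GGTA"}
supermatrix = concatenate([locus_a, locus_b], ["sp1", "sp2", "sp3"])
for taxon, seq in supermatrix.items():
    print(taxon, seq, missingness(seq))
```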
Gene-trees
Data
- bco2-1.8kb.aligned.fa - Alignment file generated by align-bco2-1.8kb.sh.
- bco2-1.8kb.tre - All-taxa gene tree of the BCO2 introgression region. This is the BCO2 tree in Fig. 2A.
- redtaxa.list - A list of taxa used in the CYP2J19 and BDH1L gene trees. Red Myioborus miniatus miniatus was not split out as its own taxon in the master VCF file, and regenerating that file would take many days of running code, so we made smaller red-subset VCF files using this list.
- redtaxa.bdh1l-11kb.aligned.fa - The BDH1L alignment used to make the gene tree.
- redtaxa.cyp2j19-37kb.aligned.fa - The CYP2J19 alignment used to make the gene tree.
- bdh1l-11kb.tre - The red taxa gene tree of the BDH1L introgression region.
- cyp2j19-37kb.tre - The red taxa gene tree of the CYP2J19 introgression region.
Code
- align-bco2-1.8kb.sh - Code that makes species-fastas of the BCO2 introgression region and aligns them.
- iqtree-bco2-1.8kb.sh - IQ-Tree run to make the BCO2 introgression region gene tree.
- call-variants_redtaxa.sh - Code that makes smaller versions of the master VCF file, using the taxa of interest (from redtaxa.list, including splitting out red and yellow M. miniatus miniatus) and regions of interest.
- filter_redtaxa.sh - Code that reproduces the master VCF filtering but using the new red-taxa and red gene region-specific VCF files.
- align-bdh1l-11kb.sh - Code that makes species-fastas of the BDH1L introgression region and aligns them.
- align-cyp2j19-37kb.sh - Code that makes species-fastas of the CYP2J19 introgression region and aligns them.
- iqtree-bdh1l-11kb.sh - IQ-Tree run to make the BDH1L introgression region gene tree.
- iqtree-cyp2j19-37kb.sh - IQ-Tree run to make the CYP2J19 introgression region gene tree.
Introgression
Data
- 10 files with the filename format [species]-[species].ab - These are output files from the fd analysis runs, with the presumed "donor" species name first and "recipient" species name second.
- p1-p2-p3-outgroups.txt - A file laying out how the fd analysis runs were set up. The meaning of P1, P2, P3, and outgroup can be found here: https://doi.org/10.1093/molbev/msu269. The taxa in the table can be substituted for the examples under -P1, -P2, -P3, and -O in ab-windows.example.sh.
Code
- call-variants.example.sh - An example script to call variants for particular combinations of species unique to each fd analysis comparison. The location of the bam files for each individual sample from the species listed in p1-p2-p3-outgroups.txt should be supplied as [comparison].bamlist.
- parse-vcf.example.sh - An example script that converts the VCF generated in call-variants.example.sh to a .geno file (see https://github.com/simonhmartin/genomics_general).
- ab-windows.example.sh - An example script for running ABBABABAwindows.py from the above GitHub repository. Each Manhattan-style plot in Fig. 3 and Fig. 5 plots data generated from this script.
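For readers unfamiliar with the statistics behind the windowed analysis, here is a minimal Python sketch of D and fd computed from derived-allele frequencies, following the definitions in the paper linked above. It is not ABBABABAwindows.py; the function names and the example frequencies are assumptions for illustration, and the real script additionally handles windowing, missing data, and site filtering.

```python
def abba_baba(p1, p2, p3, po=0.0):
    """Expected ABBA and BABA pattern weights at one site.

    Arguments are derived-allele frequencies in P1, P2, P3, and the outgroup.
    """
    abba = (1 - p1) * p2 * p3 * (1 - po)
    baba = p1 * (1 - p2) * p3 * (1 - po)
    return abba, baba


def d_and_fd(sites):
    """Compute D and fd over an iterable of (p1, p2, p3, po) tuples (one window)."""
    numerator = d_denominator = fd_denominator = 0.0
    for p1, p2, p3, po in sites:
        abba, baba = abba_baba(p1, p2, p3, po)
        # For fd, the "donor" frequency is the larger of P2 and P3 and replaces both.
        pd = max(p2, p3)
        abba_d, baba_d = abba_baba(p1, pd, pd, po)
        numerator += abba - baba
        d_denominator += abba + baba
        fd_denominator += abba_d - baba_d
    d = numerator / d_denominator if d_denominator else float("nan")
    fd = numerator / fd_denominator if fd_denominator else float("nan")
    return d, fd


# Hypothetical single-site window: derived allele absent in P1, at 0.5 in P2, fixed in P3.
d_stat, fd_stat = d_and_fd([(0.0, 0.5, 1.0, 0.0)])
print(d_stat, fd_stat)
```

In this toy window D is at its maximum while fd is one half, which illustrates why fd is the more informative quantity for the proportion of introgressed ancestry in small windows.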
D-Suite
Data
- leio-seto.set - Samples to use for the Leiothlypis-Setophaga-Cardellina subset.
- verm-geo.set - Samples to use for the Vermivora-Geothlypis subset.
- card-myiob.set - Samples to use for the Cardellina-Myioborus subset.
- leio-seto.tree - Provided to dtrios.
- verm-geo.tree - Provided to dtrios.
- card-myiob.tree - Provided to dtrios.
- leio-seto_BBAA.txt - dtrios output.
- verm-geo_BBAA.txt - dtrios output.
- card-myiob_BBAA.txt - dtrios output.
Code
- call-variants_leio-seto.sh - Calls variants with a subset of samples for use in D-Suite. Just change the bamlist and output file name to change the subset of samples. Which samples to use for each subset can be found in leio-seto.set, verm-geo.set, and card-myiob.set.
- dtrios_leio-set.sh - Runs D-Suite (dtrios) on the leio-seto subset. Repeat for the other two subsets, changing the .set and .tree files and the output name. The result is leio-seto_BBAA.txt.
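Before running dtrios on a new subset, it can help to check the sample-to-species mapping. The sketch below assumes the .set files follow the two-column, tab-separated layout that D-Suite documents for its SETS input (sample name, then species name, with the keyword Outgroup marking outgroup samples and xxx marking samples to ignore). The function names and sample names are hypothetical.

```python
from collections import defaultdict


def parse_sets(text):
    """Group sample names by species from tab-separated 'sample<TAB>species' lines."""
    groups = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        sample, species = line.split("\t")
        groups[species].append(sample)
    return dict(groups)


def ingroup_sizes(groups):
    """Return samples per ingroup species, requiring at least one Outgroup sample."""
    if "Outgroup" not in groups:
        raise ValueError("no sample is labeled Outgroup")
    return {sp: len(samples) for sp, samples in groups.items()
            if sp not in ("Outgroup", "xxx")}


# Hypothetical mapping for illustration only.
example_text = "sampleA\tspeciesA\nsampleB\tspeciesB\nsampleC\tspeciesB\nsampleD\tOutgroup\n"
groups = parse_sets(example_text)
sizes = ingroup_sizes(groups)
print(sizes)
```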
