Replicate avian hybrid zones reveal the progression of genetic and trait introgression through time
Data files
Feb 06, 2026 version files 11.96 GB
-
RaFlam.v1_AllSamples.vcf.gz
667.98 MB
-
README.md
17.81 KB
-
RI301_1.fq.gz
5.42 GB
-
RI301_2.fq.gz
5.51 GB
-
S1_Table.xlsx
49.04 KB
-
S18_Table.txt
3.22 MB
-
UROC_RaFlam.v1_Scaffolded.fasta.gz
336.61 MB
-
UROC_RaFlam.v1_Scaffolded.gff
30.98 MB
Abstract
Replicate hybrid zones between the same pair of taxa provide a unique opportunity to uncover repeatable outcomes of hybridization, which may point to loci and traits under parallel selection and those contributing to reproductive isolation. In addition, replicates allow for the exploration of the factors causing shifts in hybrid zone structure over time. Here we take advantage of a pair of avian taxa that form multiple hybrid zones to assess the predictability of hybridization and explore the progression of trait introgression through evolutionary time. The Lemon-rumped (Ramphocelus flammigerus icteronotus) and Flame-rumped Tanagers (R. f. flammigerus) inhabit the Pacific coast and mid-elevation slopes of the Cauca Valley of Colombia, hybridizing where they encounter each other in low passes across the Andes. We sampled transects along three such geographically separate passes and found that hybrid zones along these transects were formed independently, show parallel patterns of phenotypic divergence across ecological gradients, and have similar demographic histories. We also found parallel patterns of asymmetric introgression of neutral markers from the yellow icteronotus subspecies into the hybrid zone across transects. However, the age of the hybrid zones varied, as did the extent to which geographic and genomic clines are displaced away from environmental transitions into the red flammigerus range. The greatest displacement was in the oldest southern transect, followed by moderate displacement in the middle transect and little to no displacement in the youngest northern transect. Also, the only shared introgression outliers across all three transects were in a genomic region that predicts plumage color and clines for these loci were consistently narrow independent of the age of the hybrid zone, suggesting a role in reproductive isolation maintained over time. Altogether, our analyses of replicate hybrid zones showed that 1) locus-specific introgression is largely stochastic, but the magnitude and directionality of neutral introgression can be predictable if factors that influence major allele frequency dynamics—such as demography—are similar across replicates, and 2) independent of time and local environmental conditions, aspects of the hybrid zone dynamics can be predictable for traits likely involved in reproductive isolation.
Dataset DOI: 10.5061/dryad.zkh1893n9
Description of the data and file structure
This data repository contains the first genome assembly and homology-based annotation for a female Ramphocelus flammigerus icteronotus from Colombia. Additionally, it contains the raw unfiltered vcf files (Variant Call Format) and associated sample table (S1_Table), which is the primary dataset needed to replicate the analyses in the research article “Replicate avian hybrid zones reveal the progression of genetic and trait introgression through time”, Castaño et al. (In review). Details of the methods are in the description of each file. The vcf file includes samples from three hybrid zones between the Flame-rumped Tanager subspecies (R. f. icteronotus and R. f. flammigerus) in the Western Andes of Colombia, samples from *R. f. icteronotus populations in Ecuador and Panama and samples from other species in the genus Ramphocelus loaned from the LSU Museum of Natural Science (S1 Table). The RNA-seq reads used for the annotation are included in this repository. The bash scripts with the code used to perform variant calling with Tassel v5 and all other analysis described in the article are available at https://github.com/mi-castano10/ReplicateHybridZones.
Files and variables
File: UROC_RaFlam.v1_Scaffolded.gff
Description: Homology-based annotation of the R. flammigerus genome. We used RepeatModeler (v.2.0.1) to identify and classify repetitive elements de novo in our assembly and used the output, together with the T. guttata repeat database, as input for RepeatMasker (v.4.0.7) to soft mask the genome (-xsmall). Then we used the software STAR (Spliced Transcripts Alignment to a Reference v2.7.1) to map the RNA-seq reads to the soft-masked genome (Raw RNA-seq reads available in this repository). Finally, we used these mapped RNA-seq reads (-bam) and the annotated proteins from the published genome of T. guttata (NCBI annotation ID: GCF_003957565.2) as evidence input for GeMoMa (v.1.9;), to generate splice site aware homology-based gene predictions. Approximately 14.6% of the genome was contained in repeats and the homology-based annotation recovered a total of 17,128 genes.
1. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A. 2020;117(17): 9451–9457. doi: 10.1073/pnas.1921046117
2. Smit AF, Hubley R, Green P. RepeatMasker. 2015. Available from: http://repeatmasker.org
3. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1): 15–21. doi: 10.1093/bioinformatics/bts635
4. Keilwagen J, Hartung F, Grau J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol Biol. 2019;1962: 161–177. doi: 10.1007/978-1-4939-9173-0_9
File: UROC_RaFlam.v1_Scaffolded.fasta.gz
Description: Near chromosome-scale de novo genome assembly for a member of the genus Ramphocelus using PacBio HiFi long-read sequencing at 100X coverage. We sent a single female R. f. icteronotus blood sample stored in Longmire’s lysis buffer to the DNA sequencing and Genotyping Center of the Delaware Biotechnology Institute at the University of Delaware for library preparation and PacBio HiFi sequencing to produce a high-quality reference genome. High molecular weight DNA was isolated using the Qiagen MagAttract kit (QIAGEN, shanghai, China) following the manufacturer’s instructions. One SMRTbell library of circular consensus sequencing (CCS) was constructed and sequenced on three PacBio Sequel II SMRT cell platforms in CCS mode. This mode calls consensus reads from subreads that are generated after multiple passes of the enzyme around the circularized template. After adapter removal as part of the CCS analysis, sequencing yielded 93.9 Gb of data and ~6.5 million HiFi (high-fidelity) reads for the female R. f. icteronotus sample (available at the NCBI sequence read archive: PENDING).
Prior to assembly we assessed the estimated genome size, rate of heterozygosity and repetitive element content of our genome. We used Meryl (v1.4.1) to count and compute a histogram of k-mer frequencies from the raw adapter trimmed PacBio Hifi reads using count k = 21. We uploaded the output of Meryl to the Galaxy web platform and used the public server at usegalaxy.org to run GenomeScope2.0 with the following parameters: --ploidy 2 –kmer_length 21. The model indicates that R. f. icteronotus has an estimated genome size of 1,164,347 bp, with 0.88% heterozygosity levels and around 20% of the genome composed of repetitive elements (S1 Fig). We generated a phased genome assembly using the -primary flag and default parameters of hifiasm (v0.16.1). The primary assembly was 1,394,387,738bp long, assembled in 347 contigs with an N50 of 71.1 Mb and a GC content of 44.97%. We used Blobtools2 from the BlobToolKit suite [120] to screen the assembly for contamination. First, we performed a BLASTn search of our assembly against the general RefSeq blast -nt database using the following parameters: -outfmt '6 qseqid staxids bitscore std' -max_target_seqs 1 -max_hsps 1 -num_threads 12 -evalue 1e-25. Then we used minimap2 (v2.17) to map the raw reads back to the assembly. We used the function blobtools –add to create a BlobDir database that included the bast output (hits file) and the read coverage (bam file). Lastly, we employed the online viewer (blobtools host ‘pwd’) to make a Blobplot with the hits and respective coverage of our contamination scan and removed 11 contigs that didn’t match the class Aves. After removing spurious contigs, we assembled the mitochondrial genome de novo using MitoHiFi (v3.2) with default parameters and the Black-throated Flowerpiercer (Diglossa brunneiventris; NCBI Accession: NC_062466.1) mitochondrial genome as a reference. The final mitochondrial genome for R. f. icteronotus was 16,784 bp long, with 37 genes. We used BLASTn to identify and remove other mitochondrial contaminant sequences present in the nuclear genome and added the contig with the MitoHiFi mitochondrial genome at the end of the R. f. icteronotus assembly. We evaluated assembly quality and completeness using Benchmarking Universal Single-copy Orthologs (BUSCO v5.8; [124]) with the passeriformes_odb10 database, available in the public server at usegalaxy.org. Finally, to achieve a chromosome-level genome assembly, we scaffolded our assembly using the chromosome-level genomes of the Black-throated Flowerpiercer (Diglossa brunneiventris; NCBI assembly ID: GCA_019023105.1) and the Zebra finch (Taeniopygia guttata; NCBI assembly ID: GCA_003957565.4). We mapped our assembly to the reference genomes using minimap2, and ordered and numbered our contigs first according to identity with D. brunneiventris and then T. guttata for the micro chromosomes that were not present in the D. brunneiventris assembly. We used custom python scripts to assemble the contigs that mapped to the same reference chromosome into scaffolds by adding 100 Ns in between. After contamination screening, mitogenome assembly and reference-guided scaffolding our highly contiguous genome was 1.3 Gb long, contained in 109 scaffolds and 336 contigs. The scaffold and contig N50 were 75.4 and 71.1 Mb respectively, with > 95% of all the sequence contained in the largest 42 scaffolds, corresponding to 39 autosomes, the Z and W chromosomes and the mitochondria.
1. Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21: 245. doi:10.1186/s13059-020-02134-9
2. The Galaxy Community. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Res. 2024;52(W1): W83–W94. doi: 10.1093/nar/gkae410
3. Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020;11(1): 1432. doi: 10.1038/s41467-020-14998-3
4. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2): 170–175. doi: 10.1038/s41592-020-01056-5
5. Challis R, Richards E, Rajan J, Cochrane G, Blaxter M. BlobToolKit – Interactive quality assessment of genome assemblies. G3 (Bethesda). 2020;10(4): 1361–1374. doi: 10.1534/g3.119.400908
6. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(1): 421. doi: 10.1186/1471-2105-10-421
7. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18): 3094–3100. doi: 10.1093/bioinformatics/bty191
8. Uliano-Silva M, Ferreira JGRN, Krasheninnikova K, Blaxter M, Mieszkowska N, Hall N, et al. MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics. 2023;24(1): 288. doi: 10.1186/s12859-023-05385-y
9. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol. 2021;38(10): 4647–4654. doi: 10.1093/molbev/msab199
File: RaFlam.v1_AllSamples.vcf.gz
Description: Raw unfiletered vcf file used for population genomic analyses. We extracted genomic DNA from a total of 252 birds (S1 Table) using Qiagen DNeasy Blood and Tissue Extraction Kit (Qiagen, USA) following the manufacturer’s instructions, with the optional addition of 4 μL of RNase to remove any potential RNA contaminants. We sent the samples for genotype-by-sequencing (GBS) to the University of Wisconsin Biotechnology Center. Samples were digested with the Ape Ki restriction enzyme, and short fragments from throughout the genome were sequenced on an illumina NovaSeq 6000 2 X 150 Shared Sequencing. We sent one 96-well plate with samples from T1 and other species from the genus Ramphocelus for sequencing in 2021, and two plates with the rest of the samples for T2 and T3 in 2023. The 2021 plate yielded 627 million sequences, and the 2023 plates for T2 and T3 yielded 550 and 523 million sequences, respectively. We used Tassel (v5.2.65) to process the raw reads from the three plates combined, which included samples from our three transects, yellow icteronotus samples from Ecuador and Panama, and samples from other species in the genus Ramphocelus (S1 Table). First, we demultiplexed the reads and called variants following the GBSv2 SNP discovery and production pipeline. We aligned the short reads to our reference genome assembly for a female R. f. icteronotus with Bowtie2 and we used the default parameter settings for the rest of the pipeline, except for the minimum quality score (default mnQS = 0), which we increased to 20. After read demultiplexing, mapping and variant calling with tassel, we obtained 2,598,209 genome-wide SNPs for 254 individuals (Tassel script is available at https://github.com/mi-castano10/ReplicateHybridZones).
1. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, et al. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One. 2011;6(5): e19379. doi: 10.1371/journal.pone.0019379
1. Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, et al. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One. 2014;9(2): e90346. doi: 10.1371/journal.pone.0090346
2. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4): 357–359. doi: 10.1038/nmeth.1923
File: RI301_1.fq.gz
Description: Raw forward RNA-seq reads for a female R. f. icteronotus. We extracted RNA from 5 emerging feather follicles of the same female for which we constructed the de novo reference assembly using the Monarch Total RNA Miniprep Kit (New England Biolabs, USA). RNA samples were received and validated via standard quality control procedures at Novogene Co. Ltd. Sacramento, CA. Sequencing was performed in a NovaSeq PE 150 platform, using an mRNA library preparation with standard poly A enrichment, which generated 18 Gb of raw data to complement the genome annotation.
File: RI301_2.fq.gz
Description: Raw reverse RNA-seq reads for a female R. f. icteronotus. We extracted RNA from 5 emerging feather follicles of the same female for which we constructed the de novo reference assembly using the Monarch Total RNA Miniprep Kit (New England Biolabs, USA). RNA samples were received and validated via standard quality control procedures at Novogene Co. Ltd. Sacramento, CA. Sequencing was performed in a NovaSeq PE 150 platform, using an mRNA library preparation with standard poly A enrichment, which generated 18 Gb of raw data to complement the genome annotation.
File: S18_Table.txt
Description: Results from a genome-wide association study (GWAS) with individuals from all transects combined, to identify regions of the genome associated with rump color (hue; H4). We used GEMMA (v0.98.5) to fit Bayesian sparse linear mixed models to the data accounting for population structure and relatedness. We implemented the Wald test to evaluate association significance and applied a Bonferroni correction for multiple comparisons accounting for the total number of sites to calculate the significance threshold (alpha < 1.115 × 10-6).
Variables
- chr: Chromosome — the chromosome where the SNP is located.
- rs: SNP ID — usually the SNP name or identifier (e.g., rsID, variant name).
- ps: Position — physical base-pair position of the SNP on the chromosome.
- n_miss: Number of missing genotypes for that SNP across all individuals.
- allele1: Reference allele (or major/minor depending on encoding).
- allele0: Alternative allele — the other allele at that SNP.
- af: Allele frequency — frequency of
allele1in your dataset. - beta: Effect size — estimated effect size (regression coefficient) of the SNP on the trait.
- se: Standard error of the
betacoefficient. - logl_H1: Log likelihood under alternative hypothesis (SNP has an effect).
- l_remle: Log-likelihood using REML (Restricted Maximum Likelihood) — used for estimating variance components.
- l_mle: Log-likelihood using MLE (Maximum Likelihood) — used for model fitting.
- p_wald: P-value from the Wald test — tests whether
betais significantly different from 0. - p_lrt: P-value from the Likelihood Ratio Test — compares models with and without the SNP effect.
- p_score: P-value from the Score test — a test similar to the LRT but often faster to compute.
- Zhou, X., Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet 44, 821–824 (2012). https://doi.org/10.1038/ng.2310
File: S1_Table.xlsx
Description: Metadata for samples included in the study.
Variables
- Museum: The code of the museum where samples were deposited. ANDES: Museo de Historia Natural de la Universidad de los Andes, LSUMZ: Luisiana State University Museum of Natural Science.
- Skin ID: ID number at the museum collection where the skin was deposited. *NA means the individual only has blood samples available because it was not collected.
- Tissue ID: ID number at the museum collection where blood was deposited.
- Field Number: Name of the sample in field datasheets.
- GBS_ID: Name of the sample in the VCF file.
- Genus
- Species
- Subspecies
- Sex: Sex of the bird. M - Male, F - Female, U - Unknown.
- Age: Age of the bird when sample was collected. 1 - Unknown, 2 - Fledgling, 3 - Juvenile, 4 - Immature, 5 - Adult.
- Transect: Transect number if sampled as part of the replicate hybrid zones study.
- Locality: Locality name along the transect if sampled as part of the replicate hybrid zones study. *Letters followed by .1 represent admixed birds in transect 3 that potentially belong to a 4th hybrid zone to the northeast.
- Country
- Municipality: Municipality in Colombia
- Location: Name of the land or specific area in which the sample was collected.
- Elevation: Elevation in meters.
- site.lat: Latitude
- site.long: Longitude
- Wing.cord: Wing cord length in millimeters (mm).
- Tail.length: Length of the longest tail feather in millimeters (mm).
- Tarsus.length: Tarsus length (to the elbow) in millimeters (mm).
- Bill.length: Bill length (to the nostrils) in millimeters (mm).
- Bill.width: Bill width (at the base) in millimeters (mm).
- Bill.height: Bill height (at the base) in millimeters (mm).
- Weight: Weight of the bird in grams (g).
All the scripts used to conduct the analysis are available at: https://github.com/mi-castano10/ReplicateHybridZones
Access information
Other publicly accessible locations of the data:
