Data and code from: The distribution and dispersal of large haploblocks in a superspecies
Data files
Mar 19, 2025 version; files total 46.22 GB
- GW_all4plates.Fst_groups.txt (13.72 KB)
- GW_all4plates.Fst_groups.with_dates_etc.txt (33.76 KB)
- GW_haploblocks_processing_scripts_Irwinetal2025.txt (47.13 KB)
- GW2022_all4plates.genotypes.allSites.chrgw3.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.indv (6.33 KB)
- GW2022_all4plates.genotypes.allSites.chrgw3.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.pos (24.38 MB)
- GW2022_all4plates.genotypes.allSites.chrgw3.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012minus1 (1.29 GB)
- GW2022_all4plates.genotypes.allSites.chrgw4A.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.indv (6.33 KB)
- GW2022_all4plates.genotypes.allSites.chrgw4A.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.pos (4.70 MB)
- GW2022_all4plates.genotypes.allSites.chrgw4A.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012minus1 (252.59 MB)
- GW2022_all4plates.genotypes.SNPs_only.chrgw2.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.vcf.idepth (10.98 KB)
- GW2022_all4plates.genotypes.SNPs_only.chrgwZ.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.vcf.idepth (10.96 KB)
- GW2022_all4plates.genotypes.SNPs_only.whole_genome.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.tab.012.indv (6.33 KB)
- GW2022_all4plates.genotypes.SNPs_only.whole_genome.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.tab.012.pos (31.96 MB)
- GW2022_all4plates.genotypes.SNPs_only.whole_genome.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.tab.012minus1 (1.69 GB)
- GW2022_all4plates.genotypes.SNPs_only.whole_genome.vcf (41.62 GB)
- GW2022ref.fa (1.31 GB)
- README.md (8.88 KB)
Abstract
Haploblocks are regions of the genome that coalesce to an ancestor as a single unit. Differentiated haplotypes in these regions can result from the accumulation of mutational differences in low-recombination chromosomal regions, especially when selective sweeps occur within geographically structured populations. We introduce a method to identify large well-differentiated haploblock regions (LHBRs), based on the variance in standardized heterozygosity (ViSHet) of single nucleotide polymorphism (SNP) genotypes among individuals, calculated across a genomic region (500 SNPs in our case). We apply this method to the greenish warbler (Phylloscopus trochiloides) ring species, using a newly assembled reference genome and genotypes at more than 1 million SNPs among 257 individuals. Most chromosomes carry a single distinctive LHBR, containing 4-6 distinct haplotypes that are associated with geography, enabling detection of hybridization events and transition zones between taxa. LHBRs have exceptionally low within-haplotype nucleotide variation and moderately low between-haplotype nucleotide distance, suggesting their establishment through recurrent selective sweeps at varying geographic scales. Meiotic drive is potentially a powerful mechanism of producing such selective sweeps, and the LHBRs are likely to often represent centromeric regions where recombination is restricted. Links between populations enable introgression of favored haplotypes and we identify one haploblock showing a highly discordant distribution compared to the rest of the genome, being present in two distantly separated geographic regions that are at similar latitudes in both east and west Asia. Our results set the stage for detailed studies of haploblocks, including their genomic location, gene content, and contribution to reproductive isolation.
https://doi.org/10.5061/dryad.8w9ghx3xr
Description of the data and file structure
This dataset is associated with this publication, which explains the research that produced the data:
Irwin, D., S. Bensch, C. Charlebois, G. David, A. Geraldes, S.K. Gupta, B. Harr, P. Holt, J.H. Irwin, V.V. Ivanitskii, I.M. Marova, Y. Niu, S. Seneviratne, A. Singh, Y. Wu, S. Zhang, T.D. Price. The distribution and dispersal of large haploblocks in a superspecies. Molecular Ecology, in press.
Files and variables
File: GW_haploblocks_processing_scripts_Irwinetal2025.txt
Description: This text file contains the annotated code used in mapping GBS reads to the reference genome and calling genotypes of individuals.
File: GW2022ref.fa
Description: This fasta file contains the new Phylloscopus trochiloides reference genome in the format used in the analysis described in the script file above.
File: GW2022_all4plates.genotypes.SNPs_only.whole_genome.vcf
Description: This file contains genotypic information from 310 individual genotyping-by-sequencing (GBS) runs and more than 5.5 million sites in the greenish warbler genome. The file is in standard Variant Call Format (VCF). This file contains all variant sites, prior to filtering. See the associated script file for details regarding how it was produced.
File: GW2022_all4plates.genotypes.SNPs_only.whole_genome.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.tab.012minus1
Description: This file contains genotypes from 310 individual genotyping-by-sequencing (GBS) runs at 2,431,709 sites, after filtering to retain biallelic single nucleotide polymorphisms (SNPs) with less than 60% of individuals having missing genotypes, mapping quality (MQ) of at least 20, and heterozygosity lower than 60. The format is a tab-delimited text file containing a matrix in which rows correspond to individual runs and columns correspond to sites in the genome. In each cell, 0 represents homozygous reference, 1 represents heterozygous, 2 represents homozygous alternate, and -1 represents a missing genotype.
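As a minimal sketch of how this genotype encoding can be summarized (written in Python, independent of the authors' Julia scripts; the function name site_summary and the toy matrix are hypothetical), the following computes, for each site, the fraction of missing genotypes and the proportion of heterozygotes among called genotypes:

```python
from typing import List, Tuple

MISSING = -1  # code for a missing genotype, as described above
HET = 1       # code for a heterozygous genotype


def site_summary(matrix: List[List[int]]) -> List[Tuple[float, float]]:
    """Return (missing_fraction, heterozygosity) for each column (site).

    `matrix` has one row per individual run and one column per site.
    Heterozygosity is computed among non-missing genotypes only and is
    float('nan') when every genotype at the site is missing.
    """
    n_rows = len(matrix)
    n_sites = len(matrix[0]) if matrix else 0
    out = []
    for j in range(n_sites):
        column = [row[j] for row in matrix]
        called = [g for g in column if g != MISSING]
        missing_fraction = (n_rows - len(called)) / n_rows
        if called:
            het = sum(1 for g in called if g == HET) / len(called)
        else:
            het = float("nan")
        out.append((missing_fraction, het))
    return out


# Toy data: 4 runs x 3 sites (hypothetical values, not from the dataset).
toy = [
    [0, 1, -1],
    [2, 1, -1],
    [-1, 0, -1],
    [0, 2, -1],
]
summary = site_summary(toy)
```

In this toy matrix the first site has one missing genotype out of four and no heterozygotes, the second has no missing genotypes and half heterozygotes, and the third is entirely missing.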
File: GW2022_all4plates.genotypes.SNPs_only.whole_genome.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.tab.012.indv
Description: This text file contains the individual run IDs corresponding to the rows of the file above (the one ending in 012minus1).
File: GW2022_all4plates.genotypes.SNPs_only.whole_genome.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.tab.012.pos
Description: This tab-delimited text file contains the genomic coordinates of the sites corresponding to the columns of the file above (the one ending in 012minus1). There are two columns, with the first indicating the scaffold name and the second indicating the base number on that scaffold.
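The relationship among the three files (genotype matrix, row IDs, column positions) can be sketched as follows. This is an illustrative Python reader, not the authors' Julia code. All function names are hypothetical, and the scaffold names and run IDs in the toy data are invented. Some genotype-matrix exports begin each row with an index column; whether these files do is not stated above, so the sketch exposes a hypothetical drop_first_column switch rather than assuming either way.

```python
import io
from typing import List, TextIO, Tuple


def read_indv(handle: TextIO) -> List[str]:
    """Read run IDs, one per non-empty line."""
    return [line.strip() for line in handle if line.strip()]


def read_pos(handle: TextIO) -> List[Tuple[str, int]]:
    """Read (scaffold, base) pairs from a two-column tab-delimited file."""
    positions = []
    for line in handle:
        if line.strip():
            scaffold, base = line.rstrip("\n").split("\t")
            positions.append((scaffold, int(base)))
    return positions


def read_genotypes(handle: TextIO,
                   drop_first_column: bool = False) -> List[List[int]]:
    """Read the tab-delimited matrix of 0 / 1 / 2 / -1 genotype codes."""
    rows = []
    for line in handle:
        if line.strip():
            fields = line.rstrip("\n").split("\t")
            if drop_first_column:  # assumption: optional leading index column
                fields = fields[1:]
            rows.append([int(x) for x in fields])
    return rows


def load_dataset(geno: TextIO, indv: TextIO, pos: TextIO,
                 drop_first_column: bool = False):
    """Load the three files and check that their dimensions agree."""
    ids = read_indv(indv)
    positions = read_pos(pos)
    genotypes = read_genotypes(geno, drop_first_column)
    if len(genotypes) != len(ids):
        raise ValueError("number of matrix rows does not match number of IDs")
    if any(len(row) != len(positions) for row in genotypes):
        raise ValueError("number of matrix columns does not match positions")
    return ids, positions, genotypes


# Toy stand-ins for the three files (invented values).
geno_text = "0\t1\t-1\n2\t0\t1\n"
indv_text = "run_A\nrun_B\n"
pos_text = "gw3\t101\ngw3\t250\ngw4A\t77\n"

ids, positions, genotypes = load_dataset(
    io.StringIO(geno_text), io.StringIO(indv_text), io.StringIO(pos_text)
)
```

For the full-size files (over 1 GB), nested Python lists would likely be too memory-hungry; an array library or a streaming approach would be a more practical choice.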
File: GW_all4plates.Fst_groups.txt
Description: This file provides metadata for the samples used in the 310 individual GBS runs. Columns are: ID (the ID of the sample run), location (the location code of the sample), group (the group identifier of each sample), Fst_group (the Fst_group identifier of each sample), and plot_order (which determines the ordering of samples in genotype-by-individual plots). This file is used in the Julia scripts that process the data.
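As a hedged illustration of how plot_order might be used to arrange genotype rows, the Python sketch below reads the metadata and returns row indices sorted by plot_order. It assumes a tab-delimited file with a header row using exactly the column names listed above and numeric plot_order values; neither assumption is confirmed here. The function names and toy records are hypothetical, and this is not the authors' Julia implementation.

```python
import csv
import io
from typing import Dict, List


def read_metadata(handle) -> List[Dict[str, str]]:
    """Read metadata records (assumed tab-delimited with a header row)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def rows_in_plot_order(metadata: List[Dict[str, str]],
                       run_ids: List[str]) -> List[int]:
    """Return indices into `run_ids`, sorted by each run's plot_order.

    Runs absent from the metadata, and records whose plot_order is not
    numeric, are skipped.
    """
    index_of = {run_id: i for i, run_id in enumerate(run_ids)}
    keyed = []
    for record in metadata:
        try:
            order = float(record["plot_order"])
        except (TypeError, ValueError):
            continue
        if record["ID"] in index_of:
            keyed.append((order, index_of[record["ID"]]))
    return [i for _, i in sorted(keyed)]


# Toy metadata (invented values).
meta_text = (
    "ID\tlocation\tgroup\tFst_group\tplot_order\n"
    "run_Z\tLOC9\tg9\tf9\t0\n"
    "run_B\tLOC2\tg2\tf2\t1\n"
    "run_A\tLOC1\tg1\tf1\t2\n"
)
metadata = read_metadata(io.StringIO(meta_text))
run_ids = ["run_A", "run_B", "run_C"]  # e.g. the order of rows in an .indv file
order = rows_in_plot_order(metadata, run_ids)
```

Here run_Z is ignored because it has no genotype row, and run_C is omitted because it has no metadata record.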
File: GW_all4plates.Fst_groups.with_dates_etc.txt
Description: This file is like the GW_all4plates.Fst_groups.txt file above, but with several columns added that indicate additional information about the samples: previous_pub lists if the sample is new with this 2025 publication or lists previous publications that used information from the sample; sampling_date provides the date of sampling; otherID and otherID2 provide other identifiers for the individual samples (e.g. band numbers or color combinations); providers_and_notes provides that information when available.
File: GW2022_all4plates.genotypes.SNPs_only.chrgwZ.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.vcf.idepth
Description: This file contains read depths of Z chromosome markers (used in the Julia scripts).
File: GW2022_all4plates.genotypes.SNPs_only.chrgw2.max2allele_noindel.vcf.maxmiss60.MQ20.lowHet.vcf.idepth
Description: This file contains read depths of chromosome 2 markers (used in the Julia scripts).
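The two depth files can be compared per individual, as in the Python sketch below. It assumes the files follow the tab-delimited layout that vcftools writes for its --depth option (header INDV, N_SITES, MEAN_DEPTH), which the .idepth extension suggests but which is not confirmed here. The description above says only that the files are used in the Julia scripts; taking the ratio of Z-linked to chromosome 2 depth is an illustrative assumption, and the function names and toy values are hypothetical.

```python
import csv
import io
from typing import Dict


def read_idepth(handle) -> Dict[str, float]:
    """Map each individual to its mean depth (assumed vcftools .idepth layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["INDV"]: float(row["MEAN_DEPTH"]) for row in reader}


def depth_ratio(z_depth: Dict[str, float],
                autosome_depth: Dict[str, float]) -> Dict[str, float]:
    """Return Z / autosome mean-depth ratio for individuals present in both.

    Individuals with zero (or non-positive) autosomal depth are skipped.
    """
    return {
        ind: z_depth[ind] / autosome_depth[ind]
        for ind in z_depth
        if ind in autosome_depth and autosome_depth[ind] > 0
    }


# Toy stand-ins for the chrgwZ and chrgw2 files (invented values).
z_text = "INDV\tN_SITES\tMEAN_DEPTH\nrun_A\t100\t10.0\nrun_B\t100\t5.0\n"
a_text = "INDV\tN_SITES\tMEAN_DEPTH\nrun_A\t900\t10.0\nrun_B\t900\t10.0\n"

ratios = depth_ratio(read_idepth(io.StringIO(z_text)),
                     read_idepth(io.StringIO(a_text)))
```

A ratio near 1 versus near 0.5 would be consistent with two versus one copy of the Z chromosome, but any such interpretation should be checked against the authors' Z chromosome analysis script.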
File: GW2022_all4plates.genotypes.allSites.chrgw3.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012minus1
Description: This file contains genotypes for all sequenced sites on chromosome 3 (see the end of this page for its use: https://darreni.github.io/GreenishWarblerGenomics2025/GW_Heterozygosity_Variance.html).
File: GW2022_all4plates.genotypes.allSites.chrgw3.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.indv
Description: This file contains the individual run IDs corresponding to the rows of the chromosome 3 file above (the one ending in 012minus1).
File: GW2022_all4plates.genotypes.allSites.chrgw3.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.pos
Description: This tab-delimited text file contains the genomic coordinates of the sites corresponding to the columns of the chromosome 3 file above (the one ending in 012minus1). There are two columns, with the first indicating the scaffold name and the second indicating the base number on that scaffold.
File: GW2022_all4plates.genotypes.allSites.chrgw4A.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012minus1
Description: This file contains genotypes for all sequenced sites on chromosome 4A (see the end of this page for its use: https://darreni.github.io/GreenishWarblerGenomics2025/GW_Heterozygosity_Variance.html).
File: GW2022_all4plates.genotypes.allSites.chrgw4A.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.indv
Description: This file contains the individual run IDs corresponding to the rows of the chromosome 4A file above (the one ending in 012minus1).
File: GW2022_all4plates.genotypes.allSites.chrgw4A.infoSites.max2allele_noindel.maxmiss60.MQ20.lowHet.tab.012.pos
Description: This tab-delimited text file contains the genomic coordinates of the sites corresponding to the columns of the chromosome 4A file above (the one ending in 012minus1). There are two columns, with the first indicating the scaffold name and the second indicating the base number on that scaffold.
Code/software
Processing of the sequencing reads was done according to the file GW_haploblocks_processing_scripts_Irwinetal2025.txt, which resulted in the production of the first files listed above ending in 012minus1, indv, and pos.
These three files were then imported into scripts written in the Julia programming language. These scripts are shown and described at https://darreni.github.io/GreenishWarblerGenomics2025/. The underlying Quarto files containing Julia code blocks are also provided in this repository, with these filenames (listed in the order in which their code blocks should be run):
GreenishWarblerGenomics2025.qmd
GW_Zchromosome_analysis.qmd
GW_PCAplots.qmd
GW_Heterozygosity_Variance.qmd
The above scripts use a new registered Julia package made available along with this paper: GenomicDiversity.jl, available at https://github.com/darreni/GenomicDiversity.jl and through the official Julia registry.
Access information
Associated data at other locations:
- The new reference genome is provided at NCBI under PRJNA1210605.
- New GBS reads have been deposited at NCBI SRA under accession PRJNA1207594; within this accession are data for 3 sets (i.e., plates) of samples: runs SRR31958018, SRR31958020, and SRR31958019.
- This study also used GBS reads from a previously-sequenced set of samples, run SRR1176844 from NCBI accession PRJNA238841 (Alcaide et al., 2014, Nature).
Associated software at other locations:
- Julia functions used in data processing and graphing are provided in the new GenomicDiversity.jl package (https://github.com/darreni/GenomicDiversity.jl), and the complete analysis scripts that call these functions are available at https://darreni.github.io/GreenishWarblerGenomics2025 and at a Github repository (https://github.com/darreni/GreenishWarblerGenomics2025).
This Dryad archive contains genotypes of 257 individual greenish warblers (Phylloscopus trochiloides / P. plumbeitarsus / P. nitidus) at more than 1 million loci located on a newly generated high-quality reference genome.