Data from: the great tit HapMap project: a continental-scale analysis of genomic variation in a songbird

Published Apr 13, 2024 on Dryad. https://doi.org/10.5061/dryad.w3r2280z5

Abstract

A major aim of evolutionary biology is to understand why patterns of genomic diversity vary within taxa and space. Large-scale genomic studies of widespread species are useful for studying how environment and demography shape patterns of genomic divergence. Here, we describe one of the most geographically comprehensive surveys of genomic variation in a wild vertebrate to date: the great tit (Parus major) HapMap project. We screened ca. 500,000 SNP markers across 647 individuals from 29 populations, spanning ~30 degrees of latitude and 40 degrees of longitude - almost the entire geographic range of the European subspecies. Genome-wide variation was consistent with a recent colonisation across Europe from a South-East European refugium, with bottlenecks and reduced genetic diversity in island populations. Differentiation across the genome was highly heterogeneous, with clear “islands of differentiation”, even among populations with very low levels of genome-wide differentiation. Low local recombination rates were a strong predictor of high local genomic differentiation (F_ST), especially in island and peripheral mainland populations, suggesting that the interplay between genetic drift and recombination causes highly heterogeneous differentiation landscapes. We also detected genomic outlier regions that were confined to one or more peripheral great tit populations, probably as a result of recent directional selection at the species’ range edges. Haplotype-based measures of selection were related to recombination rate, albeit less strongly, and highlighted population-specific sweeps that likely resulted from positive selection. Our study highlights how comprehensive screens of genomic variation in wild organisms can provide unique insights into spatio-temporal evolutionary dynamics.

The data are the input and output files from a series of population genetics analyses performed on single nucleotide polymorphism (SNP) data generated from populations of great tits (Parus major) distributed around Europe. The majority of the analyses were run in Plink (Version 1.9) and R (Version 3.3). The pipelines and scripts used to generate the results are available on GitHub at https://github.com/lgs85/SpurginBosse_Hapmap/tree/main.

Description of the data and file structure

Plink files:

HapMapMajor.bed, HapMapMajor.fam and HapMapMajor.bim are Plink formatted binary files (see https://www.cog-genomics.org/plink/1.9/input#bed) of the data before any filtering. The .fam file contains sample information and the .bim file contains marker/genomic information. The .bed file contains the binary-formatted genotypes and is not readable as a text file.

HapMapMajor.fam columns:

Population (Plink users will know this column as FID)
Sample ID (Plink users will know this column as IID)
Father ID (0 = unknown)
Mother ID (0 = unknown)
Sex (1 = Male, 2 = Female, 0 = Unknown)
Phenotype (always -9, for dummy phenotype, as no phenotype used in our analyses)
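As a rough illustration of the column layout above, the following Python sketch reads a .fam file and counts the birds in each population. Plink .fam files are whitespace-delimited with no header row. The function names, dictionary keys and example rows are made up for this sketch and are not part of the dataset or the authors' pipeline.

```python
import io
from collections import Counter

# Keys are illustrative labels for the six .fam columns described above.
FAM_COLUMNS = ["population", "sample_id", "father_id", "mother_id", "sex", "phenotype"]

def read_fam(handle):
    """Parse a Plink .fam file (whitespace-delimited, no header row)."""
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(FAM_COLUMNS):
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        rows.append(dict(zip(FAM_COLUMNS, fields)))
    return rows

def samples_per_population(rows):
    """Count individuals in each population (the FID column)."""
    return Counter(row["population"] for row in rows)

# Made-up rows for illustration only; the real IDs are in HapMapMajor.fam.
fam_example = io.StringIO(
    "PopA bird1 0 0 1 -9\n"
    "PopA bird2 0 0 2 -9\n"
    "PopB bird3 0 0 0 -9\n"
)
counts = samples_per_population(read_fam(fam_example))
```

With the real data, one would pass an open file instead, e.g. `read_fam(open("HapMapMajor.fam"))`.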

HapMapMajor.bim columns:

Chromosome (an integer)
SNP name
Position in centiMorgans (here always set to 0 as unknown)
Position in BP
Reference Allele
Alternative Allele

HapMapMajorPruned.bed, HapMapMajorPruned.fam and HapMapMajorPruned.bim are as above, except that they were generated by the filtering steps described in the paper. The filtering parameters are --geno 0.8 --maf 0.01 --indep-pairwise 50 10 0.1 --thin 0.25 --not-chr 30-32,34-36
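The --not-chr argument lists the chromosome codes dropped from the pruned dataset. As a rough illustration, the Python sketch below expands that range string and applies it to rows in .bim layout. It mirrors only the chromosome exclusion, not the missingness, allele-frequency, LD-pruning or thinning steps, which are best left to Plink itself. The function names and example rows are made up for illustration.

```python
import io

def parse_chr_ranges(spec):
    """Expand a range string such as '30-32,34-36' into a set of integers."""
    codes = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes

def keep_bim_rows(handle, excluded):
    """Yield .bim rows (as lists of fields) whose chromosome is not excluded."""
    for line in handle:
        fields = line.split()
        if fields and int(fields[0]) not in excluded:
            yield fields

excluded = parse_chr_ranges("30-32,34-36")

# Made-up rows in .bim layout: chromosome, SNP name, cM, bp, allele, allele.
bim_example = io.StringIO(
    "1 snpA 0 1000 A G\n"
    "31 snpB 0 2000 C T\n"
    "33 snpC 0 3000 G A\n"
)
kept = [fields[1] for fields in keep_bim_rows(bim_example, excluded)]
```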

Other input files:

LatLongAllPops.txt is a text file describing the location of each population studied.

Population
Latitude
Longitude
Country (to enable pooling of populations from the same country)

recomb_jon_500kb.txt is a text file with estimated local recombination rates (measured in centimorgans per Mbp) at 500kbp intervals. It has three columns:

Chromosome
Position that the 500kbp window starts at.
Local recombination rate for that window (measured in cM/Mbp)

gene_density10kb.csv and gene_density500kb.csv are two comma-delimited text files that report the gene density across the genome, using windows of either 10kbp or 500kbp.

CHROM - chromosome
WINDOW_START - the position of the first base in the window
WINDOW_STOP - the position of the last base in the window
GENE_BP - the proportion of bases in that window that are within genes
GENES_PER_MB - the gene density in that window, measured in genes per million bases.

Output Files:

HapMapMajorPruned.1.Q - HapMapMajorPruned.10.Q.

Output files from analyses with the program Admixture. Each row is a different bird, in the same order as in the HapMapMajorPruned.fam file. Each column represents the probability of assigning that individual to population n, where there are n possible populations (i.e. files *.1.Q to *.10.Q contain 1 - 10 columns respectively).
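Because the .Q rows follow the order of HapMapMajorPruned.fam, the two files can be paired row by row. The Python sketch below shows one way to do this and to label each bird with its highest-scoring cluster. Admixture .Q files are whitespace-delimited with no header. The function names, bird IDs and values are made up for illustration.

```python
import io

def read_q(handle):
    """Parse an Admixture .Q file: one row per bird, one column per cluster."""
    return [[float(x) for x in line.split()] for line in handle if line.strip()]

def best_cluster(proportions):
    """Return the 0-based index of the largest value in one row."""
    return max(range(len(proportions)), key=lambda k: proportions[k])

def label_birds(sample_ids, q_rows):
    """Pair sample IDs (in .fam order) with their highest-scoring cluster."""
    if len(sample_ids) != len(q_rows):
        raise ValueError("the .fam and .Q files must have the same number of rows")
    return {bird: best_cluster(row) for bird, row in zip(sample_ids, q_rows)}

# Made-up values for a two-cluster run; real values are in HapMapMajorPruned.2.Q.
q_example = io.StringIO("0.90 0.10\n0.25 0.75\n")
labels = label_birds(["bird1", "bird2"], read_q(q_example))
```

Taking the single largest column discards information about mixed assignments, so the full rows are usually what gets plotted.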

CV_error.txt A text file describing the proportion of birds assigned to the 'wrong' population in the Admixture analysis, for runs of 1 to 10 populations.

K - the number of populations in the Admixture run (values increase from 1 to 10)
CV - the proportion of birds for that value of K that were assigned to the wrong population in the cross-validation analysis

PairwiseFST.txt A text file containing values of genomewide population differentiation (Fst) between each pair of populations.

Column 1 is the first population
Column 2 is the second population
Column 3 is the value of Fst between populations 1 and 2
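The three-column layout can be turned into a symmetric lookup, as in the Python sketch below. It assumes whitespace-delimited columns and population names without spaces; whether the file has a header row is not stated here, so that is exposed as a parameter. All names and values in the example are made up.

```python
import io

def read_pairwise_fst(handle, has_header=False):
    """Map each unordered population pair to its Fst value."""
    fst = {}
    for i, line in enumerate(handle):
        fields = line.split()
        if not fields or (has_header and i == 0):
            continue  # skip blank lines and, if present, the header row
        pop1, pop2, value = fields[0], fields[1], float(fields[2])
        fst[frozenset((pop1, pop2))] = value
    return fst

def lookup(fst, pop_a, pop_b):
    """Return Fst for a pair, regardless of the order the pair is given in."""
    if pop_a == pop_b:
        return 0.0  # a population is not differentiated from itself
    return fst[frozenset((pop_a, pop_b))]

# Made-up values for illustration; real values are in PairwiseFST.txt.
fst_example = io.StringIO("PopA PopB 0.012\nPopA PopC 0.030\n")
fst = read_pairwise_fst(fst_example)
```

Keying on a frozenset means `lookup(fst, "PopB", "PopA")` and `lookup(fst, "PopA", "PopB")` return the same value.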

Turkey.fst A text file describing the Fst between each population and the Turkey population. Fst is calculated in 500kbp windows.

CHROM - the chromosome
BIN_START - the position (in bp) where the window starts
BIN_END - the position (in bp) where the window ends
N_VARIANTS - the number of SNPs in that window
WEIGHTED_FST - The weighted (across SNPs) Fst in that window between the Turkish population and the population in column Pop.
MEAN_FST - The mean (across SNPs) Fst in that window between the Turkish population and the population in column Pop.
Pop - the population being compared to Turkey.

HapMapMajor10kb.windowed.fst and HapMapMajor500kb.windowed.fst Two text files, with estimates of Fst calculated between all populations. Fst was calculated in windows of 10kbp and 500kbp. The number of SNPs per window is also reported as windows with zero SNPs cannot be used to estimate Fst.

CHROM - the chromosome
WINDOW_START - the position (in bp) where the window starts
WINDOW_STOP - the position (in bp) where the window ends
N_SNPs - the number of SNPs in that window
FST - the mean (across SNPs) Fst in that window, across all populations.
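Since windows without SNPs carry no Fst estimate, they need to be dropped before summarising. The Python sketch below filters on N_SNPs first and only then converts FST to a number. It assumes a whitespace-delimited file with the header names listed above and assumes that empty windows hold a placeholder such as NA; both are guesses about the layout, and the example rows are made up.

```python
import io

def read_table(handle):
    """Read a whitespace-delimited table with a header row into dicts."""
    header = handle.readline().split()
    return [dict(zip(header, line.split())) for line in handle if line.strip()]

def windows_with_snps(rows):
    """Keep only windows that contain at least one SNP."""
    return [row for row in rows if int(row["N_SNPs"]) > 0]

def mean_fst(rows):
    """Unweighted mean of the per-window FST values."""
    values = [float(row["FST"]) for row in rows]
    return sum(values) / len(values)

# Made-up rows for illustration; the "NA" placeholder is an assumption.
fst_windows = io.StringIO(
    "CHROM WINDOW_START WINDOW_STOP N_SNPs FST\n"
    "1 1 500000 120 0.02\n"
    "1 500001 1000000 0 NA\n"
    "1 1000001 1500000 80 0.06\n"
)
usable = windows_with_snps(read_table(fst_windows))
genome_mean = mean_fst(usable)
```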

HapMapLD.txt A text file describing the amount of linkage disequilibrium in each population, at a given distance between 100bp and 50,000 bp. First column is distance (in bp) and second column is mean r^2 (a measure of linkage disequilibrium).

Column 1 - the distance over which LD is being measured, e.g. 100 means LD is measured between SNPs 1-100 bp apart, 200 means LD is measured between SNPs 101-200 bp apart, etc.
Column 2 - the mean r^2 between all SNPs in that interval
Column 3 - the number of SNPs used to measure LD (r^2) in that interval
Column 4 - the country containing the populations in which that LD is being measured.

HapMapMajor.eigenvec. A principal component analysis was run on the HapMapMajor dataset. The first 6 eigenvector scores of each individual are reported. Individual IDs cross-reference with the HapMapMajor.fam file.

Column 1 is the population (same as first column in the .fam file)
Column 2 is the individual ID (same as second column in the .fam file)
Columns 3-6 are the eigenvector scores of each individual for principal components 1-4 (column 3 = PC1, column 4 = PC2, etc)

21 different zipped .csv (i.e. comma-delimited text) files. Each csv file is named after a different population e.g. Austria.csv.gz, etc. Each csv file contains information from a comparison between the focal population and the Turkish population (to represent the refugial population). For 500kbp windows the diversity in each population is reported, along with two measures of differentiation (dxy and Fst) between the two populations.

scaffold - the chromosome
start - the first position (in bp) in that window
end - the last position (in bp) in that window
mid - the mean position (in bp) of each SNP in that window
sites - the number of SNPs in that window
pi_XX - nucleotide diversity in the focal population in that window, where XX is the focal population e.g. Switzerland.
pi_Turkey - nucleotide diversity in the Turkey population in that window.
dxy_XX_Turkey - the between population diversity, in that window, between XX and Turkey, where XX is the focal population e.g. Switzerland.
Fst_XX_Turkey - the between population Fst in that window, between XX and Turkey, where XX is the focal population e.g. Switzerland.
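Because several column names embed the focal population, a reader has to build them from the file name. The Python sketch below does this for one gzipped file, assuming the header uses exactly the names listed above with XX replaced by the population (e.g. pi_Austria) and that all values parse as numbers. The function name and example values are made up.

```python
import csv
import gzip
import io

def read_comparison(source, focal):
    """Read one <focal>.csv.gz file; source is a path or a binary file object."""
    windows = []
    with gzip.open(source, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            windows.append({
                "scaffold": row["scaffold"],
                "start": int(row["start"]),
                "end": int(row["end"]),
                "pi_focal": float(row[f"pi_{focal}"]),
                "pi_Turkey": float(row["pi_Turkey"]),
                "dxy": float(row[f"dxy_{focal}_Turkey"]),
                "fst": float(row[f"Fst_{focal}_Turkey"]),
            })
    return windows

# Made-up, in-memory stand-in for Austria.csv.gz.
text = (
    "scaffold,start,end,mid,sites,pi_Austria,pi_Turkey,"
    "dxy_Austria_Turkey,Fst_Austria_Turkey\n"
    "1,1,500000,250000,100,0.2,0.4,0.35,0.05\n"
)
windows = read_comparison(io.BytesIO(gzip.compress(text.encode())), "Austria")
```

With the real data the call would be `read_comparison("Austria.csv.gz", "Austria")`. Missing values written as something other than a number would need handling before the float conversion.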

Hapmap_rsb.txt A text file containing output from an analysis of three populations (Finland, Spain and Westerheide in The Netherlands). Each row contains data from a two-population pairwise comparison in a 500kb window. In each window the statistics Fst and Rsb (calculated using the R package REHH) are reported, along with the mean recombination rate (in cM/Mbp) in that interval.

comparison - the two populations being compared
CHR - the chromosome
window - the first position in that 500kbp window (in bp)
meanFST - the mean FST averaged across all SNPs in that window
meanRsb - the mean RSB averaged across all SNPs in that window
MEAN-cM - the mean recombination rate (measured in cM/Mbp) in that 500kbp window
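One simple use of these columns is to check, per comparison, how window Fst covaries with recombination rate. The Python sketch below computes a Pearson correlation for each comparison. It is only an illustration, not the analysis used in the paper, and it assumes a tab-delimited file with the header names listed above and no missing values. The comparison label and numbers are made up.

```python
import csv
import io
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fst_vs_recombination(handle):
    """Correlate meanFST with MEAN-cM separately for each comparison."""
    by_comparison = defaultdict(lambda: ([], []))
    for row in csv.DictReader(handle, delimiter="\t"):
        fst_values, rec_values = by_comparison[row["comparison"]]
        fst_values.append(float(row["meanFST"]))
        rec_values.append(float(row["MEAN-cM"]))
    return {name: pearson(f, r) for name, (f, r) in by_comparison.items()}

# Made-up rows in which Fst falls as recombination rate rises.
rsb_example = io.StringIO(
    "comparison\tCHR\twindow\tmeanFST\tmeanRsb\tMEAN-cM\n"
    "PopA_PopB\t1\t1\t0.3\t0.5\t1.0\n"
    "PopA_PopB\t1\t500001\t0.2\t0.1\t2.0\n"
    "PopA_PopB\t1\t1000001\t0.1\t-0.2\t3.0\n"
)
correlations = fst_vs_recombination(rsb_example)
```

A comparison whose values are all identical would give zero variance and raise a division error, so real use would need a guard for that case.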

Code/Software

All code to run the analyses is available at https://github.com/lgs85/SpurginBosse_Hapmap/tree/main
