Data from: the great tit HapMap project: a continental-scale analysis of genomic variation in a songbird
Data files
Apr 13, 2024 version files 119.51 MB
Abstract
A major aim of evolutionary biology is to understand why patterns of genomic diversity vary within taxa and space. Large-scale genomic studies of widespread species are useful for studying how environment and demography shape patterns of genomic divergence. Here, we describe one of the most geographically comprehensive surveys of genomic variation in a wild vertebrate to date: the great tit (Parus major) HapMap project. We screened ca 500,000 SNP markers across 647 individuals from 29 populations, spanning ~30 degrees of latitude and 40 degrees of longitude - almost the entire geographic range of the European subspecies. Genome-wide variation was consistent with a recent colonisation across Europe from a South-East European refugium, with bottlenecks and reduced genetic diversity in island populations. Differentiation across the genome was highly heterogeneous, with clear “islands of differentiation”, even among populations with very low levels of genome-wide differentiation. Low local recombination rates were a strong predictor of high local genomic differentiation (FST), especially in island and peripheral mainland populations, suggesting that the interplay between genetic drift and recombination causes highly heterogeneous differentiation landscapes. We also detected genomic outlier regions that were confined to one or more peripheral great tit populations, probably as a result of recent directional selection at the species’ range edges. Haplotype-based measures of selection were related to recombination rate, albeit less strongly, and highlighted population-specific sweeps that likely resulted from positive selection. Our study highlights how comprehensive screens of genomic variation in wild organisms can provide unique insights into spatio-temporal evolutionary dynamics.
README: Data from: the great tit HapMap project: a continental-scale analysis of genomic variation in a songbird
https://doi.org/10.5061/dryad.w3r2280z5
The data are the input and output files from a series of population genetics analyses performed on single nucleotide polymorphism (SNP) data generated from populations of great tits (Parus major) distributed around Europe. The majority of the analyses were run in Plink (Version 1.9) and R (Version 3.3). The pipelines and scripts used to generate the results are available on GitHub at https://github.com/lgs85/SpurginBosse_Hapmap/tree/main.
Description of the data and file structure
Plink files:
HapMapMajor.bed, HapMapMajor.fam and HapMapMajor.bim are Plink formatted binary files (see https://www.cog-genomics.org/plink/1.9/input#bed) of the data before any filtering. The .fam file contains sample information and the .bim file contains marker/genomic information. The .bed file contains the binary-formatted genotypes and is not readable as a text file.
HapMapMajor.fam columns:
- Population (Plink users will know this column as FID)
- Sample ID (Plink users will know this column as IID)
- Father ID (0 = unknown)
- Mother ID (0 = unknown)
- Sex (1 = Male, 2 = Female, 0 = Unknown)
- Phenotype (always -9, for dummy phenotype, as no phenotype used in our analyses)
HapMapMajor.bim columns:
- Chromosome (an integer)
- SNP name
- Position in centiMorgans (here always set to 0 as unknown)
- Position in BP
- Reference Allele
- Alternative Allele
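As a minimal sketch of how the six columns of each file map onto named fields, the Python below parses a single line of a .fam or .bim file. It assumes whitespace-delimited text, as in the standard Plink format; the class names, field names and example lines are hypothetical illustrations and are not records taken from the dataset.

```python
from typing import NamedTuple


class FamRecord(NamedTuple):
    population: str  # column 1 (Plink FID)
    sample_id: str   # column 2 (Plink IID)
    father_id: str   # "0" = unknown
    mother_id: str   # "0" = unknown
    sex: int         # 1 = male, 2 = female, 0 = unknown
    phenotype: str   # always "-9" in this dataset


class BimRecord(NamedTuple):
    chromosome: str
    snp_name: str
    position_cm: float  # always 0 here (unknown)
    position_bp: int
    reference_allele: str
    alternative_allele: str


def parse_fam_line(line: str) -> FamRecord:
    """Split one .fam line into its six columns."""
    fid, iid, father, mother, sex, phenotype = line.split()
    return FamRecord(fid, iid, father, mother, int(sex), phenotype)


def parse_bim_line(line: str) -> BimRecord:
    """Split one .bim line into its six columns."""
    chrom, snp, cm, bp, ref, alt = line.split()
    return BimRecord(chrom, snp, float(cm), int(bp), ref, alt)


# Hypothetical example lines (not real records from the files).
fam = parse_fam_line("Austria bird_001 0 0 2 -9")
bim = parse_bim_line("1 snp_0001 0 12345 A G")
print(fam.population, fam.sex, bim.position_bp)
```

The .bed file is binary and cannot be read this way; it is best loaded with Plink itself or a library that understands the format.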
HapMapMajorPruned.bed, HapMapMajorPruned.fam and HapMapMajorPruned.bim are in the same format as above, but contain the data after the filtering steps described in the paper. The Plink filtering parameters were --geno 0.8 --maf 0.01 --indep-pairwise 50 10 0.1 --thin 0.25 --not-chr 30-32,34-36
Other input files:
LatLongAllPops.txt is a text file describing the location of each population studied.
- Population
- Latitude
- Longitude
- Country (to enable pooling of populations from the same country)
recomb_jon_500kb.txt is a text file with estimated local recombination rates (measured in cM/Mbp) at 500kbp intervals. The first two columns are chromosome and position. The third column is the local recombination rate.
- Chromosome
- Position that the 500kbp window starts at.
- Local recombination rate for that window (measured in cM/Mbp)
gene_density10kb.csv and gene_density500kb.csv are two comma-delimited text files that report the gene density across the genome, using windows of either 10kbp or 500kbp.
- CHROM - chromosome
- WINDOW_START - the position of the first base in the window
- WINDOW_STOP - the position of the last base in the window
- GENE_BP - the proportion of bases in that window that are within genes
- GENES_PER_MB - the gene density in that window, measured in genes per million bases.
Output Files:
HapMapMajorPruned.1.Q - HapMapMajorPruned.10.Q.
Output files from analyses with the program Admixture. Each row is a different bird, in the same order as in the HapMapMajorPruned.fam file. Each column represents the probability of assigning that individual to population n, where there are n possible populations (i.e. files *.1.Q to *.10.Q contain 1 to 10 columns respectively).
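Because the .Q files carry no sample identifiers, rows have to be matched to birds by position. The sketch below pairs each row with a sample ID and reports the column with the largest value. It assumes whitespace-delimited .Q files with no header; the function names and example values are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_q(text: str) -> List[List[float]]:
    """Parse the whitespace-delimited contents of a .Q file."""
    return [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]


def top_cluster(sample_ids: List[str],
                q_rows: List[List[float]]) -> Dict[str, Tuple[int, float]]:
    """Map each sample ID to (1-based column with the largest value, that value)."""
    if len(sample_ids) != len(q_rows):
        raise ValueError("the .fam and .Q files must have the same number of rows")
    result = {}
    for sample_id, row in zip(sample_ids, q_rows):
        # On ties, the first (lowest-numbered) column is reported.
        best = max(range(len(row)), key=lambda i: row[i])
        result[sample_id] = (best + 1, row[best])
    return result


# Hypothetical contents of a two-column file for three birds.
q_rows = parse_q("0.9 0.1\n0.25 0.75\n0.5 0.5\n")
assignments = top_cluster(["bird_001", "bird_002", "bird_003"], q_rows)
print(assignments)
```

In practice the sample IDs would come from the second column of HapMapMajorPruned.fam, read in file order.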
CV_error.txt A text file describing the proportion of birds assigned to the 'wrong' population in the Admixture analysis, for runs of 1 to 10 populations.
- K - the number of populations in the Admixture run (values increase from 1 to 10)
- CV - the proportion of birds for that value of K that were assigned to the wrong population in the cross-validation analysis
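One common way to use such a table is to look for the value of K with the lowest cross-validation error. The sketch below does that under the assumption that the file is whitespace-delimited with K in the first column and CV in the second; it skips any line whose first field is not an integer (such as a header). The numbers are invented for illustration, and this is not necessarily how K was chosen in the paper.

```python
from typing import List, Tuple


def parse_cv(text: str) -> List[Tuple[int, float]]:
    """Return (K, CV) pairs, ignoring lines that do not start with an integer."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            rows.append((int(fields[0]), float(fields[1])))
        except ValueError:
            continue  # e.g. a header line such as "K CV"
    return rows


def lowest_cv(rows: List[Tuple[int, float]]) -> Tuple[int, float]:
    """Return the (K, CV) pair with the smallest CV value."""
    return min(rows, key=lambda row: row[1])


# Hypothetical values, not taken from CV_error.txt.
cv_rows = parse_cv("K CV\n1 0.50\n2 0.40\n3 0.45\n")
best_k, best_cv = lowest_cv(cv_rows)
print(best_k, best_cv)
```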
PairwiseFST.txt A text file containing values of genomewide population differentiation (Fst) between each pair of populations.
- Column 1 is the first population
- Column 2 is the second population
- Column 3 is the value of Fst between populations 1 and 2
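The three-column layout lists one population pair per row. For plotting or clustering it is often convenient to reshape this into a square, symmetric matrix; the sketch below shows one way to do so. The function name is hypothetical and the Fst values are invented, and the diagonal is set to 0 on the assumption that a population is not compared with itself in the file.

```python
from typing import List, Tuple


def fst_matrix(triples: List[Tuple[str, str, float]]) -> Tuple[List[str], List[List[float]]]:
    """Turn (population 1, population 2, Fst) rows into a symmetric matrix."""
    populations = sorted({p for a, b, _ in triples for p in (a, b)})
    index = {p: i for i, p in enumerate(populations)}
    matrix = [[0.0] * len(populations) for _ in populations]
    for a, b, fst in triples:
        matrix[index[a]][index[b]] = fst
        matrix[index[b]][index[a]] = fst  # Fst is symmetric between a pair
    return populations, matrix


# Hypothetical pairwise values for three populations.
pairs = [("Austria", "Turkey", 0.02), ("Austria", "Spain", 0.01), ("Spain", "Turkey", 0.03)]
populations, matrix = fst_matrix(pairs)
print(populations)
print(matrix)
```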
Turkey.fst A text file describing the Fst between each population and the Turkey population. Fst is calculated in 500kbp windows.
- CHROM - the chromosome
- BIN_START - the position (in bp) where the window starts
- BIN_END - the position (in bp) where the window ends
- N_VARIANTS - the number of SNPs in that window
- WEIGHTED_FST - The weighted (across SNPs) Fst in that window between the Turkish population and the population in column Pop.
- MEAN_FST - The mean (across SNPs) Fst in that window between the Turkish population and the population in column Pop.
- Pop - the population being compared to Turkey.
HapMapMajor10kb.windowed.fst and HapMapMajor500kb.windowed.fst Two text files, with estimates of Fst calculated between all populations. Fst was calculated in windows of 10kbp and 500kbp. The number of SNPs per window is also reported as windows with zero SNPs cannot be used to estimate Fst.
- CHROM - the chromosome
- WINDOW_START - the position (in bp) where the window starts
- WINDOW_STOP - the position (in bp) where the window ends
- N_SNPs - the number of SNPs in that window
- FST - the mean (across SNPs) Fst in that window, across all populations.
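Since windows with zero SNPs carry no Fst estimate, they need to be skipped before summarising. The sketch below computes a genome-wide mean by weighting each window's mean Fst by its SNP count, which is equivalent to averaging across SNPs. It assumes a whitespace-delimited file with a header row using the column names listed above; the function name and the example rows are hypothetical.

```python
def snp_weighted_mean_fst(text: str) -> float:
    """Mean Fst across SNPs, computed from per-window means and SNP counts."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    n_col = header.index("N_SNPs")
    fst_col = header.index("FST")
    total_snps = 0
    weighted_sum = 0.0
    for row in rows:
        n_snps = int(row[n_col])
        if n_snps == 0:
            continue  # no estimate possible; the FST field is never parsed
        total_snps += n_snps
        weighted_sum += n_snps * float(row[fst_col])
    if total_snps == 0:
        raise ValueError("no windows with at least one SNP")
    return weighted_sum / total_snps


# Hypothetical rows: the middle window has no SNPs and is skipped.
example = (
    "CHROM WINDOW_START WINDOW_STOP N_SNPs FST\n"
    "1 1 500000 10 0.5\n"
    "1 500001 1000000 0 NA\n"
    "1 1000001 1500000 30 0.25\n"
)
mean_fst = snp_weighted_mean_fst(example)
print(mean_fst)
```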
HapMapLD.txt A text file describing the amount of linkage disequilibrium in each population, at a given distance between 100bp and 50,000 bp. First column is distance (in bp) and second column is mean r^2 (a measure of linkage disequilibrium).
- Column 1 - the distance over which LD is being measured e.g. 100 means LD is measured between SNPs 1-100bp apart. 200 means LD measured in the distance from 101-200 bp, etc
- Column 2 - the mean r^2 between all SNPs in that interval
- Column 3 - the number of SNPs used to measure LD (r^2) in that interval
- Column 4 - the country containing the populations in which that LD is being measured.
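The binning rule in Column 1 can be written as a small function: a pair of SNPs separated by d bp falls in the bin labelled with the upper bound of its interval. The sketch below assumes a constant bin width of 100 bp up to 50,000 bp, which is inferred from the examples above rather than stated explicitly; the function name is hypothetical.

```python
def ld_bin(distance_bp: int, width: int = 100, max_distance: int = 50_000) -> int:
    """Return the Column 1 label for a pair of SNPs distance_bp apart.

    With width=100, distances 1-100 map to 100, 101-200 map to 200, and so on.
    """
    if not 1 <= distance_bp <= max_distance:
        raise ValueError("distance outside the range covered by the file")
    return ((distance_bp - 1) // width + 1) * width


for d in (1, 100, 101, 200, 50_000):
    print(d, ld_bin(d))
```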
HapMapMajor.eigenvec A principal component analysis was run on the HapMapMajor dataset. The first 6 eigenvector scores of each individual are reported. Individual IDs cross-reference with the HapMapMajor.fam file.
- Column 1 is the population (same as first column in the .fam file)
- Column 2 is the individual ID (same as second column in the .fam file)
- Columns 3-6 are the eigenvector scores of each individual for principal components 1-4 (column 3 = PC1, column 4 = PC2, etc)
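To illustrate the column layout, the sketch below reads eigenvector scores and averages them within each population, giving one centroid per population on PC1 to PC4. It assumes whitespace-delimited lines with no header row; the function name and the example values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def population_centroids(text: str) -> Dict[str, List[float]]:
    """Average the scores in columns 3-6 (PC1-PC4) within each population."""
    sums: Dict[str, List[float]] = defaultdict(lambda: [0.0] * 4)
    counts: Dict[str, int] = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        population = fields[0]
        scores = [float(v) for v in fields[2:6]]
        sums[population] = [s + x for s, x in zip(sums[population], scores)]
        counts[population] += 1
    return {pop: [s / counts[pop] for s in totals] for pop, totals in sums.items()}


# Hypothetical scores for three birds from two populations.
example = (
    "Austria bird_001 0.5 0.0 1.0 -1.0\n"
    "Austria bird_002 1.5 2.0 1.0 1.0\n"
    "Spain bird_003 -0.25 0.5 0.0 0.0\n"
)
centroids = population_centroids(example)
print(centroids)
```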
21 different zipped .csv (i.e. comma-delimited text) files. Each csv file is named after a different population e.g. Austria.csv.gz, etc. Each csv file contains information from a comparison between the focal population and the Turkish population (to represent the refugial population). For 500kbp windows the diversity in each population is reported, along with two measures of differentiation (dxy and Fst) between the two populations.
- scaffold - the chromosome
- start - the first position (in bp) in that window
- end - the last position (in bp) in that window
- mid - the mean position (in bp) of the SNPs in that window
- sites - the number of SNPs in that window
- pi_XX - nucleotide diversity in the focal population in that window, where XX is the focal population e.g. Switzerland.
- pi_Turkey - nucleotide diversity in the Turkey population in that window.
- dxy_XX_Turkey - the between population diversity, in that window, between XX and Turkey, where XX is the focal population e.g. Switzerland.
- Fst_XX_Turkey - the between population Fst in that window, between XX and Turkey, where XX is the focal population e.g. Switzerland.
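Because three of the column names embed the focal population, a reader has to build them from the population name. The sketch below does this with Python's csv module, assuming the header uses exactly the names listed above. The function name, the output keys and the example values are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, Union


def read_population_windows(handle: Iterable[str],
                            population: str) -> List[Dict[str, Union[str, int, float]]]:
    """Read per-window diversity and differentiation for one focal population."""
    pi_col = f"pi_{population}"
    dxy_col = f"dxy_{population}_Turkey"
    fst_col = f"Fst_{population}_Turkey"
    windows = []
    for row in csv.DictReader(handle):
        windows.append({
            "scaffold": row["scaffold"],
            "start": int(row["start"]),
            "end": int(row["end"]),
            "sites": int(row["sites"]),
            "pi_focal": float(row[pi_col]),
            "pi_turkey": float(row["pi_Turkey"]),
            "dxy": float(row[dxy_col]),
            "fst": float(row[fst_col]),
        })
    return windows


# Hypothetical two-window example in the documented layout.
example = io.StringIO(
    "scaffold,start,end,mid,sites,pi_Austria,pi_Turkey,dxy_Austria_Turkey,Fst_Austria_Turkey\n"
    "1,1,500000,250000,120,0.25,0.5,0.375,0.125\n"
    "1,500001,1000000,750000,80,0.5,0.5,0.5,0.0\n"
)
windows = read_population_windows(example, "Austria")
print(windows[0]["fst"], windows[1]["sites"])
```

For the compressed files themselves, the same function could be given a handle opened with `gzip.open("Austria.csv.gz", "rt", newline="")` from the standard library.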
Hapmap_rsb.txt A text file containing output from an analysis of three populations (Finland, Spain and Westerheide in The Netherlands). Each row contains data from a two-population pairwise comparison in a 500kbp window. In each window the statistics Fst and Rsb (calculated using the R package rehh) are reported, along with the mean recombination rate (in cM/Mbp) in that interval.
- comparison - the two populations being compared
- CHR - the chromosome
- window - the first position in that 500kbp window (in bp)
- meanFST - the mean FST averaged across all SNPs in that window
- meanRsb - the mean Rsb averaged across all SNPs in that window
- MEAN-cM - the mean recombination rate (measured in cM/Mbp) in that 500kbp window
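As an illustration of how these columns could be used together, the sketch below computes a Pearson correlation between recombination rate (MEAN-cM) and meanFST across the windows of one comparison. This is a simple stand-in chosen for the example and is not the statistical model used in the paper; the function name, the comparison label and the values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical rows keyed by the documented column names.
rows: List[Dict[str, object]] = [
    {"comparison": "Finland_Spain", "meanFST": 0.3, "MEAN-cM": 1.0},
    {"comparison": "Finland_Spain", "meanFST": 0.2, "MEAN-cM": 2.0},
    {"comparison": "Finland_Spain", "meanFST": 0.1, "MEAN-cM": 3.0},
    {"comparison": "Other_Pair", "meanFST": 0.5, "MEAN-cM": 9.0},
]
selected = [r for r in rows if r["comparison"] == "Finland_Spain"]
r_value = pearson([float(r["MEAN-cM"]) for r in selected],
                  [float(r["meanFST"]) for r in selected])
print(round(r_value, 6))
```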
Code/Software
All code to run the analyses is available at https://github.com/lgs85/SpurginBosse_Hapmap/tree/main
Methods
Genotype data were obtained by typing great tit blood samples collected by a research consortium, mostly from around Europe. Genotyping was performed on an Affymetrix high-density SNP chip. Data were converted to Plink format, and from those Plink files various population genetic analyses were performed. All of the scripts are available on GitHub at https://github.com/lgs85/SpurginBosse_Hapmap/tree/main.
A manuscript describing the work has been submitted to Molecular Ecology Resources.