Quantifying CpG variants in a pied flycatcher population

Data files

Oct 08, 2025 version files 32.27 KB

EP_dispersal_piefly_Dataset.xlsx

25.67 KB
README.md

6.60 KB

Abstract

Natal dispersal is a key life-history trait determining fitness and driving population dynamics, genetic structure and species distributions. Despite existing evidence that not all phenotypes are equally likely to successfully establish in new areas, the mechanistic underpinnings of natal dispersal remain poorly understood. The propensity to disperse into a new environment can be favored by a high degree of phenotypic plasticity which facilitates local adaptation and may be achieved via epigenetic mechanisms, which modify gene expression and enable rapid phenotypic changes. Epigenetic processes occur in particular genomic regions - DNA methylation on CpG sites in vertebrates -, and thus individual genomes may differ in their capacity to be modified epigenetically. This “Epigenetic potential” (EP) may represent the range of phenotypic plasticity attainable by an individual, and be a key determinant of successful settlement in novel areas. We investigated the association between EP – quantified as the number of genome-wide CpG variants – and natal dispersal propensity in a long-term study population of Pied flycatchers (Ficedula hypoleuca) monitored since colonization of a new habitat 35 years ago. We tested this association at three levels, comparing EP between: i) individuals dispersing between and within habitat patches; ii) immigrants to the population and locally-born individuals; and iii) individuals from first (comprising colonizers or their direct descendants) and later generations of the population (consisting of locally-born individuals, which did not show natal dispersal). Results show a significant, positive association between EP and dispersal propensity in comparisons i) and iii), but not ii). Furthermore, CpG variants were non-randomly distributed across the genome, suggesting species- and/or population-specific CpGs being more frequent in promoters and exons. Our findings point to EP playing a role in dispersal propensity at spatial and temporal scales, supporting the idea that epigenetically-driven phenotypic plasticity facilitates dispersal and environmental coping in free-living birds.

This repository contains all scripts, code, and data necessary to recreate the analyses presented in Jimeno et al (2025) 'Epigenetic potential and dispersal propensity in a free-living bird: a spatial and temporal approach' in Molecular Ecology.

The analysis is based on called variant and invariant sites from whole genome resequencing data of Ficedula hypoleuca. All raw reads will be made available on the European Nucleotide Archive. Called variants can be recreated using our standardized genotyping pipeline, which is available here.

Here, we will walk through the steps required to recreate the CpG count data and to perform the final analyses from the study.

Data and software files included

cpg_counter_v0.2: Code for the count of CpG positions.

cpg_totaler_v0.1: Code for the count of number of total CpG within the genome.

cpg_totals_array_saga: Code for running the previous scripts as an array on a supercomputer.

EP_CpG_polyCpG_missingness_calculation: Code for the calculation of CpG and polyCpG totals and sequencing missingness.

EP_Annotation_AllSites: Code for the annotation of CpG sites.

EP_Annotation_CpG_polyCpG_plot_stats: Code for the plot and stats performed on the annotation of CpG sites.

EP_Stats_Models: Code for the statistical models testing the association between dispersal categories and CpG count.

EP_dispersal_piefly_Dataset.xlsx: Main dataset with the study variables, including the information to recreate the analyses testing the association between dispersal categories and CpG count. This dataset contains the following sheets: 1) Counts_CpGvariants, showing the number of genome-wide CpG variants per individual; 2) Counts_CpGvariants_prom, showing the number of CpG variants within gene promoters per individual; 3) Counts_CpGtotals, showing the number of total CpG sites (including variant and non-variant sites) per individual; 4) Variable explanation.

Part 1 - Quantifying epigenetic potential

We used a set of bash commands and R scripts that are necessary to quantify the CpG positions across a population genomic dataset. This repository contains the version of the software used for our analyses, but updated and maintained versions of the scripts can be found here. The pipeline is relatively simple, and we provide instructions here on how to use it.

The pipeline depends on a number of other tools, including

To use everything here, you will need the Ficedula albicolis reference genome, a set of genomic data - i.e. variant calls and information on the genome annotation, stored as a gff. The GFF for F. albicolis is available at the same location as the reference genome.

NB: all examples here use variables as placeholders - i.e. $VCF - remember to switch them for your own paths/files

Step 1: Extract flanking sites

In order to identify CpG positions, we need flanking sites from the reference genome. First, we extract the positions of your variant calls:

bcftools query -f '%CHROM\t%POS\n' $VCF > init.bed

We next calculate the positions of the flanking sites - this is very easy with awk:

awk '{print $1"\t"$2-2"\t"$2+1}' init.bed > variant.bed

Finally we use bedtools to extract these regions from the reference genome:

bedtools getfasta -fi $REF -bed variant.bed -tab > cpg_flanking_sites.bed

We now have a file of flanking sites for the downstream analysis.

Step 2: Count CpG positions

In order to count the CpG positions, we use the R script cpg_counter_v0.2.R. It can be used as a commandline tool like so:

Rscript cpg_counter_v0.2.R -v ${VCF} -f cpg_flanking_sites.bed -g ${GFF} -o $cpg.count.gz

Where the options are as follows:

-v - path to vcf
-f - path to flanking sites bedfile
-g - path to gff
-o - output file path

The script is designed to run on a single chromosome or block of a chromosome. It cannot run across the entire genome at once. Therefore it is advised you split the analyses across the chromosomes or genome windows in order to speed it up.

Step 3: Count CpG totals

If you want to calculate the total CpG site numbers, you can use the cpg_totaler_v0.1.R script. This is more lightweight than the CpG counter facility and is designed to run on very large datasets (i.e. all variant and invariant sites in a genome). It's usuage is almost identical to the counting script.

Rscript cpg_totaler_v0.1.R -v ${VCF} -f ${CHR}_cpg_flanking_sites.bed -g ${GFF} -o ${CHR}_cpg.count.gz

Where the options are as follows:

-v - path to vcf
-f - path to flanking sites bedfile
-g - path to gff
-o - output file path

This script should also be run across chromosomes.

Running the scripts as an array on a supercomputer

In order to demonstrate how to run this analysis on a supercomputer, we have included slurm script - cpg_totals_array_saga.slurm. This script is designed to run on a slurm job scheduler across all chromosomes simultaneously. It also incorporates code to extract flanking sites.

Output explained

Step 4: Calculation of CpG, polyCpG counts and missingness per sample

This step should be run on a supercomputer

EP_CpG_polyCpG_missingness_calculation.R

Step 5: Annotation of all sites in the analysis

For this step you are going to need the collared flycatcher reference genome (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000247815.1/)
This step should be run on a supercomputer
This step requires gtfToGenePred and genePredToBed programs (available at https://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads)

EP_Annotation_AllSites.txt

Step 6: Annotation of identified CpGs and polyCpGs, plot, stats

EP_Annotation_CpG_polyCpG_plot_stats.R

Step 7: Testing the association between CpG number and dispersal categories

This code uses a .csv file containing the output of the calculation of CpG, polyCpG, and missingness per sample.

EP_Stats_Models