Quantifying CpG variants in a pied flycatcher population
Data files
Oct 08, 2025 version files 32.27 KB
-
EP_dispersal_piefly_Dataset.xlsx
25.67 KB
-
README.md
6.60 KB
Abstract
Natal dispersal is a key life-history trait determining fitness and driving population dynamics, genetic structure and species distributions. Despite existing evidence that not all phenotypes are equally likely to successfully establish in new areas, the mechanistic underpinnings of natal dispersal remain poorly understood. The propensity to disperse into a new environment can be favored by a high degree of phenotypic plasticity which facilitates local adaptation and may be achieved via epigenetic mechanisms, which modify gene expression and enable rapid phenotypic changes. Epigenetic processes occur in particular genomic regions - DNA methylation on CpG sites in vertebrates -, and thus individual genomes may differ in their capacity to be modified epigenetically. This “Epigenetic potential” (EP) may represent the range of phenotypic plasticity attainable by an individual, and be a key determinant of successful settlement in novel areas. We investigated the association between EP – quantified as the number of genome-wide CpG variants – and natal dispersal propensity in a long-term study population of Pied flycatchers (Ficedula hypoleuca) monitored since colonization of a new habitat 35 years ago. We tested this association at three levels, comparing EP between: i) individuals dispersing between and within habitat patches; ii) immigrants to the population and locally-born individuals; and iii) individuals from first (comprising colonizers or their direct descendants) and later generations of the population (consisting of locally-born individuals, which did not show natal dispersal). Results show a significant, positive association between EP and dispersal propensity in comparisons i) and iii), but not ii). Furthermore, CpG variants were non-randomly distributed across the genome, suggesting species- and/or population-specific CpGs being more frequent in promoters and exons. Our findings point to EP playing a role in dispersal propensity at spatial and temporal scales, supporting the idea that epigenetically-driven phenotypic plasticity facilitates dispersal and environmental coping in free-living birds.
This repository contains all scripts, code, and data necessary to recreate the analyses presented in Jimeno et al (2025) 'Epigenetic potential and dispersal propensity in a free-living bird: a spatial and temporal approach' in Molecular Ecology.
The analysis is based on called variant and invariant sites from whole genome resequencing data of Ficedula hypoleuca. All raw reads will be made available on the European Nucleotide Archive. Called variants can be recreated using our standardized genotyping pipeline, which is available here.
Here, we will walk through the steps required to recreate the CpG count data and to perform the final analyses from the study.
Data and software files included
cpg_counter_v0.2: Code for the count of CpG positions.
cpg_totaler_v0.1: Code for the count of number of total CpG within the genome.
cpg_totals_array_saga: Code for running the previous scripts as an array on a supercomputer.
EP_CpG_polyCpG_missingness_calculation: Code for the calculation of CpG and polyCpG totals and sequencing missingness.
EP_Annotation_AllSites: Code for the annotation of CpG sites.
EP_Annotation_CpG_polyCpG_plot_stats: Code for the plot and stats performed on the annotation of CpG sites.
EP_Stats_Models: Code for the statistical models testing the association between dispersal categories and CpG count.
EP_dispersal_piefly_Dataset.xlsx: Main dataset with the study variables, including the information to recreate the analyses testing the association between dispersal categories and CpG count. This dataset contains the following sheets: 1) Counts_CpGvariants, showing the number of genome-wide CpG variants per individual; 2) Counts_CpGvariants_prom, showing the number of CpG variants within gene promoters per individual; 3) Counts_CpGtotals, showing the number of total CpG sites (including variant and non-variant sites) per individual; 4) Variable explanation.
Part 1 - Quantifying epigenetic potential
We used a set of bash commands and R scripts that are necessary to quantify the CpG positions across a population genomic dataset. This repository contains the version of the software used for our analyses, but updated and maintained versions of the scripts can be found here. The pipeline is relatively simple, and we provide instructions here on how to use it.
The pipeline depends on a number of other tools, including
To use everything here, you will need the Ficedula albicolis reference genome, a set of genomic data - i.e. variant calls and information on the genome annotation, stored as a gff. The GFF for F. albicolis is available at the same location as the reference genome.
NB: all examples here use variables as placeholders - i.e. $VCF - remember to switch them for your own paths/files
Step 1: Extract flanking sites
In order to identify CpG positions, we need flanking sites from the reference genome. First, we extract the positions of your variant calls:
bcftools query -f '%CHROM\t%POS\n' $VCF > init.bed
We next calculate the positions of the flanking sites - this is very easy with awk:
awk '{print $1"\t"$2-2"\t"$2+1}' init.bed > variant.bed
Finally we use bedtools to extract these regions from the reference genome:
bedtools getfasta -fi $REF -bed variant.bed -tab > cpg_flanking_sites.bed
We now have a file of flanking sites for the downstream analysis.
Step 2: Count CpG positions
In order to count the CpG positions, we use the R script cpg_counter_v0.2.R. It can be used as a commandline tool like so:
Rscript cpg_counter_v0.2.R -v ${VCF} -f cpg_flanking_sites.bed -g ${GFF} -o $cpg.count.gz
Where the options are as follows:
-v- path to vcf-f- path to flanking sites bedfile-g- path to gff-o- output file path
The script is designed to run on a single chromosome or block of a chromosome. It cannot run across the entire genome at once. Therefore it is advised you split the analyses across the chromosomes or genome windows in order to speed it up.
Step 3: Count CpG totals
If you want to calculate the total CpG site numbers, you can use the cpg_totaler_v0.1.R script. This is more lightweight than the CpG counter facility and is designed to run on very large datasets (i.e. all variant and invariant sites in a genome). It's usuage is almost identical to the counting script.
Rscript cpg_totaler_v0.1.R -v ${VCF} -f ${CHR}_cpg_flanking_sites.bed -g ${GFF} -o ${CHR}_cpg.count.gz
Where the options are as follows:
-v- path to vcf-f- path to flanking sites bedfile-g- path to gff-o- output file path
This script should also be run across chromosomes.
Running the scripts as an array on a supercomputer
In order to demonstrate how to run this analysis on a supercomputer, we have included slurm script - cpg_totals_array_saga.slurm. This script is designed to run on a slurm job scheduler across all chromosomes simultaneously. It also incorporates code to extract flanking sites.
Output explained
Step 4: Calculation of CpG, polyCpG counts and missingness per sample
This step should be run on a supercomputer
EP_CpG_polyCpG_missingness_calculation.R
Step 5: Annotation of all sites in the analysis
For this step you are going to need the collared flycatcher reference genome (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000247815.1/)
This step should be run on a supercomputer
This step requires gtfToGenePred and genePredToBed programs (available at https://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads)
EP_Annotation_AllSites.txt
Step 6: Annotation of identified CpGs and polyCpGs, plot, stats
EP_Annotation_CpG_polyCpG_plot_stats.R
Step 7: Testing the association between CpG number and dispersal categories
This code uses a .csv file containing the output of the calculation of CpG, polyCpG, and missingness per sample.
EP_Stats_Models
Samples were collected in population of pied flycatchers (Ficedula hypoleuca) breeding in nest boxes in central Spain (La Hiruela, 41°04’ N, 3°27’ W; 1250 masl), and monitored since 1984. Total genomic DNA was extracted from blood samples using a standard ammonium acetate protocol, and diluted to a working concentration of 25 ng/μL. Besides, on an additional set of samples from 1998 (used in one of the group comparisons), DNA had already been extracted from whole blood using Chelex resin-based extraction.
Whole genome sequencing was performed by Novogene Europe in two sequencing runs, where paired-end Illumina next-generation sequencing (150bp) was carried out on an Illumina NovaSeq 6000 sequencer. Libraries were prepared using either the Novogene NGS DNA Library Prep Set (Cat No.PT004) at Novogene, or small-scale library prep (1/10th standard volumes) for Illumina DNA Prep with IDT index sets, using the Mosquito aliquoting system at the DeepSeq Facility at the University of Nottingham, UK. Both library preparation methods have previously been validated and shown to have comparable sequencing outcomes in another passerine species, the house sparrow (M. Ravinet, unpubl. data).
Genome sequences were aligned against the collared flycatcher (Ficedula albicollis) reference genome (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000247815.1/ ; accession AGTO00000000.2), the closest relative species with a reference genome available. In order to map reads to the collared flycatcher reference genome, we used a custom genotyping pipeline developed in Nextflow (https://github.com/markravinet/genotyping_pipeline). The pipeline is based on a previously used protocol and is designed to go from raw reads to sequence alignment, genotyping and variant filtering in several simple, reproducible steps. Following variant calling, we quantified CpG polymorphism across the genomes of all individuals.
