Data from: Evolutionary rescue of freshwater copepods during historical lake acidification
Data files
Dec 03, 2025 version files 23.41 GB
-
lat_L145_R1_trim.fastq.gz
11.56 GB
-
lat_L145_R2_trim.fastq.gz
11.86 GB
-
README.md
14.25 KB
Abstract
The persistence of populations facing severe environmental disturbance can be enabled by natural selection on heritable phenotypic variation - a process known as evolutionary rescue. Few studies have documented this process in complex natural settings and the long-term outcome of evolutionary rescue. Here, we used copepod resting eggs of Leptodiaptomus minutus from three time periods of lake ecological history, spanning ≈ 200 generations (100 years) in two populations impacted by historical acidification. Whole-genome sequencing of the resting eggs revealed significant allele-frequency shifts associated with acidification followed by pH recovery. We used a resurrection ecology approach to retrace adaptive shifts concomitant with environmental transitions. Copepods from the pre-acidification period were sensitive to acidity, whereas those from the acidification period were adapted to acidic pH. This tolerance was subsequently lost during pH recovery, implying an adaptive reversal. Demographic models indicated a decline during the acidification process, followed by population recovery based on historical data, suggesting that selection led to evolutionary rescue. This study fills a critical knowledge gap about the long-term implications of evolutionary rescue in the wild.
Description of the data and file structure
We used copepod resting eggs of the freshwater species Leptodiaptomus minutus extracted from sediment cores dating from three time periods of lake ecological history, spanning ≈200 generations, in two populations impacted by historical acidification. Whole genome sequencing of the resting eggs was used to detect putative SNPs under selection from the acidification and follow allele frequency shifts associated with the environmental fluctuations. Demographic modelling was applied to the genomic data to retrace changes in genetic diversity related to the acidification. We used a resurrection ecology approach to retrace adaptive shifts concomitant with the environmental transitions.
Files and variables
This repository contains several top-level files and folders, as well as subfolders:
- code_hpc: This folder contains code written to be processed on an HPC cluster.
- code_rscripts: This folder contains code written to be processed with R on a regular computer.
- processed_data: This folder contains data files that have been processed and are ready to use. Generally, these files are the result of running data processing scripts on the code_hpc folder on raw data in the 'raw_data' folder.
- raw_data: This folder contains raw data files, including spreadsheets of raw data.
- Results: The report contains statistical results, tables, and figures.
Computing environment
Bioinformatic analyses were performed using the R programming language and several software packages (discosnp++, fastp, Baypass, dadi). Non-HPC analyses were performed on an Asus VivoBook with an Intel Core i5 processor and 12GB of RAM, within a virtual machine (Oracle VM VirtualBox, 6GB of RAM) running on Ubuntu 20.04.5 LTS. HPC analyses were performed on the Digital Research Alliance of Canada cluster Beluga.
Other data
Raw Illumina reads for this project are available on the NCBI SRA portal under the BioProject ID BioProject PRJNA1229077. We also used Illumina reads from a population (unpublished data, ta), which was generated with 200 taxonomically identified Leptodiaptomus minutus pooled and sequenced with Novaseq 6000, then trimmed with fastp, as an outgroup in the reference-free SNP discovery method with discosnp++. These fastq files are provided in the raw data folder.
List of folders and files
code_hpc:
1_Discosnp:
This folder contains scripts to run the reference-free SNP discovery pipeline implemented in discosnp++, using the raw Illumina paired-end sequencing files from five libraries as input and Illumina reads from a separate population (unpublished data) as an anchor. The files lat_L145_R1_trim.fastq.gz and lat_L145_R2_trim.fastq.gz are provided as raw sequence data. These reads were generated from a pooled sample of 200 taxonomically identified Leptodiaptomus minutus individuals. The sample was sequenced using a NovaSeq 6000 platform, and the resulting reads were quality-trimmed using fastp.
2_Baypass:
This folder contains scripts to analyze the genotyping data files with Baypass v2.2, with the input files being generated with poolfstat v2.1.1. The analysis is run in two parts, first with the core model, and thwith en the Aux model. The input genotyping data files correspond to five sub-dataset Baypass input files generated with the "thinning" subsampling method, with a sub-sample size of 842,255 SNPs. The Baypass program also uses the minutus_killarney_baypas s_input. poolsize (file with pool size for each population), the acid_ecotype and minutuscore_kill_sub*_mat_omega (for the Aux model) as input. For a description of the options used, see the Baypass manual
3_dadi:
This folder contains the code used to run the demographic analyses with dadi v2.3.3 from Python v3.11.5. The codes are based on the well-documented user guide for dadi (https://dadi.readthedocs.io/) This folder is divided into two sub-folders named killarney_1D_*_fixedtime for each population, and each of these folders contains 4 sub-folders for each of the 4 demographic models analyzed: 1) bottleneck and exponential growth (most complex model) with TB is the fixed scaled time between the bottleneck and present, 2) bottleneck only (two epochs), 3) growth only and 4) neutral. These 4 sub-folders are in the order listed below:
· bottlegrowth
· twoepochs
· growth
· neutral
Each of these sub-folders contains a Python file .py used to run the optimization of the parameters for each model and a bash .sh script to run tPythonhon file on a computing cluster. The Python scripts for the optimization consist of several:
1. Data import and size frequency spectrum SFS (fs) generation: pop_id and ns give the population ID and the number of chromosomes included (2X nb of individuals)
2. Define custom demography model: define model parameters (tailored for each model) and output the model SFS.
3. Parameter optimization: this part runs a local optimizer on the log of the parameters based on the defined grid size, the demographic model defined above, initial parameter values p0, and upper and lower bounds of the parameters. p0, the upper and lower bounds of the parameters are modified after each optimization run until the optimization reaches a plateau (log-likelihood of the three best models within 1% of each other).
4. Log-likelihood of the model calculation: outputs the log-likelihood of the model based on the model SFS with the optimized parameters and the observed SFS.
5. Plot: After each optimization run, it outputs a plot of the folded SFS for the observed data and the model (optimized parameters) as well as the residuals of the model.
The input data for these analyses are the dadi_input_Killarney_REC files located in the processed_data > 1_dadi_input sub-folders, which were generated with the R scripts named script_prep_input_dadi located in the code_rscripts > dadi sub-folder.
For each population, an additional sub-folder named bootstrapping_uncertainties contains two .py scripts:
· bootstrapping_.py: code to generate 100 bootstrapped SFS with a chunk size of 100K SNPs.
· uncertainty_GIM_.py: Calculate the uncertainties on the parameters for the best model (to run after finishing the optimizations for all models tested) using the Godambe Information Matrix. This script takes as input the observed SFS generated at the beginning of each of the optimization scripts and the bootstrapped SFS generated with tbootstrapping.py* script. It redefines the best model in the same manner as in the optimization scripts and defines the optimized parameters as input. The scripts output the standard deviations from the GIM, the estimated 95% confidence interval from the GIM, and the upper and lower bounds of the 95% confidence interval of the parameters.
· lrt_test_.py: Performs a likelihood ratio test between two nested models
4_fastp:
This folder contains bash scripts to run fastp v.0.23.4 on the raw Illumina paired-end reads for trimming and QC.
code_rscripts
1_Poolfstat:
This folder contains R scripts to process the vcf output by discosnp++, filter the SNPs, NPs, and generate input files for Baypass, output pairwise FST and heterozygosities, and to detect/plot putSNPsve SNP under selection from the acidification using an FST scan.
2_Baypass:
This folder contains R scripts to process and plot the results output by Baypass. For a description of the variables in the dataset output by Baypass, see the Baypass manual.
3_candidates_SNPs_outliers:
This folder contains one R script, used to check the list of outliers for the Baypass core and Aux models as well as FST scan analyses, and to produce a Venn diagram plot.
4_dadi:
This folder contains R scripts to generate the input data files for dadi named prep_data_dadi_killarney_.R. These R scripts take as input the _freq_pij_outputBaypassAux.out (allele frequency) files generated by Baypass, the snpinfo_killarney_poolfstat_length50_filter.tsv file generated with poolfstat from the pooldata object, and the list of outliers obtained with Baypass core and aux models (outliers_minutus_killarney_baypass_core.tsv, outliers_baypassaux_killarney.tsv) as inputs. The dadi_input_pools function from the genomalicious R package,kage, with the “probs” parameter in the methodSFS, is then used to transform the allele frequency dataset into the SNP data format from dadi.
5_Resurrection_ecology_experiment:
This folder contains R scripts used to process the results of the resurrection ecology experiment and to analyze the 210Pb data with the rplum package.
processed_data
1_dadi_input
This folder contains files that are used as input for the demographic analyses with dadi and are in the SNP data format https://dadi.readthedocs.io/en/latest/user-guide/importing-data/. The files dadi_input_killarney__REC are for each population analyzed, and give the allele count for each allele.
2_Candidates
This folder contains output files of the genome scans analyses with Baypass and of the FST scan
raw_data
1_ Resurrection_ecology_experiment:
This folder contains the raw data file for the resurrection ecology experiment and the files used for the 210Pb analysis with rplum
lat_L145_R1_trim.fastq.gz and lat_L145_R2_trim.fastq.gz
These compressed zip files contain the trimmed paired fastq files for the population used as an anchor in the reference-free SNP discovery method with discosnp++.
results
1_Baypass:
This folder contains subfolders for the results of the Baypass core and Aux models.
2_Candidates:
This folder contains the Venn diagram plot.
3_dadi:
This folder contains the results files output by running the demographic analyses with dadi, organized in subfolders for the 4 models in each population. Each of these sub-folders contains the output files generated by running the optimization of the parameters for each model and population, as well as the accompanying plots of each optimization run. An .ods file summarizes the results of all the optimization runs in each model and population subfolders.
4_Figures_combined_Inkscape:
This folder contains .svg files for various figures from the main manuscript and the supplementary material, which were combined with Inkscape.
5_poolfstat:
This folder contains the pairwise FST matrix and heterozygosities files output by poolfstat and the plots output by the FST scan R script.
6_Resurrection_ecology_experiment:
This folder contains the plots of the resurrection ecology experiment, the raw output of rp, lum, and the plot of the dates output by rplum alon,g with the pH.
Code/software
Code/software
code_hpc:
1_Discosnp:
discosnp++ v2.3.X
· discosnp_killarney_check_b1.sh
2_Baypass:
Baypass v2.2
· Core_model: code files named minutus_killarney_baypass_core_sub12.sh, minutus_killarney_baypass_core_sub34,.sh, and minutus_killarney_baypass_core_sub5.sh to run the Baypass core model from pairs of input genotyping data files.
· aux_model: code files named minkill_aux_sub*.sh to run the Baypass Aux model from individual input genotyping data files.
3_dadi:
dadi v2.1.0
python v3.8.10
Four demographic models were analyzed with dedicated scripts for each population:
· killarney_1D_*_bottlegrowth.py
· killarney_1D_*_growth.py
· killarney_1D_GEO_neutral.py
· killarney_1D_GEO_twoepochs.py
· bootstrapping_.py: code to generate 100 bootstrapped SFS with a chunk size of 100K SNPs
· uncertainty_GIM_.py: Calculate the uncertainties on the parameters for the best model (to run after finishing the optimizations for all models tested) using the Godambe Information Matrix. This script takes as input the observed SFS generated at the beginning of each of the optimization scripts and the bootstrapped SFS generated with the bootstrapping*_*.py script. It redefines the best model in the same manner as in the optimization scripts and defines the optimized parameters as input. The scripts output the standard deviations from the GIM, the estimated 95% confidence interval from the GIM, and the upper and lower bounds of the 95% confidence interval of the parameters.
· lrt_test_*.py: Performs a likelihood ratio test between two nested models
4_fastp:
Fastp v0.23.4
fastp_GEO-ACID.sh, fastp_GEO-REC.sh, fastp_LUM-ACID.sh, fastp_LUM-PRE.sh, fastp_LUM-REC.sh to run fastp on the Illumina paired-end libraries (options -L 50, -q 20, -3) used as input for discosnp++ and downstream analyses.
code_rscripts
R v4.1.2 and v4.4.0
R packages:
poolfstat v2.1.1
data.table v1.14.2
stringr v1.4.0
tidyr v1.2.0
dplyr v1.0.9
ggplot2 v3.3.6
cowplot v1.1.1
gplots v3.1.3.1
corrplot v0.92
venndiagram v1.7.3
genomalicious v0.4
rplum v0.5.1
lubridate v1.9.4
stats v4.4.1
ggpubr v0.4.0
lme4 v1.1.35.4
lattice v0.22-6
bbmle v1.0.25.1
mumin v1.48.4
gridextra v2.3
faraway v1.0.8
ggsignif v0.6.4
fitdistrplus v1.1-8
coefplot2 v0.1.3.3
Access information
Other publicly accessible locations of the data:
- The raw sequencing reads have been deposited in the National Center for Biotechnology Information Sequence Read Archive SRA repository (BioProject PRJNA12,29077), and the accompanying metadata are also stored in the SRA (BioProject PRJNA1229077), using the Eukaryotic water MIxS package https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA1229077
