Limited migration from physiological refugia constrains the rescue of native gastropods facing an invasive predator
Data files
Aug 30, 2024 version files 1.12 GB
-
file_upload_Dryad_MSALAMON_EvoApps.zip
-
README.md
Abstract
Biological invasions have caused the loss of freshwater biodiversity worldwide. The interplay between adaptive responses and demographic characteristics of populations impacted by invasions is expected to be important for their resilience, but the interaction between these factors is poorly understood. The freshwater gastropod Amnicola limosus is native to the Upper St. Lawrence River and distributed along a water calcium concentration gradient within which high-calcium habitats are impacted by an invasive predator fish (Neogobius melanostomus, round goby), whereas low-calcium habitats provide refuges for the gastropods from the invasive predator. Our objectives were to 1) test for adaptation of A. limosus to the invasive predator and the low calcium habitats, and 2) investigate if migrant gastropods could move from refuge populations to declining invaded populations (i.e., demographic rescue), which could also help maintain genetic diversity through gene flow (i.e., genetic rescue). We conducted a laboratory reciprocal transplant of wild F0 A. limosus sourced from the two habitat types (high calcium/invaded and low calcium/refuge) to measure adult survival and fecundity in home and transplant treatments of water calcium concentration (low/high) and round goby cue (present/absent). We then applied pooled whole-genome sequencing of twelve gastropod populations from across the calcium/invasion gradient. We identified patterns of life-history traits and genetic differentiation across the habitats that are consistent with local adaptation to low calcium concentrations in refuge populations and to round goby predation in invaded populations. We also detected restricted gene flow from the low-calcium refugia towards high-calcium invaded populations, implying that the potential for demographic and genetic rescue is limited by natural dispersal. Our study highlights the importance of considering the potentially conflicting effects of local adaptation and gene flow for the resilience of populations coping with invasive predators.
README: Limited migration from physiological refugia constrains the rescue of native gastropods facing an invasive predator
https://doi.org/10.5061/dryad.rxwdbrvjq
Description of the data and file structure
We conducted a laboratory reciprocal transplant of wild F0 Amnicola limosus gastropods sourced from two habitat types (high calcium/invaded and low calcium/refuge) to measure adult survival and fecundity in home and transplant treatments of water calcium concentration (low/high) and round goby cue (present/absent). We then applied pooled whole-genome sequencing of twelve gastropod populations from across the calcium/invasion gradient. The pooled sequencing was supported by a new draft genome for A. limosus. Population genomic analyses (genome scans, analyses of genomic diversity, population structure, demographic modeling) were used downstream after processing the sequencing data with a custom bioinformatic pipeline.
Files and variables
This repository contains several top-level files and folders, as well as subfolders:
- code_hpc: This folder contains code written to be processed on an HPC cluster.
- code_rscripts: This folder contains code written to be processed with R on a regular computer.
- processed_data: This folder contains data files that have been processed and are ready to use. Generally, these files are the result of running data processing scripts on the code_hpc folder on raw data in the 'raw_data' folder.
- raw_data: This folder contains raw data files, including spreadsheets of raw data.
- results: This folder contains statistical results, tables, and figures.
In many cases, the files are named after or contain the names of the 12 populations studied, referred to as RAF, PST, PON, PG, PDC, PB, OKA, IPE, IB, HA, GOY and BEA. In the readme, when files or folder names contain one or multilple population names, these were replaced by asterisks * for simplicity (except for the scripts for the Baypass analyses, where they refer to the replicate numbers).
Computing environment
- Bioinformatic analyses were performed using the R programming language and several softwares (fastp, Baypass, popoolation1 and popoolation2, dadi, poolFreqDiff, BUSCO, Jellyfish, samtools_flagstat). Non-HPC analyses were performed on an Asus VivoBook with an Intel Core i5 processor and 12GB of RAM, within a virtual machine (Oracle VM VirtualBox, 6GB of RAM) running on Ubuntu 20.04.5 LTS. HPC analyses were performed on the Digital Research Alliance of Canada cluster Beluga.
Other data
Raw Illumina reads are available on the NCBI SRA portal under the BioProject ID PRJNA1035459. The draft genome of Amnicola limosus has been deposited on the NCBI Genome database (BioProject PRJNA1136450).
List of folders and files
code_hpc:
1_pipeline_SNPcalling:
This folder contains code to run the bioinformatic pipeline with raw paired-end sequencing files as input and a sync file as output. The pipeline is divided into two parts
- trimming_bwa_sambam_part1: 1) trimming of raw fastq files with the function trim-fastq.pl from Popoolation1 v1.2.2 running with perl v5.30.2, 2) align trimmed reads to draft genome with bwa mem function from bwa v0.7.17, 3) filter aligned reads based on a quality of 20 with samtools view function from samtools v1.13, 4) sort bam files with samtools sort from samtools v1.13. This is done separately for each population with dedicated .sh files named pipeline_pt1_populationname.sh
- mpileup_sync_part2: pipeline_pt2_mpileup_sync.sh file to process the sorted bam files from all populations together. 1) Generate an mpileup file with samtools mpileup from samtools v1.13 using the bam_list_all.txt list of sorted bam files and the draft genome as input, 2) filter SNPs with reads having a coverage of 5 or more across all populations using grep -vw, 3) Create .sync file which is the main input files for downstream analyses and is generated with the function mpileup2sync.pl in Popoolation2 v1.10.03
2_popoolation:
This folder contains code to generate diversity indices with the Variance-sliding.pl function from Popoolation1 v1.2.2. For each population, three separate .sh files are used for each diversity index.
- *diversity.sh: ran first, 1) generate an individual pileup file from the sorted bam file using samtools mpileup from samtools v1.13, 2) calculate Tajima's pi with the sliding window method.
- *_diversity_theta.sh: calculate Watterson's Theta with the sliding window method.
- *_diversity_tajima.sh: calculate Tajima's D with the sliding window method.
3_Baypass:
This folder contains code to analyze the input genotyping data files (amnicola_baypass_input.genobaypass.sub1 to amnicola_baypass_input.genobaypass.sub27) with Baypass v2.2, with the input files being generated with poolfstat v2.1.1 (see poolfstat_amnicola.R code in code_rscripts > Poolfstat subfolder). The analysis is run in two parts, first with the core model, and then the STD model. The input genotyping data files correspond to 27 sub-dataset Baypass input files generated with the "thinning" subsampling method, with a sub-sample size of 750,000 SNPs. The Baypass program also uses the amnicola_baypass_input.poolsize (file with pool size for each population, located in processed_data > Baypass_input) as input. For a description of the options used, see the Baypass manual
- Core_model: This folder contains code files named amni_baypass_core_sub12.sh to amni_baypass_core_sub27.sh to run the Baypass core model from pairs of input genotyping data files.
- STD_model: This folder contains code files named amni_baypass_covis_sub*.sh to run the Baypass STD model from individual input genotyping data files. This analysis requires two additional input files located in processed_data > Baypass_input: 1) contrast_goby gives the round goby presence (1) or absence (-1) status population in order, which is needed to run the contrast analysis to compute the C2 statistic (for the round goby binary covariable), 2) cov_data gives the round goby presence/absence (1st line) then the water calcium concentration (2nd line). This folder contains a subfolder, Replicate_runs with code files named amni_baypass_covis_sub*_rep*.sh to run two additional replicate runs of the Baypass STD model (N= 3 runs in total).
4_dadi:
This folder contains the code used to run the demographic analyses with dadi v2.1.0 from python v3.8. The codes are based on the well-documented user guide for dadi https://dadi.readthedocs.io/[.](https://dadi.readthedocs.io/) This folder is divided into three sub-folders named fixed_bottleneck_time_*_** for each population pair, and each of these fixed_bottleneck_time__* folders contains eight sub-folders for each of the seven demographic models analyzed: 1) bottleneck and growth in both populations (most complex model) with defined effective population sizes after the split (nu1 and nu2), followed by a bottleneck in both populations followed by exponential recovery in both populations. TS is the scaled time between the split and the bottleneck and TB is the scaled time between the bottleneck and present, 2) bottleneck and growth only in the invaded population (constant Ne for the refuge population) with uneven migration, 3) only bottlenecks in both populations without recovery and uneven migration, 4) only bottleneck in the invaded population with uneven migration (constant Ne for the refuge population), 5) a simple population split at TS with uneven migration, 6) a population split with symmetric migration and 6) a population split without migration). These eight sub-folders are in the order listed below:
- twopops_bottlegrowth
- *_bottlegrowth
- twopops_splitmigbottleneck
- splitmigbottleneck*
- splitmig_unevenMig
- splitmig_evenMig
- splitnomig
Each of these sub-folders contains a python file .py used to run the optimization of the parameters for each model and a bash .sh script to run the python file on a computing cluster. The Python scripts for the optimization consists of several parts:
- Data import and size frequency spectrum SFS (fs) generation: pop_ids and ns give the population pair ID and the number of chromosomes included (2X nb of individuals). The fs is masked for the first 5 alleles.
- Define custom demography model: define model parameters (tailored for each model) and output the model SFS.
- Parameter optimization: this part runs a local optimizer on the log of the parameters based on the defined grid size, the demographic model defined above, initial parameter values p0, and upper and lower bounds of the parameters. p0, the upper and lower bounds of the parameters are modified after each optimization run until the optimization reaches a plateau (log-likelihood of the three best models within 1% of each other).
- Log-likelihood of the model calculation: outputs the log-likelihood of the model based on the model SFS with the optimized parameters and the observed SFS.
- Plot: After each optimization run, outputs a plot of the folded joint SFS for the observed data and the model (optimized parameters) as well as the residuals of the models.
The input data for these analyses are the dadi_input_filtercov__ files located in the processed_data > dadi_input sub-folders, which were generated with the R scripts named script_prep_input_dadi__ located in the code_rscripts > dadi sub-folder.
For each population pair, an additional sub-folder named bootstrapping_uncertainties, contains two .py scripts:
- bootstrapping__**.py: code to generate 100 bootstrapped SFS with a chunk size of 100K SNPs and masking the first 5 alleles in each bootstrapped SFS.
- uncertainty_GIM_*.py: Calculate the uncertainties on the parameters for the best model (to run after finishing the optimizations for all models tested) using the Godambe Information Matrix. This script takes as input the observed SFS generated at the beginning of each of the optimization scripts and the bootstrapped SFS generated with the bootstrapping.py script. It redefines the best model in the same manner as in the optimization scripts and defines the optimized parameters as input. The scripts output the parameters standard deviations from the GIM, the estimated 95% confidence interval from the GIM, and the upper and lower bounds of the 95% confidence interval of the parameters.
5_poolFreqDiff:
This folder contains two .sh scripts to run the poolFreqDiff analysis, one for each environmental covariable analysis (independent analyses for the calcium concentration and round goby presence/absence status). This analysis takes modified .sync files as input, which were reorganized from the initial .sync file output at the end of the bioinformatic pipeline described above, by alternating one population from a low-calcium habitat (or round goby absent) and one population from a high-calcium habitat (or round goby present). The parameters for the scripts are -npops = 6 for six populations in each environmental level (low-calcium vs high-calcium or round goby absent vs round goby present) defined by -nlevels = two levels of environmental characteristics possible, -n = 40 individuals, -mincnt = 2 for a minimum read count of two per alleles, -minc = 5 reads minimum per SNP per population, -maxc = 300 reads maximum per SNP per population, -rescale = neff to rescale the allele counts to an effective sample size neff as defined in (Feder et al., 2012) and -zeroes = 1 to add one to zero count cells.
6_BUSCO:
This folder contains one .sh script to run the BUSCO analysis on the A. limosus draft genome. It uses the --offline and --mode = genome options, and takes as input the mollusca_odb10 lineage dataset which was pre-downloaded from https://busco-data.ezlab.org/v5/data/lineages/ in 2020 with the 2020-08-05 time stamp.
7_fastp:
This folder contains three .sh scripts to run fastp v.0.23.4 on the three 10X paired-end reads used for the assembly of the draft genome, to obtain the insert size, using the -Q option to disable quality filtering.
8_Jellyfish:
This folder contains one .sh script jellyfish_amnicola_illumina_3pops_R1_untrimmed.sh to estimate the genome size with Jellyfish v2.3.0. It uses as input the R1 reads of the three runs of 10X sequencing simultaneously (-F 3) using the options count 21-mers (-m 21), a hash of 2 billion elements (-s 2G), and 20 threads (-t 20). The Jellyfish output file .jf is then converted to a histogram of the k-mer occurrences in a .histo file. The resulting histogram was processed online with GenomeScope http://qb.cshl.edu/genomescope/ to obtain an estimate of the haploid genome length, the % of repeats, and the % of heterozygosity.
9_samtools_flagstat:
This folder contains one .sh script, flagstat_amnicola_sortedbam.sh to estimate the percentage of Illumina reads aligned to the draft reference genome. This was done separately for each population using the samtools flagstat function from samtools v1.18, with the _sorted.bam files as input obtained by running the bioinformatic pipeline in the 1_pipeline_SNPcalling folder.
code_rscripts
1_Poolfstat:
This folder contains three R scripts to generate input files for other downstream analyses and to plot/assess patterns of isolation by distance IBD and isolation by environment IBE:
- poolfstat_amnicola.R: This script generates input files for other downstream analyses by converting the All.sync file output from the bioinformatic pipeline into a pooldata object, using pool sizes of 80 for all populations, a mean read count min.rc = 2, a minimum coverage per pool of min.cov.per.pool = 5, a maximum coverage per pool max.cov.per.pool = 300, a minimum minor allele frequency of min.maf = 0.0125, and removing indels. The scripts then generate input files for Baypass with the pooldata2genobaypass() function, creating 27 sub-datasets of 778,846 SNPs with the thinning method. An allele frequency file freq_amnicola.tsv is generated by dividing the reference allele (determined arbitrarily) read count by the coverage for each SNP, which is needed as input to be converted in a dadi-compatible input format. The script also generates the snpinfo_amnicola_poolfstat.tsv and coverage_amnicola_poolfstat.tsv files needed as input for other data processing scripts. Finally, the script outputs the pairwise fst matrix FSTpoolfstat_matrix.tsv file and the population-specific heterozygosities pop_heterozygosity.tsv files.
- poolfstat_amnicola_nooutliers.R: This script similarly generates input files for other downstream analyses, but removes the outliers identified with Baypass and poolFreqDiff. It takes as input the Chr_pos file name of the scaffold + position on the scaffold as a single column separated by a dot, next to which is added the numbers 1 to 21,312,700 as an index. Then the input files outliers_baypasscore.tsv, outliers_baypassSTD.tsv, outliers_baypasscontrast.tsv, outliers_poolfreqdiff_calcium.tsv, and the outliers_poolfreqdiff_goby.tsv are imported, and the SNPs with matching scaffold + position on scaffold identified as outliers are removed. Finally, the output files are generated with the pooldata.subset() function.
- results_IBD_IBE_plots_neutralSNPs.R: This script is used to plot pairwise FST, assess IBD and IBE with Mantel tests, and plot the regression of the genetic distance (linearized FST with the formula fst/(1-fst) as in Rousset 1997) as a function of geographical or environmental distances. It takes as input and generates several processed files located in the processed_data > Poolfstat_input folder. The figure IBD_IBE_amnicola.pdf is output as the plot of the IBD and IBE.
- FSTpairwise_triangularmatrix: lower pairwise FST matrix
- FSTpairwise_triangularmatrix_ordered.csv: reordered lower pairwise FST matrix by habitat type
- FSTpairwise_fullmatrix.csv: output by the poolfstat_amnicola_nooutliers.R script, modified with the melt() function to create the IBD_fst_amnicola.tsv
- IBD_fst_amnicola.tsv: file generated above, to which the distances calculated along rivers with Google Earth (in m) are added.
- logdistance_amnicola.csv: full matrix of the distances calculated along rivers with Google Earth (in m) converted to a log (ln) scale.
- linearizedfst_fullmatrix_amnicola.csv: full matrix output by the poolfstat_amnicola_nooutliers.R script, then converted to linearized FST with the formula fst/(1-fst).
- env_inputdata_IBE_amnicola.csv: three-column file with the population name, the round goby presence/absence status, and the calcium concentration, which is then converted to a Mahalanobis distance full matrix and saved as env_Mahalanobisdist_amnicola.tsv
- env_Mahalanobisdist_amnicola.tsv: used to do the Mantel test for IBE along with the linearizedfst_fullmatrix_amnicola.csv input file and to do a Mantel test using the distance_amnicola.csv file of distances calculated along rivers with google earth (in m) to check for a correlation between environmental and geographical distance.
- IBE_fst_amnicola.csv: generated by adding the Mahalanobis distance to the file IBD_fst_amnicola.tsv mentioned above.
2_Baypass:
This folder contains three R scripts to process and plot the results output by Baypass. For a description of the variables in the dataset output by Baypass, see the Baypass manual
- amnicola_results_baypassouput_core.R: This R script is used to process the output of the core model. It takes amnicore_sub*_*mat_omega.out files as input to produce the heatmap and hierarchical clustering tree plot. The file amnicore_all_summary_pi_xtx.out outputted by Baypass is used as input for the rest of the processing, which was obtained by concatenating all the summary_pi_xtx.out files from each sub-sample and removing all the headers of the sub-sample output files except the first one. It also takes the amnicola_baypass_input.snpdet.core.all file as input, which gives information on the location of the SNP on the genome (scaffold and position on scaffold), the identity of the reference allele (defined arbitrarily), and of the minor allele, and was similarly generated by concatenating all the amnicola_baypass_input.snpdet.sub files output by Baypass for each sub-sample (no header). It outputs the list of outliers for the core model defined with the recalculated p-values of the estimated XtX as p-value <= 0.001
- amnicola_results_baypassouput_STD_contrast.R: This R script is used to process the output of the Baypass STD model for each independent run and has one part for the results of the association with calcium concentration and a second part for the contrast analysis (round goby presence/absence status).
- Part 1: takes as input the amniSTDcontrast_rep*_summary_BFis.out output file, which was generated by combining all the *summary_betai_reg.out files from each sub-dataset into one result file, then only the columns keeping the columns COVARIABLE, MRK, and BF.db. It also takes as input the amnicola_baypass_input.snpdet.core.all file. The script calculates the median of the BFis for the three runs and outputs a list of outlier SNPs for the association with calcium when the median BFis is > 20
- Part 2: takes as input the amniSTDcontrast_all_summary_betai_reg.out and amniSTDcontrast_all_summary_contrast.out output files, which were generated by combining all the summary_betai_reg.out or summary_contrast.out files from each sub-dataset into one result file, and removing all the headers of the sub-sample output files except the first one. It also takes as input the amnicola_baypass_input.snpdet.core.all file. The script applies the FDR procedure with the package q-value on the back-transformed p-values (initially on the -log10 scale), with an alpha level of 0.01. It outputs a list of outlier SNPs for the association with goby predation when the q-value is below alpha.
- check_convergence_omega_STD.R: This R script is used to assess the convergence of MCMC for the STD model in Baypass by comparing the omega matrices of the three independent runs for each subsample and calculating the Förstner and Moonen distance FMD distance. It takes as input the amniSTDcontrast_sub*_mat_omega.out (rep run #1), amnicovis_sub*_rep1_mat_omega.out (rep run #2), and amnicovis_sub*_rep2_mat_omega.out (rep run #3) files as input and calculates the FMD distance for all the omega matrices, which are then collated in a table. The median FMD distance was calculated from that table.
3_Poolfreqdiff:
This folder contains one Amnicola_results_poolfreqdiff_goby_calcium.R script to process and plot the results of the poolFredDiff analyses for the association of SNPs allele frequency with the calcium concentration and the round goby presence/absence status. It takes as input the amni_poolfrediff_goby.out and amni_poolfrediff_calcium.out files. The script recalibrates p-values with a genomic inflation factor gif (Francois 2016) value of 0.85, then applies a FDR procedure on the recalibrated p-values with the q-value package. The results are saved as data tables in the results_poolfreqdiff_amnicola_goby_qval.tsv and results_poolfreqdiff_amnicola_calcium_qval.tsv files. Two lists of outliers are generated by filtering SNPs with q-value below the alpha level 0.01 and saved as outliers_poolfreqdiff.tsv and outliers_poolfreqdiff_calcium.tsv files.
4_candidates_SNPs_outliers:
This folder contains one R script, results_oultiers_candidate_snps_Baypass_poolfreqdiff.R, used to check the list of outliers output from the R scripts used to process the results of the genome scans analyses (Baypass core and STD models, poolFredDiff analyses), and to produce Venn diagram plots. It takes as input the outliers_baypasscore.tsv, outliers_baypassSTDcalcium_repruns.tsv, outliers_baypasscontrast.tsv, outliers_poolfreqdiff_calcium.tsv and outliers_poolfreqdiff_goby.tsv files.
5_dadi:
This folder contains three R scripts to generate the input data files for dadi named script_prep_input_dadi__.R. These R scripts take as input three files generated with poolfstat from the pooldata object, using the poolfstat_amnicola.R script in the **code_rscripts > 6_Poolfstat subfolder. The allele frequency file freq_amnicola.tsv reduced to two columns (only keeping the allele frequency for the population pair of interest), the snpinfo_amnicola_poolfstat.tsv and the coverage_amnicola_poolfstat.tsv files are needed as inputs. Starting from the initial SNPs dataset output by poolfstat (21,312,700 SNPs), the script includes several filtering steps: 1) SNPs were retained if they fell within the 1st and 3rd quartiles of coverage in both populations, 2) filter out SNPs that were detected as outliers (putatively under selection) in the Baypass (core and aux or STD models) and poolFreqDiff analyses and 3) removed uninformative SNPs (fixed or lost in both populations). To accommodate for large computation time during the optimization, the datasets are also thinned at random to retain a final dataset of ≈ 1 million SNPs per population pair. The dadi_input_pools function from the genomalicious R package with the “probs” parameter in the methodSFS option is then used to transform the allele frequency dataset into the SNP data format from dadi.
6_Figures:
This folder contains two R scripts to generate plots.
- bitplot_baypass_poolfreqdiff.R: This script takes as input the output files from the amnicola_results_baypassouput_STD_contrast.R and the Amnicola_results_poolfreqdiff_goby_calcium.R scripts, which generated the results_amniSTDcalcium_repruns.tsv, results_amniSTD_contrast_goby.tsv, results_poolfreqdiff_amnicola_calcium_qval.tsv and results_poolfreqdiff_amnicola_goby_qval.tsv data tables. This scripts generate the biplots results with a) the q-value from poolFreqDiff for the calcium vs the BFis from STD analysis for calcium and b) the q-value from poolFreqDiff for the round goby predation status vs q-values from the Baypass contrast analysis for the goby round goby predation status. It outputs the biplot_baypass_poolfreqdiff.png plot.
- snps_outliers_allele_frequency.R: This script is used to transform the column M_P (mean of the posterior distribution of the αij parameter which is closely related to the frequency of the reference allele) from the concatenated Baypass core model output amnicore_sub_summary_yij_pij.out files (removing all the headers of the sub-sample output files except the first one) into a data frame giving the allele frequency for all populations and SNPs (sub-datasets combined). Then this allele frequency table is combined with the SNP information from the amnicola_baypass_input.snpdet.core.all file. It takes as input the outlier files from the STD model in Baypass and the poolFreqDiff analyses: outliers_baypasscore.tsv, outliers_baypassSTDcalcium_repruns.tsv, outliers_baypasscontrast.tsv, outliers_poolfreqdiff_calcium.tsv and outliers_poolfreqdiff_goby.tsv. The scripts plot the allele frequency of outlier SNPs in common between poolFreqDiff and Baypass for the calcium or the round goby predation status, and uniquely associated with each covariable, as well as the allele frequency of the non-outliers (not outlier for both the poolFreqDiff and Baypass analyses and thinned). Plots are generated for each outlier SNPs depending if the allele frequency is increasing or decreasing with the covariable (water calcium concentration gradient and goby abundance).
7_Popoolation:
This folder contains one R script, results_diversity_popoolation.R, used to process the genomic diversity indices (Tajima's pi, Watterson's Theta, Tajima's D) output by Popoolation1 and of the heterozygosity output by poolfstat. It takes as input the .pi, *.theta, and the *.D files for each population the and pop_heterozygosity.csv file in the **processed_data > Diversity_indices_input* folder. The *.pi, *.theta, and the *.D data frames are then filtered to remove windows with zero coverage (V3 column) and a minimum fraction covered < 0.05. The data frames from all populations are then combined for each diversity indices, and characterized by the habitat type: goby present or goby absent. Hedge's G for different sample sizes is computed for each diversity index and habitat type to obtain the effect size of the goby predation status. Violin plots for each diversity index and colored by habitat type are generated. For the population's heterozygosity, effect sizes of the goby predation status are computed as Cohen's D and plotted by habitat type. All the plots are then combined and output as the popoolation_heterozygosity_diversity.pdf file.
8_Transplant_experiment:
This folder contains one R script, script_amnicola_exp.R, used to process the results of the reciprocal transplant experiment. It takes as input the file Amnicola_experiment.csv located in the subfolder raw_data > transplant_experiment. This script is used to remove data from combo water treatments as mortality was too high (mentioned in the paper and presented in the supplementary material, results for combo water plotted at the end of the script), calculate survival and fecundity, and prepare the data table for plotting. The script then presents the statistical analysis for the survival (GLMM) and the fecundity (GLM). Raw data and results of the models are plotted with ggplot and combined to output Figure 2.
processed_data
1_Baypass_input:
This folder contains three files that are used as input for the Baypass analysis with the core and STD model. amnicola_baypass_input.poolsize gives the pool size for each population, contrast_goby gives the round goby presence (1) or absence (-1) status population in order, and cov_data gives the round goby presence/absence (1st line) then the water calcium concentration (2nd line).
2_dadi_input
This folder contains three files that are used as input for the demographic analyses with dadi and are in the SNP data format https://dadi.readthedocs.io/en/latest/user-guide/importing-data/. The files dadi_input_filtercov__ are for each population pair analyzed, and give the allele count for each allele. These input files were generated with the script_prep_input_dadi__.R scripts located in the **code_rscripts > 5_dadi subfolder
3_Diversity_indices_input:
This folder contains the files .pi, *.theta, and the *.D for each population and the pop_heterozygosity.csv file, which were output by the *_diversity.sh, *_diversity_theta.sh and *_diversity_tajima.sh scripts in the **code_hpc > 2_popoolation **subfolder and the *poolfstat_amnicola.R script in the code_rscripts > 1_Poolfstat subfolder. They are used as input for the R script results_diversity_popoolation.R located in the code_rscripts > 7_Popoolation subfolder. The files *.pi, *.theta, and the *.D have 5 columns:
- Column 1: Scaffold
- Column 2: Position on the scaffold (sliding window of 50Kb)
- Column 3: Number of SNPs in the sliding window
- Column 4: fraction of the window covered by a sufficient number of reads (higher than min coverage and lower than max coverage)
- Column 5: diversity index
4_Poolfstat_input:
This folder contains 14 files which are used as input and are outputs of the results_IBD_IBE_plots_neutralSNPs.R script in the code_rscripts > 1_Poolfstat subfolder:
- distance_amnicola.csv: full matrix file of distances calculated along rivers with google earth (in m)
- FSTpoolfstat_matrix_neutralSNPs.ods: spreadsheet document with the pairwise FST matrix for neutrals SNPs (generated with the poolfstat_amnicola_nooutliers.R script in the code_rscripts > 1_Poolfstat subfolder) from which multiple input files found in this subfolder were generated.
- FSTpairwise_fullmatrix.csv: output by the poolfstat_amnicola_nooutliers.R script, modified with the melt() function to create the IBD_fst_amnicola.tsv.
- FSTpairwise_triangularmatrix: lower pairwise FST matrix
- FSTpairwise_triangularmatrix_ordered.csv: reordered lower pairwise FST matrix by habitat type
- linearizedfst_fullmatrix_amnicola.csv: full matrix output by the poolfstat_amnicola_nooutliers.R script, then converted to linearized FST with the formula fst/(1-fst).
- linearizedfst_fullmatrix_amnicola_nooutliers: file used for the Mantel test where the effect of removing the pairs PB-IPE and GOY-BEA considered as outliers is tested
- logdistance_amnicola.csv: full matrix of the distances calculated along rivers with google earth (in m) converted to a log (ln) scale.
- logdistance_amnicola_nooutliers: file used for the Mantel test where the effect of removing the pairs PB-IPE and GOY-BEA considered as outliers is tested
- IBD_fst_amnicola.tsv: file generated above, to which the distances calculated along rivers with Google Earth (in m) are added.
- env_inputdata_IBE_amnicola.csv: three-column file with the population name, the round goby presence/absence status, and the calcium concentration, which is then converted to a Mahalanobis distance full matrix and saved as env_Mahalanobisdist_amnicola.tsv
- env_Mahalanobisdist_amnicola.tsv: used to do the Mantel test for IBE along with the linearizedfst_fullmatrix_amnicola.csv input file and to do a Mantel test using the distance_amnicola.csv file of distances calculated along rivers with google earth (in m) to check for a correlation between environmental and geographical distance.
- IBE_fst_amnicola.csv: generated by adding the Mahalanobis distance to the file IBD_fst_amnicola.tsv mentioned above.
5_Candidates
This folder contains files of the outliers of the Baypass and poolFreqDiff generated with the amnicola_results_baypassouput_core.R, amnicola_results_baypassouput_STD_contrast.R scripts in the code_rscripts > 2_Baypass subfolder and the Amnicola_results_poolfreqdiff_goby_calcium.R script in the code_rscripts > 3_poolFreqDiff subfolder: outliers_baypasscore.tsv, outliers_baypassSTDcalcium_repruns.tsv, outliers_baypasscontrast.tsv, outliers_poolfreqdiff_calcium.tsv and outliers_poolfreqdiff_goby.tsv. These files are input for the results_oultiers_candidate_snps_Baypass_poolfreqdiff.R script in the code_rscripts > 4_candidates_SNPs_outliers subfolder.
raw_data
1_transplant_experiment:
This folder contains the raw data file Amnicola_experiment.csv for the reciprocal transplant experiment on Amnicola limosus snails sampled in the wild, which gives for each population and treatment tested the survival and number of eggs found at different time points throughout the experiment. In that data table, the following variables are recorded:
- Pop: name of the populations of origin of the snails
- origin_water: rivers where the populations were sampled. SL refers to the St. Lawrence River (high calcium and round goby present, except for PDC which is high calcium but round goby absent) and OR to the Ottawa River (low calcium and round goby absent, except for RAF which is low calcium but round goby present)
- Pot: refers to the population of origin and the water treatment. BW is for "brown water", which was sampled in the Ottawa River, GW is for "green water", which was sampled in the St. Lawrence River, CW is for "combo water" reffering to the artificial growth medium Combo without the addition of calcium, HCW is for "high combo water" reffering to the artificial medium Combo with the addition of calcium, GP is for "goby present", which correspond to the addition of goby cues to the water treatment. When GP does not appears, it means goby cues were not added.
- Traitement: treatment water, see explanation above.
- goby_cue: goby cue treatment, for GP, see above, NOGP means goby cues were not added.
- initial_nb_ind: initial number of individuals at the starting date of the experiment.
- Prise_1 to Prise_4: X note if data were recorded in a given pot at a given date.
- date1 to date4: date at which the data was recorded.
- NB_IND1 to NB_IND4: number of individuals found alive in the treatment pot at the recording date.
- NB_EGGS1 to NB_EGGS4: number of eggs found in the treatment pot at the recording date.
results
1_Baypass:
This folder contains two subfolders for the results of the Baypass core and STD models.
- Core: This folder contains all the omega matrix files amnicore_sub*_mat_omega output by the core model, and three plots generated with the amnicola_results_baypassouput_core.R script in the code_rscripts > 2_Baypass subfolder.
- STD_BFis_contrast: This folder contains one plot generated with the amnicola_results_baypassouput_STD_contrast.R script in the code_rscripts > 2_Baypass subfolder.
2_Candidates:
This folder contains the Venn diagram plots generated with the results_oultiers_candidate_snps_Baypass_poolfreqdiff.R script in the code_rscripts > 4_candidates_SNPs_outliers subfolder, and the allele frequency plot generated with the snps_outliers_allele_frequency.R script in the code_rscripts > 6_Figures subfolder.
3_dadi:
This folder contains the results files output by running the demographic analyses with dadi. Within the subfolder for each population pair analyzed, there is an information file info_data_input on the input dataset for dadi for each population pair analyzed, as well as the uncertainties_twopopsbottlegrowth__ file generated with the uncertainty_GIM_.py script located in the **code_hpc > 4_dadi* subfolder and subfolders within. This folder is divided into three sub-folders named fixed_bottleneck_time__**__* for each population pair, and each of these fixed_bottleneck_time_*_* folders contains eight sub-folders for each of the seven demographic models analyzed (see above).
- twopops_bottlegrowth
- *_bottlegrowth
- twopops_splitmigbottleneck
- splitmigbottleneck*
- splitmig_unevenMig
- splitmig_evenMig
- splitnomig
Each of these sub-folders contains the output files generated by running the optimization of the parameters for each model and population pair, as well as the accompanying plots of each optimization run. A .ods file summarizes the results of all the optimization runs in each model and population pair subfolders.
4_Figures_combined_Inkscape:
This folder contains seven .svg files for various figures from the main manuscript and the supplementary material which were combined or modified with Inkscape.
5_genome_processing:
This folder contains the output files obtained by running the scripts in the code_hpc > 6_BUSCO, code_hpc > 7_fastp, code_hpc > 8_Jellyfish, and the code_hpc > 9_samtools_flagstat subfolders.
- 1_BUSCO: contains the output file from the BUSCO analysis
- 2_fastp: contains the output files from running fastp on the three 10X paired-end reads used for the assembly of the draft genome, to obtain the insert size, using the -Q option to disable quality filtering.
- 3_fastqc_illumina_reads: contains two subfolders, raw and trim, which each contain the fastqc.html file output by fastqc for the raw and trimmed Illumina read for each population and R1/R2 reads.
- 4_flagstat: contains the output files obtained by running the flagstat_amnicola_sortedbam.sh script. There are 12 .csv output files for each population, and one .ods file with the combined information for all the populations
- 5_jellyfish: contains the subfolder 10X_allruns_R1, in which is located the amnicola_10X_allruns_R1.histo output file of Jellyfish, and the plot and results output from running the GenomeScope program online.
6_poolFreqDiff:
This folder contains two subfolders, calcium, and goby_predation, which each have two .pdf plot files output by running the Amnicola_results_poolfreqdiff_goby_calcium.R script in the code_rscripts > 3_Poolfreqdiff subfolder.
7_poolfstat:
This folder contains two plot .pdf files generated by running the results_IBD_IBE_plots_neutralSNPs.R script in the code_rscripts > 1_Poolfstat subfolder.
8_popoolation:
This folder contains one .pdf plot file obtained by running the R script, results_diversity_popoolation.R script in the code_rscripts > 7_Popoolation subfolder.
9_Transplant_experiment:
This folder contains the plots generated by running the script_amnicola_exp.R in the code_rscripts > 8_Transplant_experiment subfolder
Code/software
Code/software
code_hpc:
1_pipeline_SNPcalling:
Popoolation1 v1.2.2
perl v5.30.2
bwa v0.7.17
samtools v1.13
- trimming_bwa_sambam_part1: 1) trimming of raw fastq files with the function trim-fastq.pl from Popoolation1 v1.2.2 running with perl v5.30.2, 2) align trimmed reads to draft genome with bwa mem function from bwa v0.7.17, 3) filter aligned reads based on a quality of 20 with samtools view function from samtools v1.13, 4) sort bam files with samtools sort from samtools v1.13. This is done separately for each population with dedicated .sh files named pipeline_pt1_populationname.sh
- mpileup_sync_part2: pipeline_pt2_mpileup_sync.sh file to process the sorted bam files from all populations together. 1) Generate an mpileup file with samtools mpileup from samtools v1.13 using the bam_list_all.txt list of sorted bam files and the draft genome as input, 2) filter SNPs with reads having a coverage of 5 or more across all populations using grep -vw, 3) Create .sync file which is the main input files for downstream analyses and is generated with the function mpileup2sync.pl in Popoolation2 v1.10.03
2_popoolation:
Popoolation1 v1.2.2
- *_diversity.sh: ran first, 1) generate an individual pileup file from the sorted bam file using samtools mpileup from samtools v1.13, 2) calculate Tajima's pi with the sliding window method.
- *_diversity_theta.sh: calculate Watterson's Theta with the sliding window method.
- *_diversity_tajima.sh: calculate Tajima's D with the sliding window method.
3_Baypass:
Baypass v2.2
- Core_model: code files named amni_baypass_core_sub12.sh to amni_baypass_core_sub27.sh to run the Baypass core model from pairs of input genotyping data files.
- STD_model: code files named amni_baypass_covis_sub*.sh to run the Baypass STD model from individual input genotyping data files. This analysis requires two additional input files located in processed_data > Baypass_input: 1) contrast_goby gives the round goby presence (1) or absence (-1) status population in order, which is needed to run the contrast analysis to compute the C2 statistic (for the round goby binary covariable), 2) cov_data gives the round goby presence/absence (1st line) then the water calcium concentration (2nd line). The subfolder Replicate_runs has code files named amni_baypass_covis_sub*_rep*.sh to run two additional replicate runs of the Baypass STD model (N= 3 runs in total).
4_dadi:
dadi v2.1.0
python v3.8
Seven demographic models were analyzed with dedicated scripts for each population pair:
- amnicola_timefixed_twopops_splitmigbottlegrowth.py: model of bottleneck and growth in both populations (most complex model) with defined effective population sizes after the split (nu1 and nu2), followed by a bottleneck in both populations followed by exponential recovery in both populations. TS is the scaled time between the split and the bottleneck and TB is the scaled time between the bottleneck and present,
- amnicola_demo_dadi_splitmigbottlegrowth*.py: model ofbottleneck and growth only in the invaded population (constant Ne for the refuge population) with uneven migration
- amnicola_timefixed_twopops_splitmigbottleneck.py: model ofonly bottlenecks in both populations without recovery and uneven migration
- amnicola_timefixed_splitmigbottleneck*.py: model ofonly bottleneck in the invaded population with uneven migration (constant Ne for the refuge population)
- amnicola_dadi_splitmig_unevenMig.py: model of a simple population split at TS with uneven migration
- amnicola_dadi_splitmig_evenMig.py: model ofa population split with symmetric migration
- amnicola_dadi_splitnomig.py: model of population split without migration
- bootstrapping__.py: code to generate 100 bootstrapped SFS with a chunk size of 100K SNPs and masking the first 5 alleles in each bootstrapped SFS.
- uncertainty_GIM__.py: Calculate the uncertainties on the parameters for the best model (to run after finishing the optimizations for all models tested) using the Godambe Information Matrix. This script takes as input the observed SFS generated at the beginning of each of the optimization scripts and the bootstrapped SFS generated with the bootstrapping__**.py script. It redefines the best model in the same manner as in the optimization scripts and defines the optimized parameters as input. The scripts output the parameters standard deviations from the GIM, the estimated 95% confidence interval from the GIM, and the upper and lower bounds of the 95% confidence interval of the parameters.
5_poolFreqDiff
poolFreqDiff (no version)
poolfreqdiff_amnicola_calcium.sh and poolfreqdiff_amnicola_goby.sh scripts to run the poolFreqDiff analysis, one for each environmental covariable analysis (independent analyses for the calcium concentration and round goby presence/absence status). This analysis takes modified .sync files as input, which were reorganized from the initial .sync file output at the end of the bioinformatic pipeline described above, by alternating one population from a low-calcium habitat (or round goby absent) and one population from a high-calcium habitat (or round goby present). The parameters for the scripts are -npops = 6 for six populations in each environmental level (low-calcium vs high-calcium or round goby absent vs round goby present) defined by -nlevels = two levels of environmental characteristics possible, -n = 40 individuals, -mincnt = 2 for a minimum read count of two per alleles, -minc = 5 reads minimum per SNP per population, -maxc = 300 reads maximum per SNP per population, -rescale = neff to rescale the allele counts to an effective sample size neff as defined in (Feder et al., 2012) and -zeroes = 1 to add one to zero count cells.
6_BUSCO:
Busco v5.2.2
script_amnicola_busco to run the BUSCO analysis on the A. limosus draft genome. It uses the --offline and --mode = genome options, and takes as input the mollusca_odb10 lineage dataset which was pre-downloaded from https://busco-data.ezlab.org/v5/data/lineages/ in 2020 with the 2020-08-05 time stamp.
7_fastp:
Fastp v0.23.4
fastp_Amnicola_Limosa_run333_Lane2.sh, fastp_Amnicola_Limosa_run399_Lane2.sh, fastp_Amnicola_Limosa_run399_Lane3.sh to run fastp on the three 10X paired-end reads used for the assembly of the draft genome, to obtain the insert size, using the -Q option to disable quality filtering.
8_Jellyfish:
Jellyfish v2.3.0
jellyfish_amnicola_illumina_3pops_R1_untrimmed.sh to estimate the genome size with Jellyfish v2.3.0. It uses as input the R1 reads of the three runs of 10X sequencing simultaneously (-F 3) using the options count 21-mers (-m 21), a hash of 2 billion elements (-s 2G), and 20 threads (-t 20). The Jellyfish output file .jf is then converted to a histogram of the k-mer occurrences in a .histo file. The resulting histogram was processed online with GenomeScope http://qb.cshl.edu/genomescope/ to obtain an estimate of the haploid genome length, the % of repeats, and the % of heterozygosity.
9_samtools_flagstat:
Samtools v1.18
flagstat_amnicola_sortedbam.sh to estimate the percentage of Illumina reads aligned to the draft reference genome. This was done separately for each population using the samtools flagstat function from samtools v1.18, with the _sorted.bam files as input obtained by running the bioinformatic pipeline in the **1_pipeline_SNPcalling* folder.
code_rscripts
R v4.1.2 and v4.4.0
R packages:
Poolfstat v2.1.1
Data.table v1.14.2
Stringr v1.4.0
Tidyr v1.2.0
Dplyr v1.0.9
Ggplot2 v3.3.6
Cowplot v1.1.1
Reshape2 1.4.4
Ggpubr v0.4.0
Effsize v0.8.1
Ecodist v2.0.9
Vegan v2.6-2
ade4 v1.7-19
ape v5.5
ggtree v3.0.2
stats 4.4.0
RColorBrewer v1.1-3
R.utils v2.12.3
Corrplot v0.92
Gplots v3.1.3.1
Qvalue v2.26.0
VennDiagram v1.7.3
Genomalicious v0.4
Ggrepel v0.9.5
Hmisc v5.1-3
lme4 v1.1-35.4
lattice v0.22-6
MASS v7.3-6.1
Bbmle v1.0.25.1
MuMIn v1.48.4
gridExtra v2.3
faraway v1.0.8\
sjPlot v2.8.16
sjlabelled v1.2.0
sjmisc v2.8.10
Access information
Other publicly accessible locations of the data:
- The draft genome of Amnicola limosus has been deposited on the NCBI Genome database (BioProject PRJNA1136450): https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1136450
- The raw sequencing reads have been deposited in the National Center for Biotechnology Information Sequence Read Archive SRA repository (BioProject PRJNA1035459) and the accompanying metadata are also stored in the SRA (BioProject PRJNA1035459), using the Eukaryotic water MIxS package https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1035459
Methods
1. Study system
Gastropods have been widely used to study adaptation in response to predation (Brookes & Rochette, 2007; Hooks & Padilla, 2021), with abiotic factors such as calcium concentration modulating this response through changes in shell morphology and behavior (Rundle et al., 2004; Bukowski & Auld, 2014). As such, they are a useful biological study model for addressing evolutionary responses to biological invasions. Amnicola limosus is a small dominant freshwater gastropod species with a wide geographical distribution in the USA and Canada (www.gbif.org/species/5192461). This gastropod does not have a pelagic larval phase: egg masses are deposited on the substrate, and juveniles move from the substrate to the macro-algal substrate (Pinel-Alloul & Magnin, 1973). Part of the range of A. limosus has been invaded by Neogobius melanostomus (common name: round goby), a molluscivorous fish, from the lower Great Lakes and running downstream throughout the Upper St. Lawrence River (Hickey & Fowlie, 2005). Amnicola limosus is commonly found in the stomach contents of round gobies, and following the round goby invasion of Lake Saint-Louis, A. limosus populations experienced a 0.5-1 mm reduction in shell size (Kipp et al., 2012). Because the mean gape size of the round goby is larger than the maximum size of A. limosus, round gobies do not have to crush the gastropod, which suggests that shell size reduction is likely to be due to reduced predation pressure on smaller and less visible individuals (round gobies are visual predators; Kipp et al., 2012). A considerable reduction in small gastropod abundance (down to 2-5% of the original population size, with A. limosus being the most abundant species) and species richness in the Upper St. Lawrence River were also reported since the invasion of round gobies in this ecosystem in 2005 (Kipp et al., 2012). However, round gobies cannot tolerate low calcium concentrations (Baldwin et al., 2012; Iacarella & Ricciardi, 2015), and have not invaded the Ottawa River at its junction with the Upper St. Lawrence River (Calcium concentrations below 22 mg/L; Sanderson et al., 2021; Morissette et al., 2024). On the contrary, this low calcium concentration is not a physiological limit for A. limosus embryonic development (> 1.1 mg/L; Shaw & Mackie, 1990), and Pinel-Alloul & Magnin (1973) showed that A. limosus was present in the Ottawa river before the invasion of round gobies, indicating that this species can tolerate the calcium concentration found in the Ottawa river. These calcium-poor waters are thus acting as a refuge from round goby predation in this system (Astorg et al., 2021; Morissette et al., 2024). Calcium-poor waters could potentially provide demographic subsidies for the native populations at invaded sites (e.g., amphipods; Derry et al., 2013).
2. Study sites, sample, and environmental data collection
Twelve study sites were located at the junction of the Ottawa River and the St. Lawrence River near Montreal, QC, Canada (Fig. 1A). Ottawa River water is calcium-poor (10-15 mg/L calcium), and St. Lawrence River water is comparatively calcium-rich (30-40 mg/L) due to the different geological characteristics of their watersheds. These water masses join at the junction of two major river systems at Lake Saint Louis, a widening of the St. Lawrence River, but the calcium gradient persists between the north and south shores, and water masses are distinct (Vis et al., 1998). In 2005, round gobies invaded the upper St. Lawrence River and the southern shore of Lake Saint-Louis, but not the calcium-poor Ottawa River nor calcium-poor sites on the north shore of Lake Saint-Louis (Kipp & Ricciardi, 2012).
We sampled twelve Amnicola limosus populations from the study sites in this fluvial ecosystem (Fig. 1A), with three populations fully in the Ottawa River, three fully in the St. Lawrence River, and six populations in the Lake St-Louis, including three on the north shore and three on the south shore. The gastropod species identification was verified in the lab. We confirmed that our sampling locations corresponded to distinct population units with our population structure analyses (Fig. 1B, 1C and S1). Two populations in close proximity (2.3 km; BEA and GOY) showed low levels of differentiation (FST = 0.005), but the scaled covariance matrix Ω indicated that these two populations were indeed distinct (correlation coefficient between pairs of populations ρij = 0.8, Fig. S1). We coded populations collected in the Ottawa River water as LCGA (low calcium- water and round gobies absent) and populations from the St. Lawrence River water as HCGP (high calcium water and round gobies present). Two populations had inverted patterns: RAF is calcium-poor, but round gobies are present (LCGP), and PDC is calcium-rich, but round gobies are absent (HCGA). It is noteworthy that PDC is located in a refuge habitat (wetland; Astorg et al., 2021) but is close to invaded sites (~8 km from the nearest invaded site). Field-collected A. limosus gastropods were obtained near the shore via hand picking and brought back to the lab for further processing (DNA extractions and the common garden experiment) in June-October 2017. Round goby abundance was measured in the field between July and September 2017 on a single occasion at each site. For this, each site was sampled using three seine net passes. The seine net used for sampling nearshore habitats was 9.14 m long by 1.8 m deep and 1/8 mesh on a 10 m distance. Round gobies were placed into bins and released after the three hauls. The geographic location and environmental characteristics of our sampling sites are detailed in Table S1. We measured dissolved oxygen (DO; mgL-1), pH, water temperature (oC), and conductivity (µS.cm-2) using a Professional Plus Model YSI multi-parameter probe (model 10102030; Yellow Springs Inc.) at each study site in 2017 at the time of gastropod collection. On the same occasions, we collected water samples and analyzed them for calcium (Ca), total phosphorus (TP), total nitrogen (TN), as well as dissolved organic carbon (DOC) at the GRIL-UQAM analytical lab (Supplementary Methods). Site-specific invasion status by round goby (invaded /refuge) is defined by presence/absence (Table S1).
3. Reciprocal transplant experiment
We conducted a laboratory reciprocal transplant experiment at the Université du Québec à Montréal (UQAM) with field-collected F0-generation A. limosus to investigate the response of gastropods from populations experiencing different source habitat types (low calcium/refuge Ottawa River LCGA or high calcium/invaded St. Lawrence River HCGP) to home and transplant water (calcium-poor water from the Ottawa River LCGA or calcium-rich water from the St. Lawrence River HCGP), in the presence or absence of round goby cues (Fig. 2A). The round goby cue treatment was used to test for predator effects on life history fitness components (adult survival and fecundity). A. limosus snails that were involved in the experiment were mostly at adult or sub-adult stages as we selected the largest individuals collected in the field and the dates of collection correspond to the presence of adult cohorts in the field (Pinel-Alloul & Magnin, 1973). Two additional water treatments were also tested: the artificial freshwater medium COMBO (Kilham et al., 1998), with and without the addition of calcium, to test for the specific effect of calcium (Ca) concentration on fitness components. The overall design was therefore a two (origin water: St. Lawrence River LCGA versus Ottawa River HCGP) four (treatment water from St. Lawrence HCGP versus Ottawa River LCGA, COMBO growth media with/without Ca) two (presence versus absence of round goby cue) factorial experiment, with 12 replicates (corresponding to our sampling populations) per treatment combination. As the experiment was conducted within a single generation, we acknowledge that our reciprocal transplant experiment did not allow us to differentiate plastic vs genetic vs maternal effects on the measured traits.
We raised wild F0 individuals from the 12 populations in the laboratory for up to 73 days. Between 15 and 22 individuals (average: 19.61.3) were initially placed in 250 ml plastic cups with river water and reared in growth chambers (Thermo Scientific Precision Model 818) at 18°C with a light:dark photoperiod cycle of 12:12 hours. We fed A. limosus snails ad libitum with defrosted spinach every 2-3 days if needed or at each water change. Water in the water treatments was changed, and old spinach was removed every 3-4 days. For the round goby cue treatment, round gobies were kept in a 50-liter aquarium for two weeks before the experiment, set in a growth chamber at 18°C with a 12:12h light. Round gobies were fed 3-4 times a week with flake fish food (TetraFin). The round goby cue treatment was added as 5 mL of water from the round goby aquarium per A. limosus culture at each water change (every 2-3 days), which represents 2% of the volume of the culture. The addition of water was done manually with a 30 mL syringe. We recorded the survival of the adults and their fecundity (total number of eggs produced per individual) as response variables every 19 13 days throughout the experiment, using high-resolution stereomicroscopes (Olympus). However, due to the very low adult survival for all populations for the treatment testing the effects of calcium in growth media, we removed this comparison from further analyses (see Fig. S2).
We analyzed fecundity (total number of eggs produced) and adult survival rates with a generalized linear model (GLM) and a generalized linear mixed effect model (GLMM) using the lme4 package in R respectively. We modeled fecundity with a negative binomial distribution and adult survival with a binomial distribution and a logit link function. We checked the models for overdispersion using the overdisp_fun function from https://bbolker.github.io. We tested both models with and without the random effect of populations, using an AIC approach corrected for small sample size (AICc) and the ΔAIC criterion to evaluate the random effects (kept when ΔAIC > 2) with the R package bbmle. Likelihood ratio tests were used to evaluate the fixed effects for both the GLM and GLMM models. For the GLM model of fecundity, we checked for the influence of outliers on the model, by using both visual and quantitative diagnostics of the leverage and Cook's distance. We did not find a consistent effect of outliers on this model and thus did not remove outliers. Fixed effect coefficients and their confidence intervals were converted to incident rate ratios (fecundity) and odd ratios (adult survival) using an exponential function.
4. De novo genome assembly and pool-sequencing
As pool-sequencing approaches require a reference genome (Schlötterer et al., 2014) to align the raw sequence reads and to conduct analyses that require information about the position of SNPs on the genome (e.g., to calculate diversity indices along the genome; Kofler, Orozco-terWengel, et al., 2011), we conducted a de novo genome assembly for Amnicola limosus. We extracted DNA from the tissue of one individual snail collected in 2017 using a standard Phenol Chloroform extraction method, after removing the shell and excising the mollusk guts to avoid contaminants. Briefly, tissue samples were placed in a digestion buffer containing proteinase K and digested at 55°C. DNA was then isolated using an isoamyl-phenol-chloroform solution, followed by ethanol precipitation. DNA quantity and quality were verified using a combination of different quality control methods: Qubit assay (Thermo Fisher Scientific Inc.), Tapestation (Agilent Inc.), and Femto Pulse (Agilent Inc,). Fragments longer than 1 kb were selected for further processing. Library preparation was performed using 10X Chromium Linked-Read library kit (10X Genomics Inc.) and sequenced on 3 lanes of Illumina HiSeqX PE150 at Genome Quebec. A long-read approach (10X) was selected for the sequencing as long-read-based assembly methods are more powerful at dealing with some issues (e.g. repeats, high GC content) arising in short-reads-based assemblies (Jung et al., 2020). We ran fastp v.0.23.4 on the three 10X paired-end reads to obtain the insert size, using the -Q option to disable quality filtering. Fastp results showed two estimated insert size peaks at 175 and 270 bp. Reads were assembled with Supernova v.2.1.1. The assembled genome is 1,899,346,312 bp in length, with 815,134 scaffolds and a N50 of approximately 5 kb. We estimated the genome size with Jellyfish 2.3.0 by reading simultaneously the R1 reads of the three runs of 10X sequencing using the options -F 3, -m 21, and -s 2G. The resulting histogram was then processed with GenomeScope http://qb.cshl.edu/genomescope/ (Vurture et al., 2017), which yielded an estimated haploid genome length of 382,882,063 bp, with 2.25% of repeats and 4.72% of heterozygosity. This is much smaller than the assembled genome size (1,899,346,312 bp) due to fragmentation. We also used Benchmarking Universal Single-Copy Orthologs BUSCO v5.2.2 (Manni et al., 2021) to assess gene completeness by searching for core mollusk orthologous genes, using the option -genome and the BUSCO.v4 lineage mollusca_odb10.2019-11-20. Most core genes were missing from our draft genome (74.7%), with only 19% complete core genes recovered (C:19%[S:18.1%, D:0.9%], F: 6.3%, M:74.7%, n=5295, with C: complete single copy BUSCO genes, S: complete and single-copy BUSCOs, D: complete and duplicated BUSCOs, F: fragmented BUSCOs, M: missing BUSCOs, n: total BUSCOs searched). We acknowledge that the draft assembly presented here is of low-quality and completeness. However, the draft genome was sufficient for our population genomics goals, as we were not attempting to resolve the precise genetic architecture of specific traits (Savolainen et al., 2013). For a similar application, see Brennan et al. (2022).
For the pooled sequencing, we extracted DNA from the tissues of 40 individuals per pool/population using the same standard Phenol Chloroform extraction method mentioned above. We quantified all samples using a Picogreen ds DNA assay (Thermo Fisher Scientific Inc.) on an Infinite 200 Nanoquant (Tecan Group Ltd). Samples were normalized to a dsDNA concentration of 15 ng/µL, re-quantified, and pooled according to the sampling population. Thus, we created 12 pools of 40 individuals each at 15 ng/µL. Libraries were prepared with the NEB Ultra II kit for shotgun sequencing and sequenced on 5 lanes of HiSeq2500 125 bp pair-ended at Genome Quebec. The number of reads sequenced per population was between 187-248 million paired-end reads. We used the following formula to calculate the expected mean coverage: read length number of reads / estimated haploid genome length. Given an estimated genome size of 382,882,063 bp, a read length of 125bp, and 93.7-124.2 million single-end reads sequenced, we calculated that our expected mean coverage was between 30-40X. We assessed the quality of our pool-seq Illumina libraries with fastqc 0.11.5, from which we obtained a percentage of repeats between 18.4 and 39.7%. Pool-seq presents some caveats as it offers limited information about linkage disequilibrium (Feder et al., 2012) and can lead to bias during the estimation of allele frequencies (Gautier et al., 2013). However a large array of methods have been developed to account for such biases and accordingly, we have used Baypass and poolFreqDiff to detect outlier loci as well as poolfstat and popoolation to estimate genetic diversity indices (Kofler, Orozco-terWengel, et al., 2011; Kofler, Pandey, et al., 2011; Schlötterer et al., 2014; Gautier, 2015; Wiberg et al., 2017; Gautier et al., 2022).
5. Read processing and SNPs calling
We prepared the assembled reference genome of Amnicola limosus by first indexing it with the Burrows-Wheeler Aligner ( BWA; Li & Durbin, 2009) v0.7.17 and with Samtools faidx v1.12, and by creating a dictionary with Picard Tools v2.23.3. We then used a custom pipeline for pool-seq quality processing, read alignment, and SNP discovery. We first trimmed reads with the function trim-fastq.pl from popoolation v1.2.2 (Kofler, Orozco-terWengel, et al., 2011) for a base quality of 20 and a minimum length of 50 bp, and assessed the quality of the trimmed reads with fastqc. We aligned trimmed reads to the reference genome with bwa-mem v0.7.17. We filtered out ambiguously aligned reads with samtools v1.13 using a score of 20 and sorted bam files with samtools. We used samtools flagstat to find the percentage of Illumina reads aligned to the reference genome, which was on average 53.5% SD 4.0%. We obtained an mpileup file with samtools mpileup, then filtered SNPs with a minimum coverage of five across all populations. We converted the mpileup file to a sync file with Popoolation2 v1.10.03 (Kofler, Pandey, et al., 2011), with a quality score of 20. The sync file was then converted to a "pooldata" object with the poolfstat package in R (Hivert et al., 2018), using a haploid pool size of 80 for all populations, a minimum read count per base of two, a minimum coverage of five and a maximum of 300, a minimal minor allele frequency of 0.0125 (to remove singletons) and discarding indels. This pipeline retained 21,312,700 biallelic SNPs.
6. Detecting genomic signatures of selection
To detect putative loci under selection, we used both outlier and environmental association analysis approaches with the hierarchical Bayesian models implemented in Baypass (Gautier, 2015). Baypass is advantageous in the context of our study as it enables the detection of outlier SNPs after taking demographic history into account, thus avoiding the confounding effects of population structure. The core model estimates the scaled covariance matrix Ω of population allele frequencies, which summarizes population history. The scaled covariance matrix Ω is based on the deviation of population allele frequency from an average allele frequency calculated across populations, where populations that are more closely related tend to have more similar deviations (i.e. covariance) from the average (Coop et al., 2010). The scaled covariance matrix Ω is calculated based on the whole SNP dataset (non-outliers and outliers) and is not related to pairwise FST except in some conditions (Coop et al., 2010). During the calculation of SNP-specific statistics of differentiation (XtX and BFis see below), this covariance is then explicitly accounted for through the scaled covariance matrix Ω (Gautier, 2015). The full dataset was divided into 27 pseudo-independent datasets to overcome computing limitations. The "pooldata" object from poolfstat was converted to the 27 sub-dataset Baypass input files with the "thinning" subsampling method and sub-sample size of 750,000 SNPs. We conducted the outlier analysis using the core model, to estimate the XtX statistic and associated p-value under a χ2 distribution with 12 degrees of freedom (bilateral test, Baypass manual). The XtX statistic is similar to a SNP-specific FST but corrects for covariance using the scaled covariance matrix Ω (Gautier, 2015). We considered SNPs as outliers when their p-value derived from the XtX estimator was < 0.001. The shape of the histogram p-values derived from the XtX statistics confirmed that they were well-behaved (A peak close to 0 for loci putatively under selection and a uniform distribution between [0,1] for neutral loci; Fig. S3B; François et al., 2016). The scaled covariance matrices Ω from the 27 sub-datasets were compared visually to assess the concordance of the results, and then the statistics obtained for each sub-dataset were combined (Baypass manual).
For the environmental association analysis, we opted for the standard model STD under the Importance Sampling approach in Baypass, in which the association between covariables and SNP allele frequencies is assessed independently for each covariable. If this model is not able to distinguish the effect of the calcium concentration from the effect of round goby presence/absence, we should observe a large overlap between outlier SNPs associated with each covariable (see the supplementary material for a more in-depth explanation of the selection of the STD model). This model computes for each SNP its regression coefficient βik of the association between the SNP allele frequencies and a covariable to compare the model with association (βik ≠ 0) against the null model (βik = 0), from which a Bayes factor BFis is derived. We selected two environmental covariables: invasion status (presence/absence of the round gobies) and calcium concentration. We also estimated the C2-statistic with the STD model (Olazcuaga et al., 2021), which is more appropriate for binary variables and was used for the association with round goby presence/absence. We checked the Pearson correlation coefficient between covariables with the function pairs.panel() in the package psych in R, which was r = 0.71 (slightly above the recommended threshold for the regression method of |r| < 0.7, Fig. S4). For the calcium covariable, we ran three independent runs of the STD model with the random number -seed option, then computed the median BFIS across runs to ensure the convergence of the MCMC results. To check for convergence of the independent runs, we calculated the Förstner and Moonen distance (FMD; Förstner & Moonen, 2003) between the scaled covariance matrix Ω matrices from each sub-dataset with the fmd.dist function in R (included in Baypass). Results were found to be convergent, with all FMD values < 0.12. Covariables were all standardized to = 0 and = 1, and we used default parameters for the MCMC analyses. For the calcium association, SNPs were considered significantly associated with a covariable when BFis > 20 dB (Jeffrey’s rule for “decisive evidence”; Gautier, 2015). For the association with round goby presence/absence, we used the R package qvalue to correct for multiple hypothesis testing on the p-values derived from the C2-statistic and applied a False Discovery Rate of α = 0.01 as a q-values cut-off for outlier detection (Olazcuaga et al., 2021).
As a complementary analysis to investigate the potential for adaptation to the invasive predator and low calcium concentrations, we used poolFreqDiff (Wiberg et al., 2017) to identify outlier SNPs showing consistent allele frequency differences in the same direction between replicate populations contrasting the two calcium concentration (low vs high) and round goby predation (absent vs present) environment types. Note that with this approach, our aim was not to identify independent instances of parallel adaptation but rather detect genotype-environment associations; consistent allele frequency differences could arise due to adaptation occurring in a shared recent ancestor. This method relies on modeling allele frequencies with a generalized linear model (GLM) and a quasibinomial error distribution, which should result in a uniform distribution of p-values between [0,1] under the neutral (null) scenario (Wiberg et al., 2017). It also accounts for bias in allele frequency estimation (e.g., Gautier et al., 2013) by rescaling allele counts to an effective sample size neff (Feder et al., 2012). We ran the analysis separately for the two covariables as binary comparisons: invasion status (presence/absence) and calcium concentration (low < 24.3 mg/L, high > 34.3 mg/L). For the minimum read count per base, and the minimum and maximum coverage, we used the same values as for the poolfstat filtering, and we also rescaled the allele counts with neff and added one to zero count cells. To account for demography and genetic structure, we applied the empirical-null hypothesis approach (François et al., 2016) to recalibrate p-values based on a genomic inflation factor of λ = 0.85. We confirmed that recalibrated p-values were well-behaved based on the observed peak close to 0 and the uniform distribution between [0,1] (Fig. S5; François et al., 2016). We then transformed the recalibrated p-values into q-values with the R package qvalue, and defined outliers if their q-value was below the FDR α = 0.01.
We considered candidate SNPs for adaptation to low-calcium concentration or to the round goby predation as SNPs that were in common between the environmental association analyses with Baypass and poolFredDiff and were uniquely associated with each covariable (Fig. S6). We also excluded SNPs showing inconsistent allele frequency associations with environmental variables, which were considered false positives (N = 2 for the association with the calcium gradient and N = 1118 for the association with round goby predation). Note that the positive or negative associations with the co-variables do not correspond to positive or negative selection, as the reference allele is chosen arbitrarily in poolfstat (and thus in Baypass).
7. Population structure, genetic diversity, and demography
We first estimated population structure with a genome-wide pairwise FST matrix from the poolfstat package, using the same parameters as described above and removing the outlier SNPs detected with the Baypass and poolFredDiff analyses. We used the pairwise FST matrix to visualize phylogenetic relationships between populations with an UPGMA tree implemented in the R package ape and evaluated the support of nodes with bootstrapping (N = 1000). The pairwise FST matrix was also used to assess the potential for isolation by distance, using the relationship between the genetic distance (FST/(1- FST); Rousset, 1997) and the log of the geographical distance (2D distribution of populations) with a Mantel test (9999 permutations) using the vegan package in R. Distances between sites were obtained by measuring paths between populations along the rivers (in m) with Google Earth v.10.38.0.0. Two outliers were present In the IBD analysis, but the relationship was still significant when these two data points were removed. We tested for isolation by environment, by calculating the environmental distance between population pairs using the squared Mahalanobis distance, determined from the calcium concentration and round goby presence/absence with the R package ecodist. We verified that there was no correlation between the environmental distance and geographic distance (non-significant Mantel test with 9999 permutations: r2 = 0.03, p-value = 0.20). Then we tested for a relationship between environmental distance and genetic distance as FST/(1-FST) with a Mantel test (9999 permutations). Finally, we compared the results of the pairwise FST matrix with the scaled covariance matrix Ω of population allele frequencies from the core model in Baypass, which summarizes some aspects of population history, though the scaled covariance matrix Ω is based on the complete dataset and thus includes outliers and putatively neutral SNPs.
To test if the round gobies impacted levels of genetic diversity in invaded A. limosus populations, we compared diversity indices between invaded and refuge populations. However, it should be noted that moderate bottlenecks do not necessarily trigger decreases in nucleotide diversity and heterozygosity (Adams & Edmands, 2023), particularly in the initial generations after a bottleneck or in the presence of high levels of standing genetic variation or gene flow (Gompert et al., 2021). We obtained the observed heterozygosity from the poolfstat package and compared heterozygosity levels between round goby-impacted and refuge populations with a t-test after checking for the assumptions of normality (qqplot) and homoscedasticity (Bartlett test). We calculated genome-wide diversity indices (Tajima's pi, Watterson theta, and Tajima's D) using popoolation (Kofler, Orozco-terWengel, et al., 2011). First, we generated mpileup files for each population separately with samtools (Li et al., 2009) from the sorted.bam files output by the custom pipeline. Then we computed the genome-wide diversity indices using non-overlapping windows of 100 kb, a minimum coverage of 20 (as recommended in Kofler, Orozco-terWengel, et al., 2011 except for Tajima’s D with minimum coverage = 13, as the corrected estimator requires the pool size < 3 minimum coverage), a minimum quality of 20, a minimum fraction covered of 0.05 and a pool size of 40. It should be noted that popoolation calculates the diversity indices along chromosomes; thus, due to the fragmentation of our draft genome, the diversity indices were calculated mostly among separate contigs and in windows < 100 kb. The minimum number of SNPs per window across populations using this filter was 25. We used Hedge's G to detect a potential difference in the three diversity indices between the invaded and refuge populations.
We also investigated the demographic history of three population pairs using the diffusion approximation method implemented in δaδi (Gutenkunst et al., 2009). We aimed to detect a potential bottleneck in the invaded populations and to quantify the magnitude and direction of gene flow between the two habitat types (invaded and refuge). Due to the significant computational time required for these analyses, we selected a subset of populations from each habitat type for our models; we selected the populations PB-LCGA, IPE-LCGA, and PDC-HCGA as refuges, paired with PG-HCGP, BEA-HCGP, and GOY-HCGP as invaded populations respectively. PDC-HCGA was of particular interest as a high calcium population located in an uninvaded wetland (refuge). Our most complex model (Fig. S7) has defined effective population sizes after the split (nu1 and nu2), followed by a bottleneck in both populations (modeling a scenario in which the round goby invasion impacted population abundance at the whole ecosystem scale) followed by exponential recovery in both populations. TS is the scaled time between the split and the bottleneck and TB is the scaled time between the bottleneck and present. Migration rates are asymmetric but constant through time after the split, with mIR the migration from refuge to invaded populations and mRI in the opposite direction. As we knew the time of the potential bottleneck (12 years before sampling with one generation per year), we set TB as a fixed parameter. Because TB is set as a fixed parameter, the parameter was an explicit parameter in the models that included a bottleneck. We defined θ with μ the mutation rate of substitutions per site per year from the Caenograstropoda species Nucella lamellose (McGovern et al., 2010), and L the effective sequenced length, calculated as .
We investigated six additional non-nested models: a) bottleneck and growth only in the invaded population (constant Ne for the refuge population) with uneven migration, b) only bottlenecks in both populations without recovery and uneven migration, c) only bottleneck in the invaded population with uneven migration (constant Ne for the refuge population), d) a simple population split at TS with uneven migration, e) a population split with symmetric migration and f) a population split without migration (Fig. S7). The default local optimizer was used on the log of parameters with random perturbation of the parameters to obtain a set of parameter values resulting in the highest composite likelihood. Optimization was conducted repeatedly until convergence was reached (i.e., three optimization runs with log-likelihood within 1% of the best likelihood). One model in one population pair did not reach convergence after 30 optimization runs (bottleneck without recovery for the PDC-HCGA and GOY-HCGP population pair). Finally, we compared our seven models based on the differences in the likelihoods and plots of residuals of the models. We did not use AIC for model selection as it is not appropriate to compare composite likelihoods generated by dadi when SNPs are linked (Noskova et al., 2020) As we obtained unlikely results during the conversion of parameters in our best models, possibly due to imprecise mutation rates, we did not conduct parameter conversion. To obtain uncertainties on our parameters while accounting for the effect of linkage, we used bootstrapping and the Godambe Information Matrix approach (Coffman et al., 2016). For this, we generated 100 bootstrapped datasets with a chunk size of bp. Note that due to the draft genome fragmentation, the bootstrapping was executed between scaffolds of size usually smaller than 100 kb, which may have led to underestimated uncertainties.
To address the potential effect of using a pool-seq approach on the variance in allele frequency estimates stemming from differences in coverage between pools (Gautier et al., 2013), we used a filter to obtain relatively homogenous coverage between our two selected populations/pools. From the initial SNPs dataset output by poolfstat (21,312,700 SNPs), we retained SNPs that fell within the 1st and 3rd quartiles of coverage in both populations (11-19X for PB-LCGA and 10-18X for PG-HCGP; 11-18X for IPE-LCGA and 9-15X for BEA-HCGP; 10-16X for PDC-HCGA and 10-17X for GOY-HCGP). We also filtered out SNPs that were detected as outliers (putatively under selection) in the Baypass (core and aux or STD models) and poolFreqDiff analyses and removed uninformative SNPs (fixed or lost in both populations). To accommodate for large computation time during the optimization, the datasets were thinned at random to retain a final dataset of ≈ 1 million SNPs per population pair. We used a custom script and the dadi_input_pools function from the genomalicious R package (Thia & Riginos, 2019) with the “probs” parameter in the methodSFS option to transform allele frequency data into the SNP data format from δaδi. We used δaδi to infer the folded SFS as we did not have information on the ancestral allele state. Due to low confidence in the low-frequency estimates, we also masked entries from 0 to 5 reads.