Single nucleotide polymorphisms, environmental data and R scripts used in the work: A donor registry: Genomic analyses of Posidonia australis seagrass meadows identifies adaptive genotypes for future-proofing
Data files
Nov 21, 2024 version, files 2.25 MB
- Environmental_Covariables_NSW_estuaries.csv (37.06 KB)
- PosidoniafiltereddataGEASNPs.csv (2.21 MB)
- README.md (6.25 KB)
Abstract
Globally, anthropogenic climate change has caused declines of seagrass ecosystems, necessitating proactive restoration approaches that would ideally anticipate future conditions. In eastern Australia, environmental conditions in estuaries with meadows of the endangered seagrass Posidonia australis have warmed and acidified over the past decade, and seagrass communities have declined in some estuaries. Securing these valuable habitats will require proactive conservation and restoration efforts that could be augmented with restoration focussed on boosting resilience to future change. Understanding patterns of selection and where seagrass meadows are adapted to particular environmental conditions is key for identifying optimal donor material for restoration. We use single nucleotide polymorphisms and genotype by environment analyses to identify candidate loci under putative selection to environmental stressors and assess genomic variation and allelic turnover along stressor gradients. The most important estuarine variables driving selection were associated with temperature, water turbidity and pH. We developed a preliminary ‘donor registry’ of pre-adapted Posidonia australis genotypes by mapping the distribution of alleles to visualise the allelic composition of each sampled seagrass meadow. The registry could be used as a first step to select source material for future-proofing restoration projects; however, manipulative experiments will be required to test that pre-adapted genotypes confer increased resistance to multiple environmental stressors.
https://doi.org/10.5061/dryad.d2547d89s
Description of the data and file structure
We used single nucleotide polymorphisms (SNP data provided here) and genotype by environment analyses (environmental data provided here) to identify candidate loci under putative selection to environmental stressors and assess genomic variation and allelic turnover along stressor gradients for the seagrass Posidonia australis in estuaries along the New South Wales (Australia) coast.
Files and variables
File: PosidoniafiltereddataGEASNPs.csv
Description: DArT Sequence generated single nucleotide polymorphisms for individual loci from multiple individuals
Variables
- Individual: Samples from estuaries in New South Wales; see the estuarine covariables dataset for details of each individual's origin
- SNP sequence: short read sequence for individual SNPs. Some cells contain “NA” values representing no data available
File: Environmental_Covariables_NSW_estuaries.csv
Description: Corresponding environmental data for individuals and populations in the PosidoniafiltereddataGEASNPs.csv dataset
Variables
- Ind: Name of individual samples
- Pop: Name of population to which each individual belongs
- AvTemp: Average water temperature (degrees Celsius) for the relevant population/individual averaged over the last 20 years
- MinTemp: Minimum temperature (degrees Celsius) for the relevant population/individual over the last 20 years
- MaxTemp: Maximum temperature (degrees Celsius) for the relevant population/individual over the last 20 years
- TempRange: Difference between maximum and minimum temperatures (degrees Celsius) for the relevant population/individual over the last 20 years
- AvpH: Average water pH (pH units) for the relevant population/individual over the last 20 years
- MinpH: Minimum water pH (pH units) for the relevant population/individual over the last 20 years
- MaxpH: Maximum water pH (pH units) for the relevant population/individual over the last 20 years
- pHRange: Difference (pH units) between maximum and minimum pH value for the relevant population/individual over the last 20 years
- AvSal: Average water salinity (practical salinity units) measure for the relevant population/individual over the last 20 years
- MaxSal: Maximum water salinity (practical salinity units) measure for the relevant population/individual over the last 20 years
- AvTurb: Average water turbidity (nephelometric turbidity units) measure for the relevant population/individual over the last 20 years
- MinTurb: Minimum water turbidity (nephelometric turbidity units) measure for the relevant population/individual over the last 20 years
- MaxTurb: Maximum water turbidity (nephelometric turbidity units) measure for the relevant population/individual over the last 20 years
- TurbRange: Difference between maximum and minimum water turbidity (nephelometric turbidity units) value for the relevant population/individual over the last 20 years
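The range variables above are defined as maximum minus minimum, so they can be checked against the other columns after download. The following is a minimal Python sketch of such a check, not part of the deposited R scripts. It assumes the CSV header uses the variable names listed above (Ind, MinTemp, MaxTemp, and so on); the sample rows are made-up values for illustration only, and find_range_mismatches is a hypothetical helper name.

```python
import csv
import io

# Each range column and the (min, max) columns it should be derived from,
# following the variable definitions in this README.
RANGE_COLUMNS = {
    "TempRange": ("MinTemp", "MaxTemp"),
    "pHRange": ("MinpH", "MaxpH"),
    "TurbRange": ("MinTurb", "MaxTurb"),
}

def find_range_mismatches(rows, tol=1e-6):
    """Return (Ind, range_column, expected, reported) for rows where
    the reported range differs from max - min by more than tol."""
    mismatches = []
    for row in rows:
        for range_col, (min_col, max_col) in RANGE_COLUMNS.items():
            # Skip range columns that are absent from this table.
            if not all(c in row for c in (range_col, min_col, max_col)):
                continue
            try:
                expected = float(row[max_col]) - float(row[min_col])
                reported = float(row[range_col])
            except (TypeError, ValueError):
                continue  # non-numeric cell such as "NA"
            if abs(expected - reported) > tol:
                mismatches.append((row["Ind"], range_col, expected, reported))
    return mismatches

# Made-up rows for illustration; these are NOT values from the dataset.
SAMPLE = """Ind,Pop,MinTemp,MaxTemp,TempRange,MinpH,MaxpH,pHRange
A1,PopA,12.0,26.5,14.5,7.6,8.3,0.7
B1,PopB,13.5,27.0,13.0,7.8,8.2,0.4
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for ind, col, expected, reported in find_range_mismatches(rows):
    print(f"{ind}: {col} reported {reported}, expected {expected}")
```

To run the same check on the real file, replace the io.StringIO(SAMPLE) source with an open handle on Environmental_Covariables_NSW_estuaries.csv.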
Code/software
Sequencing error was estimated by calculating the maximum proportion of allelic differences (bitwise distance) found between six pairs of technical replicates using bitwise.dist in the R package poppr.
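As an illustration of the replicate comparison described above, here is a minimal Python sketch of a proportion-of-allelic-differences distance between two diploid SNP profiles. It is not the poppr implementation: it assumes genotypes are coded as counts of the alternate allele (0, 1 or 2, with None for missing) and it skips loci missing in either sample, whereas bitwise.dist handles missing data according to its own options. The function names are hypothetical.

```python
import math

def allelic_difference(a, b):
    """Proportion of differing alleles between two diploid genotypes.

    a, b: sequences of 0/1/2 alternate-allele counts, None for missing.
    Loci missing in either sample are skipped (an assumption of this sketch).
    """
    diffs = 0
    compared = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue
        diffs += abs(ga - gb)  # 0 vs 2 differs at both alleles
        compared += 2          # two alleles per diploid locus
    return diffs / compared if compared else float("nan")

def max_replicate_error(pairs):
    """Largest distance observed across pairs of technical replicates."""
    distances = [allelic_difference(a, b) for a, b in pairs]
    distances = [d for d in distances if not math.isnan(d)]
    return max(distances, default=float("nan"))

# Toy replicate pairs (made-up genotypes, not from the dataset).
rep1 = [0, 1, 2, None]
rep2 = [0, 2, 2, 1]
print(max_replicate_error([(rep1, rep2), (rep1, rep1)]))
```

The maximum across replicate pairs is the quantity the text describes using as an error threshold.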
A data filtering strategy was employed using several functions in the R package dartR v.2.7.2.
Genomic scans for adaptive divergence were carried out to identify candidate SNPs potentially under selective pressure using three different models: Redundancy Analysis (RDA), Principal Component Analysis for Outlier Detection (PCAdapt) and Latent Factor Mixed Models (LFMM2).
To infer population structure, individual ancestry coefficients were estimated using a sparse non-negative matrix factorisation (SNMF) method, implemented with the snmf function in the R package LEA v3.10.2. The optimal number of factors, K=8, was used to inform the LFMM analysis, which tested whether allele frequencies were correlated with any of the environmental variables. Statistical power of associations was increased by imputing missing genotype data via the gl.impute function in the dartR package using the nearest neighbour option. Subsequently, the function lfmm_ridge was used to compute a regularised least-squares estimate using a ridge penalty. Individual associations between each SNP frequency and each environmental variable were assessed using test statistics calibrated with the genomic inflation factor (function lfmm_test). Corrections for multiple comparisons were applied with the Benjamini-Hochberg algorithm with a false discovery rate (FDR) threshold of 5%. Significance of associations was determined using a threshold of 0.001, as the probability of finding a false positive result increases with less stringent thresholds. Candidate SNP loci were retained for downstream analysis when they were identified by at least two of the three methods.
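The calibration, multiple-testing correction and consensus steps described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' R code: it assumes per-SNP z-scores are already available, uses the common definition of the genomic inflation factor (median squared z-score divided by the median of a chi-square distribution with one degree of freedom), and all function names and locus labels are hypothetical.

```python
import math
from collections import Counter
from statistics import median

CHI2_MEDIAN_1DF = 0.4549364  # median of a chi-square with 1 degree of freedom

def gif_calibrate(z_scores):
    """Return (genomic inflation factor, calibrated p-values)."""
    z2 = [z * z for z in z_scores]
    gif = median(z2) / CHI2_MEDIAN_1DF
    if gif == 0:
        raise ValueError("median squared z-score is zero; cannot calibrate")
    # Chi-square (1 df) upper tail at x equals erfc(sqrt(x / 2)).
    return gif, [math.erfc(math.sqrt(v / gif / 2)) for v in z2]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def consensus_candidates(*method_hits, min_methods=2):
    """Loci flagged by at least min_methods of the supplied methods."""
    counts = Counter(locus for hits in method_hits for locus in set(hits))
    return sorted(locus for locus, n in counts.items() if n >= min_methods)

# Toy example with made-up locus names.
rda = {"snp1", "snp2", "snp3"}
pcadapt = {"snp2", "snp4"}
lfmm = {"snp2", "snp3", "snp5"}
print(consensus_candidates(rda, pcadapt, lfmm))
```

A locus would then count as an LFMM hit when its adjusted p-value falls below the chosen threshold, and the consensus step keeps only loci supported by two or more of the three scans.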
The gradient forest (GF) analysis was run in the R package gradientForest using a regression tree-based approach to fit a model of responses between genomic data and environmental variables. Turnover in adaptive genetic variation was modelled on the predictor variables using the candidate SNPs, identified via GEAs, as the response variables. The machine learning algorithm partitioned allele frequencies at numerous split values along each environmental gradient and calculated the change in allele frequencies for each split. The split importance (i.e., the amount of genomic variation explained by each split value) was cumulatively summed along the environmental gradient and aggregated across alleles to build a non-linear turnover function to identify loci that were significantly influenced by the predictor variable. The analysis was run over 500 regression trees for each of the nine environmental predictor variables with all other parameters at default settings.
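To make the turnover function concrete, the cumulative-sum step can be sketched as follows. This is a deliberately simplified Python illustration, not the gradientForest algorithm itself: the real package also standardises split importances by the density of observations along the gradient and weights them by model fit, both of which are omitted here. The split values, importances and function names are made up for the example.

```python
def cumulative_importance(splits):
    """Cumulative split importance along one environmental gradient.

    splits: iterable of (split_value, importance) pairs.
    Returns a list of (split_value, cumulative_importance), sorted by value.
    """
    total = 0.0
    curve = []
    for value, importance in sorted(splits):
        total += importance
        curve.append((value, total))
    return curve

def aggregate_turnover(per_allele_splits):
    """Pool splits from all alleles and build one turnover curve."""
    pooled = [s for splits in per_allele_splits.values() for s in splits]
    return cumulative_importance(pooled)

# Toy splits along a temperature gradient (made-up numbers).
per_allele = {
    "snpA": [(18.0, 0.2), (22.0, 0.1)],
    "snpB": [(20.0, 0.3)],
}
for value, cumulative in aggregate_turnover(per_allele):
    print(value, round(cumulative, 3))
```

Steep sections of the resulting curve mark parts of the gradient where allele frequencies change most rapidly.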
Access information
Other publicly accessible locations of the data:
- Nil
Data was derived from the following sources:
- Nil
A total of 342 P. australis individuals were initially genotyped with the DArTseq™ platform, yielding a total of 11,382 SNP loci with a mean read depth of 7.07 and 20.73% missing data. Sequencing error was estimated by calculating the maximum proportion of allelic differences (bitwise distance) found between six pairs of technical replicates using bitwise.dist in the R package poppr, which was used as a threshold. No sequencing errors were detected and technical replicates were then removed from the dataset. To enhance the quality of SNPs and to optimise the number of loci available for identification of candidate SNPs under potential selection, a data filtering strategy was employed using several functions in the R package dartR v.2.7.2. Data were filtered by applying a locus call rate of 0.67, an individual call rate of 0.25 and a reproducibility threshold of 0.99. Read depth filter parameters were set at 2 to 50, and SNPs were filtered on minor allele frequency (MAF) at the default threshold (0.01). After filtering, a total of 3,277 SNP loci for 311 genotypes across the 13 populations were retained.
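The call-rate and MAF thresholds above can be illustrated with a small Python sketch. It is not the dartR implementation and leaves out the reproducibility and read-depth filters, which need metadata not modelled here. It assumes genotypes are coded as alternate-allele counts (0, 1, 2, None for missing) in an individuals-by-loci table, and the order of the filtering steps is an assumption of this sketch; the function names are hypothetical.

```python
def call_rate(values):
    """Fraction of non-missing genotype calls."""
    return sum(v is not None for v in values) / len(values) if values else 0.0

def minor_allele_freq(column):
    """Minor allele frequency for one locus, ignoring missing calls."""
    called = [g for g in column if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2 * len(called))  # alternate-allele frequency
    return min(p, 1 - p)

def filter_genotypes(genos, loc_cr=0.67, ind_cr=0.25, maf=0.01):
    """Apply locus call rate, individual call rate, then MAF filters.

    Returns (filtered table, kept individual indices, kept locus indices).
    """
    n_loci = len(genos[0]) if genos else 0
    # 1. Drop loci called in too few individuals.
    loci = [j for j in range(n_loci)
            if call_rate([row[j] for row in genos]) >= loc_cr]
    # 2. Drop individuals with too few calls at the remaining loci.
    inds = [i for i, row in enumerate(genos)
            if call_rate([row[j] for j in loci]) >= ind_cr]
    # 3. Drop loci whose minor allele is too rare among kept individuals.
    loci = [j for j in loci
            if minor_allele_freq([genos[i][j] for i in inds]) >= maf]
    table = [[genos[i][j] for j in loci] for i in inds]
    return table, inds, loci

# Toy genotype table (made-up values): 4 individuals x 4 loci.
toy = [
    [0, 1, None, 0],
    [0, 2, None, 0],
    [1, None, None, 0],
    [None, None, None, None],
]
print(filter_genotypes(toy))
```

In the toy table, two loci fail the locus call rate, the fourth individual has no calls at the surviving loci, and one monomorphic locus fails the MAF filter.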
Environmental data were sourced from various repositories; please see the published manuscript for source details.