How useful is genomic data for predicting maladaptation to future climate?
Data files
Apr 17, 2024 version files 30.38 GB
Abstract
Methods using genomic information to forecast potential population maladaptation to climate change or new environments are becoming increasingly common, yet the lack of model validation poses serious hurdles toward their incorporation into management and policy. Here, we compare the validation of maladaptation estimates derived from two methods – Gradient Forests (GFoffset) and the Risk Of Non-Adaptedness (RONA) – using exome capture pool-seq data from 35 to 39 populations across three conifer taxa: two Douglas-fir varieties and jack pine. We evaluate sensitivity of these algorithms to the source of input loci (markers selected from genotype-environment associations [GEA] or those selected at random). We validate these methods against two-year and 52-year growth and mortality measured in independent transplant experiments. Overall, we find that both methods often better predict transplant performance than climatic or geographic distances. We also find that GFoffset and RONA models are surprisingly not improved using GEA candidates. Even with promising validation results, variation in model projections to future climates makes it difficult to identify the most maladapted populations using either method. Our work advances understanding of the sensitivity and applicability of these approaches, and we discuss recommendations for their future use.
README: How useful is genomic data for predicting maladaptation to future climate?
https://doi.org/10.5061/dryad.sxksn039h
This archive contains the genetic, environmental, and phenotypic data, as well as model outcomes, from the evaluation of genomic offset models in Lind et al. (2024; citation at end of README).
Raw sequence data has been deposited on NCBI's Sequence Read Archive under bioprojects PRJNA1079709 and PRJNA744263. Analysis code is available on Zenodo (which mirrors the GitHub repositories):
Lind BM. 2024. GitHub.com/brandonlind/offset_validation: Publication release (Version 1.1.0). Zenodo (2023). https://doi.org/10.5281/zenodo.10708661
Lind BM. 2023. GitHub.com/brandonlind/douglas_fir_natural_populations: Offset Revision 1 (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.8018894
Lind BM. 2023. GitHub.com/brandonlind/jack_pine_natural_populations: Offset Revision 1 (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.8018892
Author: Brandon M. Lind
- lind(dot)brandon(dot)m(at)gmail(dot)com
- ORCID - 0000-0002-8560-5417
Usage
If you use or are inspired by the manuscript or archives, please cite analysis code (above), or either the manuscript or this archive:
Lind, B. M., R. Candido-Ribeiro, P. Singh, M. Lu, D. O. Vidakovic, T. R. Booker, M. Whitlock, N. Isabel, S. Yeaman, and S. N. Aitken. 2024. How useful is genomic data for predicting maladaptation to future climate? Accepted to Global Change Biology. Available on bioRxiv DOI: https://doi.org/10.1101/2023.02.10.528022
Lind, Brandon et al. (2024). How useful is genomic data for predicting maladaptation to future climate? [Dataset]. Dryad. https://doi.org/10.5061/dryad.sxksn039h
Description of the data and file structure
The following directories (gzip-compressed within the archive) are named so as to be easily referenced with the code notebooks (Lind 2024) that either first used or otherwise produced the files for analysis. For instance, the 01 archived directory contains files first used or output by the corresponding archived notebook 01_split_training_and_testing.ipynb.
jp_datatable.txt
a tab-separated dataframe used to configure the varscan pipeline (Lind 2021) used to call jack pine (jp) SNPs from raw sequence data. The format of the configuration file is explained on the GitHub repository from Lind (2021). Final SNP tables are in the 04/snp_files subdirectory.
df_datatable.txt
a tab-separated dataframe used to configure the varscan pipeline (Lind 2021) used to call Douglas-fir (df) SNPs from raw sequence data. The format of the configuration file is explained on the GitHub repository from Lind (2021). Final SNP tables are in the 04/snp_files subdirectory.
01 directory
a tab-separated dataframe of population information for cross-variety Douglas-fir data (technically duplicated data from the coastal and interior files below). Contains the following columns:
- prov - numerical provenance identification number
- our_id - the population identification used throughout analysis
- Variety - whether the population is of the coastal or interior variety
- LONG - longitude
- LAT - latitude
- Elevation, MAT, MWMT, MCMT, TD, MAP, MSP, AHM, SHM, DD_0, DD5, NFFD, bFFP, eFFP, FFP, PAS, EMT, EXT, Eref, CMD - environmental variables - see Table S1 in Lind et al. (2024) for abbreviations and measurement units.
a tab-separated dataframe of population information for the coastal variety of Douglas-fir. Contains all columns from df_ALL-naturalpops_raw_env-19variables_change-p6.txt, as well as:
- group - the group name for a cluster of populations
- group_color - the color for the specific cluster of populations
a tab-separated dataframe of population information for the interior variety of Douglas-fir. Contains all columns found in df_coastal-naturalpops_raw_env-19variables_change-p6.txt
a tab-separated dataframe of population information for jack pine. Contains all columns of Douglas-fir files (except prov and group), with the following additions:
- id1 and id2 - identification numbers relevant to management
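The population tables described above can be read with any tab-aware parser. The following is a minimal sketch in Python with pandas, assuming the column headers in the files match the names listed in this README exactly; the helper name `load_population_table` and the two demo rows are hypothetical and are not part of the archive.

```python
import io

import pandas as pd

# Environmental columns as listed in this README (Elevation plus 19 climate variables).
ENV_COLS = [
    "Elevation", "MAT", "MWMT", "MCMT", "TD", "MAP", "MSP", "AHM", "SHM", "DD_0",
    "DD5", "NFFD", "bFFP", "eFFP", "FFP", "PAS", "EMT", "EXT", "Eref", "CMD",
]


def load_population_table(path_or_buffer):
    """Read a tab-separated population table and index it by `our_id`."""
    df = pd.read_csv(path_or_buffer, sep="\t")
    # Only columns shared by the Douglas-fir and jack pine tables are required here.
    required = ["our_id", "LONG", "LAT"] + ENV_COLS
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df.set_index("our_id")


# Two made-up rows (illustrative values only, not taken from the archive).
header = ["prov", "our_id", "Variety", "LONG", "LAT"] + ENV_COLS
rows = [
    ["1", "popA", "coastal", "-123.1", "49.2"] + ["1.0"] * len(ENV_COLS),
    ["2", "popB", "interior", "-118.5", "50.7"] + ["2.0"] * len(ENV_COLS),
]
demo_text = "\n".join("\t".join(r) for r in [header] + rows)

pops = load_population_table(io.StringIO(demo_text))
env = pops[ENV_COLS]  # populations x environmental variables
print(env.shape)
```

With the real data, a file path such as that of df_ALL-naturalpops_raw_env-19variables_change-p6.txt would replace the in-memory buffer.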
02 directory
- 7-zipped compressed directory containing climate normals from 1961-1990 from Adapt West (2021), which used ClimateNA (2016)
- shape files updated from the current range of coastal Douglas-fir, edited to add in polygons that surround our populations
- first asterisk is the taxon (coastal Douglas-fir, interior Douglas-fir, or jp for jack pine)
- second asterisk corresponds to the file type suffix
- RDS object of a blank range map, used in gradient_fitting_script.R to write/store offset.
- first asterisk is the taxon (coastal Douglas-fir, interior Douglas-fir, or jp for jack pine)
03 directory
- 7-zipped compressed directory containing projected climate data from Adapt West (2021), which used ClimateNA (2016)
- the first asterisk is for the RCP level (e.g., RCP 4.5 or RCP 8.5)
- the second asterisk is for the projection year (2050s or 2080s)
- 7-zipped compressed directory containing elevation data from Adapt West (2021), which used ClimateNA (2016)
- tab-separated dataframe for all points (latitude and longitude) within shapefile boundaries representing species ranges
- each point (latitude and longitude) is given the same set of environmental values - this set of values correspond to true values of the common garden environment; columns for those variables are described in the 01 Directory
- the first asterisk indicates taxon (coastal Douglas-fir, interior Douglas-fir, or jp for jack pine)
- the second asterisk indicates the common garden location (Vancouver for all Douglas-fir; St Christine or Fontbrune for jack pine)
04 directory
- *_maf-p05_RD-recalculated_ADP-lt-1000_*.txt.gz
- dataframe containing output from five runs of baypass
- the first asterisk indicates taxon (coastal or interior Douglas-fir; jack pine files begin with maf-p05)
- the second asterisk indicates the environmental variable used in association
- each file contains the following columns (asterisks indicate a numerical value for one of five independent runs of baypass to ensure convergence)
  - locus - locus name
  - chain_*-COVARIABLE
  - chain_*-MRK - numerical index to locus name output from baypass
  - chain_*-M_Pearson - correlation coefficient between the scaled allele frequencies and the given covariable after rotation of both vectors - output from baypass
  - chain_*-SD_Pearson - standard deviation of the correlation coefficient between the scaled allele frequencies and the given covariable after rotation of both vectors - output from baypass
  - chain_*-BF(dB) - the estimated Bayes Factor (column BF(dB)) in dB units (i.e., 10 × log10 (BF)) output from baypass
  - chain_*-Beta_is - regression coefficients βi output from baypass
  - chain_*-SD_Beta_is - standard deviation of regression coefficients βi output from baypass
  - chain_*-eBPis - empirical Bayesian p-value output from baypass
  - chain_*-env - the environmental variable used in association
  - mean_BF(dB) - the average Bayes Factor across the five chains (average value across each of the chain_*-BF) calculated manually
  - rank_mean_BF(dB) - the rank of loci from the mean_BF(dB) column, calculated manually
  - BF(dB)_gte20_for-gte3chains - boolean column indicating whether the locus had a Bayes Factor greater than or equal to 20 for at least 3/5 independent baypass runs
  - BF(dB)_gte15_for-gte3chains - boolean column indicating whether the locus had a Bayes Factor greater than or equal to 15 for at least 3/5 independent baypass runs
  - rank_consistency_top1perc_for-gte3chains - boolean column indicating whether the locus was in the top 1% of mean_BF(dB) in at least 3/5 independent baypass runs
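The manually calculated summary columns described above (mean BF(dB), its rank, and the "at least 3 of 5 chains" flags) can be re-derived from the per-chain columns. The sketch below is illustrative and is not the code used to produce the archive. It assumes the asterisk in `chain_*-BF(dB)` is a chain label such as 1 to 5 and that rank 1 denotes the largest mean; the function name and the three demo loci are hypothetical.

```python
import pandas as pd


def summarize_bayes_factors(df, min_chains=3):
    """Recompute per-locus summaries from the per-chain BF(dB) columns."""
    # Pick up every per-chain Bayes Factor column, whatever the chain label is.
    bf_cols = [c for c in df.columns if c.startswith("chain_") and c.endswith("-BF(dB)")]
    out = pd.DataFrame({"locus": df["locus"]})
    out["mean_BF(dB)"] = df[bf_cols].mean(axis=1)
    # Assumption: rank 1 is the locus with the largest mean BF(dB).
    out["rank_mean_BF(dB)"] = (
        out["mean_BF(dB)"].rank(ascending=False, method="min").astype(int)
    )
    for threshold in (20, 15):
        # Count the chains in which the locus meets the threshold.
        n_pass = (df[bf_cols] >= threshold).sum(axis=1)
        out[f"BF(dB)_gte{threshold}_for-gte{min_chains}chains"] = n_pass >= min_chains
    return out


# Made-up Bayes Factors for three loci across five chains (not archive data).
demo = pd.DataFrame(
    {
        "locus": ["locusA", "locusB", "locusC"],
        "chain_1-BF(dB)": [25.0, 16.0, 1.0],
        "chain_2-BF(dB)": [22.0, 17.0, 2.0],
        "chain_3-BF(dB)": [21.0, 15.0, 3.0],
        "chain_4-BF(dB)": [10.0, 3.0, 4.0],
        "chain_5-BF(dB)": [5.0, 2.0, 5.0],
    }
)
summary = summarize_bayes_factors(demo)
print(summary)
```

In this toy input, locusA passes both thresholds in three chains, locusB passes only the 15 dB threshold, and locusC passes neither.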
- *_pooled_varscan_*.txt.tar.gz
- gzipped tab-separated dataframes for SNP data
- the first asterisk indicates species (DF for Douglas-fir, JP for jack pine)
- the second asterisk indicates the file suffix, which further identifies which taxa are included in the file (FDC = coastal Douglas-fir, FDI = interior Douglas-fir, both-varieties = coastal + interior Douglas-fir)
- rows are populations - the names found in the environmental files of the 01 Directory
- columns are locus names
- entries are minor allele frequencies
- * _ * WZA.csv
- comma-separated datatable representing output from the Weighted Z Analysis (WZA) with the following columns
  - index - gene name
  - SNPs - number of SNPs per gene
  - hits - number of candidate SNPs
  - Z - score from the WZA
  - top_candidate_p - p-value from the top candidate test from Yeaman et al.
  - Orthogroup - assigned orthogroup
  - CHROM - reference genome contig name
  - Z_pVal - p-value from the Z from WZA
- the first asterisk designates the taxon (JP = jack pine, DFc = coastal Douglas-fir, DFi = interior Douglas-fir)
- the second asterisk designates the environmental variable; these variables are described in the 01 Directory
- *.climatena.clean.csv.tar.gz
- gzipped comma-separated dataframe containing SNP-level information. For each SNP (row) there is an entry for each environmental variable (described in the 01 Directory), where the entry is the Spearman's rho between the allele frequency and that environmental variable. There is also a column for gene identification (wza_gene), which was used to determine which SNPs were in genes that were significant from the WZA.
- the asterisk designates the taxon (jp = jack pine, fdc = coastal Douglas-fir, fdi = interior Douglas-fir)
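The per-SNP entries described above are rank correlations between population allele frequencies and an environmental variable. A minimal sketch of that calculation follows, assuming a frequency table with populations as rows and loci as columns (as in the pooled varscan files) and an environmental series indexed by the same population names. The function name and the toy values are hypothetical. Spearman's rho is computed as Pearson's r on ranks so that only pandas is needed.

```python
import pandas as pd


def spearman_by_locus(freqs, env):
    """Spearman's rho between each locus (column of freqs) and env."""
    env = env.loc[freqs.index]  # align populations before ranking
    # Pearson correlation of ranks equals Spearman's rho (ties get average ranks).
    return freqs.rank().corrwith(env.rank())


# Toy data: four populations, two loci (values are illustrative only).
populations = ["pop1", "pop2", "pop3", "pop4"]
freqs = pd.DataFrame(
    {"locus1": [0.1, 0.2, 0.4, 0.8], "locus2": [0.9, 0.5, 0.6, 0.1]},
    index=populations,
)
mat = pd.Series([1.0, 2.0, 3.0, 4.0], index=populations, name="MAT")

rho = spearman_by_locus(freqs, mat)
print(rho)
```

Here locus1 increases monotonically with the environment (rho of 1), while locus2 mostly decreases (rho of -0.8).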
06 directory
- tab-delimited dataframes containing phenotype data for jack pine in a given common garden (designated by the asterisk)
- contains the following columns:
  - provenance - provenance identity to align with files in the 01 Directory
  - Site_(source) - either Fontbrune or St. Christine (st-christine)
  - n_tree_alive - number of living trees in 2018
  - Mean_DBH2018 - average diameter at breast height (cm) measured in 2018
  - Mean_Height_2019 - average height (cm) measured in 2019
  - Mortality_2018 - percent mortality measured in 2018
  - [intermediate columns containing no data]
  - prov_name - human-readable provenance name
  - our_id - provenance identification found in the data of the 01 Directory
- tab-delimited dataframe with the average climate (columns; variables explained in 01 Directory) for each common garden (rows). There are also columns for the latitude and longitude of the common garden.
- tab-delimited dataframe with the human-readable provenance names for jack pine corresponding to the population IDs used here (row names, see 01 Directory)
- tab-delimited dataframe with the climate of the Vancouver common garden for 2018, 2019, or the average over 2018-2019 (rows), where columns are environmental variables, latitude, or longitude (described in 01 Directory)
08 directory
- tab-delimited dataframe giving the predictive performance of climate and geographic distance to estimate population performance in common gardens. Contains the following columns:
  - spp - taxon
  - garden - vancouver (Douglas-fir taxa), fontbrune or st-christine (jack pine)
  - pheno - phenotype used in performance evaluation
  - distance_metric - designates the climates used in Mahalanobis distances (all, Climate-based seed transfer) or geographic distance (vincenty_*, where * designates the common garden name)
  - spearman - Spearman's rho between phenotypes and distance metric
  - pearson - Pearson's r between phenotypes and distance metric
  - other columns - not used in manuscript
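The climate-distance benchmark summarized in this table pairs a distance between each source population and the common garden with a rank correlation against the phenotype. The sketch below illustrates that idea and is not the manuscript's implementation. It assumes the covariance matrix for the Mahalanobis distance is estimated from the source-population climates, which this README does not specify. The function names, population labels, and all numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def mahalanobis_to_garden(pop_env, garden_env):
    """Mahalanobis distance from each population's climate to the garden climate."""
    x = pop_env.to_numpy(dtype=float)
    g = garden_env.loc[pop_env.columns].to_numpy(dtype=float)
    # Assumption: covariance is estimated across the source populations.
    cov_inv = np.linalg.pinv(np.cov(x, rowvar=False))
    diff = x - g
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return pd.Series(np.sqrt(np.maximum(d2, 0.0)), index=pop_env.index)


def spearman_rho(a, b):
    """Spearman's rho computed as Pearson's r on ranks."""
    return a.rank().corr(b.rank())


# Made-up climates for five populations and one garden (not archive data).
pop_env = pd.DataFrame(
    {"MAT": [2.0, 4.5, 6.0, 7.5, 9.0], "MAP": [600.0, 1500.0, 900.0, 1200.0, 2000.0]},
    index=["p1", "p2", "p3", "p4", "p5"],
)
garden_env = pd.Series({"MAT": 9.0, "MAP": 2000.0})  # same climate as p5
phenotype = pd.Series([10.0, 14.0, 12.0, 15.0, 20.0], index=pop_env.index)

dist = mahalanobis_to_garden(pop_env, garden_env)
rho = spearman_rho(dist, phenotype)
print(dist.round(3).to_dict(), round(rho, 3))
```

If climate distance predicts performance, a negative rho is expected for growth (populations from climates more distant from the garden grow less).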
- tab-delimited dataframe containing population mean phenotype data for both varieties of Douglas-fir
- this file contains row names
- row names correspond to the population ID found in the our_id column of files described in the 01 directory
- the two columns correspond to the phenotype - either shoot biomass (blup_shoot_biomass) or increment growth (blup_increment).
- note the phenotype data for jack pine are in the 06 directory in the DATA_Pinus_banksiana_ML2021-cg-data_provpops_*.txt files
09 directory
- tab-delimited dataframe containing performance of RONA, with the following columns
  - spp - taxon
    - jp = jack pine
    - fdi = interior Douglas-fir
    - fdc = coastal Douglas-fir
    - fdc-combined (coastal predictions from the cross-variety RONA)
    - fdi-combined (interior predictions from the cross-variety RONA)
    - eastern-fdi (predictions for the eastern interior Douglas-fir genetic group from the interior-only RONA)
    - western-fdi (predictions for the western interior Douglas-fir genetic group from the interior-only RONA)
    - eastern-combined (predictions for the eastern interior Douglas-fir genetic group from the cross-variety RONA)
    - western-combined (predictions for the western interior Douglas-fir genetic group from the cross-variety RONA)
  - garden - the common garden location (either vancouver, st christine, or fontbrune)
  - method - marker GEA source (either baypass or wza)
  - setname - the set of markers (real = real GEA candidates from the method column, random = random set of loci equal in number to the real set of loci for the method)
  - pheno - phenotype used in performance calculation
  - env - environmental variable that was used in calculating RONA
  - spearman - Spearman's rho between RONA and phenotype
  - pearson - Pearson's r between RONA and phenotype
  - other columns - not used in manuscript
- pkl compressed file of a dictionary containing environmental variables used to calculate RONA
- pkl compressed file of a dictionary containing RONA predictions
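For readers unfamiliar with RONA, the sketch below shows one common formulation for a single environmental variable: regress each locus's population allele frequencies on the current environment, predict the frequency expected under the target environment, and average the absolute frequency change across loci for each population. This is an illustration under those assumptions and may differ in detail from the archived notebooks. The function name and toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def rona_single_env(freqs, env_now, env_target):
    """Average absolute allele-frequency change needed per population."""
    x_now = env_now.loc[freqs.index].to_numpy(dtype=float)
    x_target = env_target.loc[freqs.index].to_numpy(dtype=float)
    y = freqs.to_numpy(dtype=float)
    # With a 2-D y, np.polyfit fits one line per locus: rows are slope, intercept.
    slope, intercept = np.polyfit(x_now, y, deg=1)
    expected = np.outer(x_target, slope) + intercept  # populations x loci
    rona = np.abs(expected - y).mean(axis=1)
    return pd.Series(rona, index=freqs.index, name="RONA")


# Toy data: frequencies are exactly linear in the environment (illustrative only).
populations = ["pop1", "pop2", "pop3", "pop4"]
env_now = pd.Series([1.0, 2.0, 3.0, 4.0], index=populations)
env_target = pd.Series([2.0, 3.0, 5.0, 4.0], index=populations)
freqs = pd.DataFrame(
    {"locus1": 0.1 * env_now, "locus2": 0.9 - 0.1 * env_now},
    index=populations,
)

rona = rona_single_env(freqs, env_now, env_target)
print(rona.round(3).to_dict())
```

Both toy loci change by 0.1 per unit of environment, so each population's RONA is 0.1 times its absolute environmental shift. Predicted frequencies are not clipped to the interval from 0 to 1 in this sketch.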
11 directory
- tab-delimited dataframe containing performance scores from Gradient Forests
- first asterisk designates taxon (jp = jack pine, combined = interior + coastal Douglas-fir, fdc = coastal Douglas-fir, fdi = interior Douglas-fir)
- each file contains the following columns:
  - garden - common garden name
  - pheno - phenotype used to calculate performance
  - dataset - GEA method (baypass or WZA) and whether the real GEA candidates were used or a random set of equal sample size
  - full_pearson - Pearson's r between GF offset and phenotype
  - full_spearman - Spearman's rho between GF offset and phenotype
  - other columns - not used
- offsets.pkl
- pkl compressed file containing predicted GF offsets to common gardens for each species
- validation_scores*.pkl
- pkl compressed file with the original GF offset validation output contained in the *validation_results.txt files.
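The .pkl files in this and other directories are Python objects serialized to disk. A minimal loading sketch follows, assuming the files are standard pickles readable by Python's built-in `pickle` module; the internal structure of each object is not documented here, so the nested dictionary below is purely hypothetical. Pickle files should only be loaded from sources you trust, and some may require the same package versions used when they were written.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a pickled Python object from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip a hypothetical nested dictionary to show the pattern.
demo = {"jp": {"fontbrune": {"popA": 0.12, "popB": 0.34}}}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    with open(demo_path, "wb") as handle:
        pickle.dump(demo, handle)
    loaded = load_pickle(demo_path)

print(sorted(loaded))
```

After loading an archived file such as offsets.pkl, inspecting `type(obj)` and, for dictionaries, `obj.keys()` is a reasonable first step.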
16 directory
- pkl compressed file containing the environmental variables that differ significantly between current and future climates for RONA for all taxa
- asterisk designates source
- pkl compressed file containing the future offset predicted from RONA for future climate scenarios for all taxa
18 directory
- pkl compressed file containing the future offset predicted from Gradient Forests for future climate scenarios for all taxa
training directory
- *-*-*-full_gradient_forest_training_importances.txt
- tab-delimited file containing environmental importance values from GF models
- columns contain information for three importance metrics (see GF documentation)
- rows are for each environmental variable (see descriptions in 01 Directory)
- first asterisk designates taxon
- second asterisk designates GEA method (baypass or wza)
- third asterisk designates whether true GEA candidates were used or whether a random set of equal sample size was used for GF model training
- *-*-*-full.sh
- shell file submitted to computing cluster containing commands used to execute gradient forests
- first asterisk designates taxon
- second asterisk designates GEA method (baypass or wza)
- third asterisk designates whether true GEA candidates were used or whether a random set of equal sample size was used for GF model training
- *-*-*-full_*.out
- printed output from the *.sh files of similar name.
- the last asterisk designates the slurm job ID that executed the shell script
Code/Software
Code is archived on Zenodo (Lind 2024), which mirrors the GitHub repository. Hyperlinks within the README on GitHub link to nbviewer.org for easy viewing of Jupyter notebooks.
References
AdaptWest-Project. (2021, February 5). Gridded current and projected climate data for North America at 1km resolution, generated using the ClimateNA software. http://adaptwest.databasin.org
Lind, B.M. (2021) GitHub.com/CoAdapTree/varscan_pipeline: Publication release (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.5083302
Lind, B.M. (2024). GitHub.com/brandonlind/offset_validation: Publication release (Version 1.1.0). Zenodo. https://doi.org/10.5281/zenodo.7641225
Lind, B.M., et al. (2024) How useful is genomic data for predicting maladaptation to future climate? Accepted to Global Change Biology. https://doi.org/10.1101/2023.02.10.528022
Wang, T., Hamann, A., Spittlehouse, D., & Carroll, C. (2016). Locally Downscaled and Spatially Customizable Climate Data for Historical and Future Periods for North America. PLoS ONE, 11(6), e0156720. https://doi.org/10.1371/journal.pone.0156720
Methods
Samples from natural populations were collected for Douglas-fir (Pseudotsuga menziesii var. glauca and P. menziesii var. menziesii) and jack pine (Pinus banksiana). Exome capture probes were used, and pooled sequencing of equimolar quantities of individual DNA was carried out at the Centre d'expertise et de services Génome Québec.