How useful is genomic data for predicting maladaptation to future climate?

Lind, Brandon1; Candido-Ribeiro, Rafael1; Singh, Pooja2; Lu, Mengmeng2; Obreht Vidakovic, Dragana1; Booker, Tom1; Whitlock, Michael1; Yeaman, Sam2; Isabel, Nathalie3; Aitken, Sally1

Research facility: University of British Columbia

Published Apr 17, 2024 on Dryad. https://doi.org/10.5061/dryad.sxksn039h

Abstract

Methods using genomic information to forecast potential population maladaptation to climate change or new environments are becoming increasingly common, yet the lack of model validation poses serious hurdles toward their incorporation into management and policy. Here, we compare the validation of maladaptation estimates derived from two methods – Gradient Forests (GF_offset) and the Risk Of Non-Adaptedness (RONA) – using exome capture pool-seq data from 35 to 39 populations across three conifer taxa: two Douglas-fir varieties and jack pine. We evaluate sensitivity of these algorithms to the source of input loci (markers selected from genotype-environment associations [GEA] or those selected at random). We validate these methods against two-year and 52-year growth and mortality measured in independent transplant experiments. Overall, we find that both methods often better predict transplant performance than climatic or geographic distances. We also find that GF_offset and RONA models are surprisingly not improved using GEA candidates. Even with promising validation results, variation in model projections to future climates makes it difficult to identify the most maladapted populations using either method. Our work advances understanding of the sensitivity and applicability of these approaches, and we discuss recommendations for their future use.

This archive contains the genetic, environmental, and phenotypic data, as well as model outcomes, from the evaluation of genomic offset models in Lind et al. (2024; citation at end of README).

Raw sequence data has been deposited on NCBI's Sequence Read Archive under bioprojects PRJNA1079709 and PRJNA744263. Analysis code is available on Zenodo (which mirrors the GitHub repositories):

Lind BM. 2024. GitHub.com/brandonlind/offset_validation: Publication release (Version 1.1.0). Zenodo. https://doi.org/10.5281/zenodo.10708661

Lind BM. 2023. GitHub.com/brandonlind/douglas_fir_natural_populations: Offset Revision 1 (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.8018894

Lind BM. 2023. GitHub.com/brandonlind/jack_pine_natural_populations: Offset Revision 1 (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.8018892

Author: Brandon M. Lind

lind(dot)brandon(dot)m(at)gmail(dot)com
ORCID - 0000-0002-8560-5417

Usage

If you use or are inspired by the manuscript or archives, please cite analysis code (above), or either the manuscript or this archive:

Lind, B. M., R. Candido-Ribeiro, P. Singh, M. Lu, D. O. Vidakovic, T. R. Booker, M. Whitlock, N. Isabel, S. Yeaman, and S. N. Aitken. 2024. How useful is genomic data for predicting maladaptation to future climate? Accepted to Global Change Biology. Available on bioRxiv. DOI: https://doi.org/10.1101/2023.02.10.528022

Lind, Brandon et al. (2024). How useful is genomic data for predicting maladaptation to future climate? [Dataset]. Dryad. https://doi.org/10.5061/dryad.sxksn039h

Description of the data and file structure

The following directories (gzip-compressed within the archive) are named so that they can be easily matched to the code notebooks (Lind 2024) that either first used or otherwise produced the files for analysis. For instance, the archived 01 directory contains files first used or output by the corresponding archived notebook 01_split_training_and_testing.ipynb.

jp_datatable.txt

a tab-separated dataframe that configures the varscan pipeline (Lind 2021) used to call jack pine (jp) SNPs from raw sequence data. The format of the configuration file is explained on the GitHub repository from Lind (2021). Final SNP tables are in the 04/snp_files subdirectory.

df_datatable.txt

a tab-separated dataframe that configures the varscan pipeline (Lind 2021) used to call Douglas-fir (df) SNPs from raw sequence data. The format of the configuration file is explained on the GitHub repository from Lind (2021). Final SNP tables are in the 04/snp_files subdirectory.

01 directory

• df_ALL-naturalpops_raw_env-19variables_change-p6.txt

a tab-separated dataframe of population information for cross-variety Douglas-fir data (technically a duplicate of the data in the coastal and interior files below). Contains the following columns:

prov
- numerical provenance identification number
our_id
- the population identification used throughout analysis
Variety
- whether the population is of the coastal or interior variety
LONG
- longitude
LAT
- latitude
Elevation, MAT, MWMT, MCMT, TD, MAP, MSP, AHM, SHM, DD_0, DD5, NFFD, bFFP, eFFP, FFP, PAS, EMT, EXT, Eref, CMD
- environmental variables - see Table S1 in Lind et al. (2024) for abbreviations and measurement units.
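
As a minimal sketch of how a population table with these columns could be read using only the Python standard library, the example below parses a small in-memory excerpt. The population IDs, coordinates, and climate values are invented for illustration, the coding of the Variety column as "coastal" / "interior" is an assumption, and read_population_table is a hypothetical helper, not part of the archived code.

```python
import csv
import io

# Hypothetical two-row excerpt (values invented for illustration); the real
# files have one row per population and all environmental columns listed above.
example = (
    "prov\tour_id\tVariety\tLONG\tLAT\tElevation\tMAT\n"
    "1\tDF_p1\tcoastal\t-123.5\t49.2\t150\t9.1\n"
    "2\tDF_p2\tinterior\t-119.0\t50.7\t900\t5.4\n"
)


def read_population_table(handle):
    """Return {our_id: row dict} from a tab-separated population table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["our_id"]: row for row in reader}


populations = read_population_table(io.StringIO(example))

# Values are read as strings; convert numeric columns explicitly when needed.
# The "coastal" label is an assumed encoding of the Variety column.
coastal_ids = [pid for pid, row in populations.items() if row["Variety"] == "coastal"]
mat_by_pop = {pid: float(row["MAT"]) for pid, row in populations.items()}
```

To read an actual file, the same helper could be given a handle from open(path, newline="") in place of the io.StringIO object.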

• df_coastal-naturalpops_raw_env-19variables_change-p6.txt

a tab-separated dataframe of population information for the coastal variety of Douglas-fir. Contains all columns from df_ALL-naturalpops_raw_env-19variables_change-p6.txt, as well as:

group
- the group name for a cluster of populations
group_color
- the color for the specific cluster of populations

• df_interior-naturalpops_raw_env-19variables_change-p6.txt

a tab-separated dataframe of population information for the interior variety of Douglas-fir. Contains all columns found in df_coastal-naturalpops_raw_env-19variables_change-p6.txt

• jp_no-p24_raw_env-19variables.txt

a tab-separated dataframe of population information for jack pine. Contains all columns of Douglas-fir files (except prov and group), with the following additions:

id1 and id2
- identification numbers relevant to management

02 directory

• NA_NORM_1961-1990_netCDF.7z

7-zip compressed directory containing climate normals from 1961-1990 from Adapt West (2021), which used ClimateNA (2016)

• *_union_file.*

shape files updated from the current range of coastal Douglas-fir, edited to add in polygons that surround our populations
first asterisk is the taxon (coastal Douglas-fir, interior Douglas-fir, or jp for jack pine)
second asterisk corresponds to the file type suffix

• *_union_mask.RDS

RDS object of a blank range map, used in gradient_fitting_script.R to write/store offset.
first asterisk is the taxon (coastal Douglas-fir, interior Douglas-fir, or jp for jack pine)

03 directory

• NA_ENSEMBLE_rcp*_20*_Bioclim_netCDF.7z

7-zipped compressed directory containing projected climate data from Adapt West (2021), which used ClimateNA (2016)
the first asterisk is for the RCP level (e.g., RCP 4.5 or RCP 8.5)
the second asterisk is for the projection year (2050s or 2080s)
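
Throughout this README, asterisks in file names act as wildcards. As a rough sketch of how such a pattern can be matched against a directory listing, the example below uses Python's fnmatch module. The candidate file names are hypothetical and only illustrate the matching; the exact names in the archive may differ.

```python
from fnmatch import fnmatchcase

# Pattern as written in this README.
pattern = "NA_ENSEMBLE_rcp*_20*_Bioclim_netCDF.7z"

# Hypothetical file names, for illustration only.
candidates = [
    "NA_ENSEMBLE_rcp45_2055_Bioclim_netCDF.7z",
    "NA_ENSEMBLE_rcp85_2085_Bioclim_netCDF.7z",
    "NA_Reference_files_netCDF.7z",
]

# Keep only the names that fit the wildcard pattern (case-sensitive).
matches = [name for name in candidates if fnmatchcase(name, pattern)]
```

On a real directory, pathlib.Path(directory).glob(pattern) would apply the same kind of wildcard matching to the files on disk.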

• NA_Reference_files_netCDF.7z

7-zipped compressed directory containing elevation data from Adapt West (2021), which used ClimateNA (2016)

• *-*all-envsWGS84_clipped.txt

tab-separated dataframe for all points (latitude and longitude) within shapefile boundaries representing species ranges
each point (latitude and longitude) is given the same set of environmental values - this set of values correspond to true values of the common garden environment; columns for those variables are described in the 01 Directory
the first asterisk indicates taxon (coastal Douglas-fir, interior Douglas-fir, or jp for jack pine)
the second asterisk indicates the common garden location (Vancouver for all Douglas-fir; St Christine or Fontbrune for jack pine)

04 directory

• baypass_output subdirectory

*_maf-p05_RD-recalculated_ADP-lt-1000_*.txt.gz
- dataframe containing output from five runs of baypass
- the first asterisk indicates taxon (coastal or interior Douglas-fir; jack pine files begin with maf-p05)
- the second asterisk indicates the environmental variable used in association
- each file contains the following columns (asterisks indicate a numerical value for one of five independent runs of baypass to ensure convergence)
  - locus - locus name
  - chain_*-COVARIABLE
  - chain_*-MRK
    - numerical index to locus name output from baypass
  - chain_*-M_Pearson
    - correlation coefficient between the scaled allele frequencies and the given covariable after rotation of both vectors - output from baypass
  - chain_*-SD_Pearson
    - standard deviation of the correlation coefficient between the scaled allele frequencies and the given covariable after rotation of both vectors - output from baypass
  - chain_*-BF(dB)
    - the estimated Bayes Factor (column BF(dB)) in dB units (i.e., 10 × log10 (BF)) output from baypass
  - chain_*-Beta_is
    - regression coefficients βi output from baypass
  - chain_*-SD_Beta_is
    - standard deviation of regression coefficients βi output from baypass
  - chain_*-eBPis
    - empirical Bayesian p-value output from baypass
  - chain_*-env
    - the environmental variable used in association
  - mean_BF(dB)
    - the average Bayes Factor across the five chains (average value across each of the chain_*-BF) calculated manually
  - rank_mean_BF(dB)
    - the rank of loci from the mean_BF(dB) column, calculated manually
  - BF(dB)_gte20_for-gte3chains
    - boolean column indicating whether the locus had a Bayes Factor greater than or equal to 20 for at least 3/5 independent baypass runs
  - BF(dB)_gte15_for-gte3chains
    - boolean column indicating whether the locus had a Bayes Factor greater than or equal to 15 for at least 3/5 independent baypass runs
  - rank_consistency_top1perc_for-gte3chains
    - boolean column indicating whether the locus was in the top 1% of mean_BF(dB) in at least 3/5 independent baypass runs
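
The manually calculated columns above can be illustrated with a short sketch. It assumes the five per-chain BF(dB) values for a locus are already available as a list; the function names and example values are hypothetical and this is not the authors' code.

```python
import math


def bf_to_db(bf):
    """Convert a Bayes Factor to dB units: 10 * log10(BF)."""
    return 10 * math.log10(bf)


def summarize_chains(bf_db_per_chain, threshold=20, min_chains=3):
    """Return (mean BF(dB), whether >= min_chains chains reach the threshold)."""
    mean_bf = sum(bf_db_per_chain) / len(bf_db_per_chain)
    n_pass = sum(1 for bf in bf_db_per_chain if bf >= threshold)
    return mean_bf, n_pass >= min_chains


# Hypothetical BF(dB) values for one locus across five independent chains.
chains = [22.0, 19.5, 21.0, 25.0, 18.0]

mean_bf, gte20 = summarize_chains(chains)            # analogue of BF(dB)_gte20_for-gte3chains
_, gte15 = summarize_chains(chains, threshold=15)    # analogue of BF(dB)_gte15_for-gte3chains
```

Here three of the five chains reach 20 dB, so the locus would be flagged under both thresholds.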

• snp_files subdirectory

*_pooled_varscan_*.txt.tar.gz
- gzipped tab-separated dataframes for SNP data
- the first asterisk indicates species (DF for Douglas-fir, JP for jack pine)
- the second asterisk indicates the file suffix, which further identifies which taxa are included in the file (FDC = coastal Douglas-fir, FDI = interior Douglas-fir, both-varieties = coastal + interior Douglas-fir)
- rows are populations - the names found in the environmental files of the 01 Directory
- columns are locus names
- entries are minor allele frequencies

• wza_output subdirectory

* _ * WZA.csv
- comma-separated datatable representing output from the Weighted Z Analysis (WZA) with the following columns
  - index
    - gene name
  - SNPs
    - number of SNPs per gene
  - hits
    - number of candidate SNPs
  - Z
    - score from the WZA
  - top_candidate_p
    - p-value from the top candidate test from Yeaman et al.
  - Orthogroup
    - assigned orthogroup
  - CHROM
    - reference genome contig name
  - Z_pVal
    - p-value from the Z from WZA
- the first asterisk designates the taxon (JP = jack pine, DFc = coastal Douglas-fir, DFi = interior Douglas-fir)
- the second asterisk designates the environmental variable, these variables are described in the output in the 01 Directory
*.climatena.clean.csv.tar.gz
- gzipped comma-separated dataframe containing SNP-level information. For each SNP (row) there is an entry for each environmental variable (described in the 01 Directory), where the entry is the Spearman's rho between the allele frequency and that environmental variable. There is also a column for gene identification (wza_gene), which was used to determine which SNPs were in genes that were significant from the WZA.
- the first asterisk designates the taxon (jp = jack pine, fdc = coastal Douglas-fir, fdi = interior Douglas-fir)

06 directory

• DATA_Pinus_banksiana_ML2021-cg-data_provpops_*.txt

tab-delimited dataframes containing phenotype data for jack pine in a given common garden (designated by the asterisk)
contains the following columns:
- provenance
  - provenance identity to align with files in the 01 Directory
- Site_(source)
  - either Fontbrune or St. Christine (st-christine)
- n_tree_alive
  - number of living trees in 2018
- Mean_DBH2018
  - average diameter at breast height (cm) measured in 2018
- Mean_Height_2019
  - average height (cm) measured in 2019
- Mortality_2018
  - percent mortality measured in 2018
- [intermediate columns containing no data]
- prov_name
  - human-readable provenance name
- our_id
  - provenance identification found in the data of the 01 Directory

• jack_pine_common_gardens_average_climate_*.txt

tab-delimited dataframe with the average climate (columns; variables explained in 01 Directory) for each common garden (rows). There are also columns for the latitude and longitude of the common garden.

• population_provenance_proxies.txt

tab-delimited dataframe with the human-readable provenance names for jack pine corresponding to the population IDs used here (row names, see 01 Directory)

• vancouver_climate-2018-2019_USING.txt

tab-delimited dataframe with the climate of the Vancouver common garden for 2018, 2019, or the average over 2018-2019 (rows), where columns are environmental variables, latitude, or longitude (described in 01 Directory)

08 directory

• climate_geo_dist.txt

tab-delimited dataframe giving the predictive performance of climate and geographic distance for estimating population performance in common gardens. Contains the following columns:
- spp
  - taxon
- garden
  - vancouver (douglas-fir taxa), fontbrune or st-christine (jack pine)
- pheno
  - phenotype used in performance evaluation
- distance_metric
  - designates the climates used in Mahalanobis distances (all, Climate-based seed transfer)
  - or geographic distance (vincenty_*, where * designates the common garden name)
- spearman
  - Spearman's rho between phenotypes and distance metric
- pearson
  - Pearson's r between phenotypes and distance metric
- other columns
  - not used in manuscript
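
For readers unfamiliar with the spearman and pearson columns, the sketch below shows how the two coefficients relate: Spearman's rho is Pearson's r computed on ranks. It uses only the standard library and invented example values; it is a minimal illustration, not the code used to produce this file.

```python
import math


def pearson(x, y):
    """Pearson's r between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values: distance of five populations to a garden and their mean height.
distance = [0.5, 1.2, 2.0, 3.1, 4.4]
height = [310.0, 295.0, 300.0, 240.0, 150.0]

rho = spearman(distance, height)
r = pearson(distance, height)
```

A negative coefficient in this toy example means that populations from more distant sources grew less in the garden.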

• dfdata.txt

tab-delimited dataframe containing population mean phenotype data for both varieties of Douglas-fir
this file contains rownames
row names correspond to population ID found in the our_id column of files described in the 01 directory
the two columns correspond to the phenotype - either shoot biomass (blup_shoot_biomass) or increment growth (blup_increment).
note the phenotype data for jack pine are in the 06 directory in the DATA_Pinus_banksiana_ML2021-cg-data_provpops_*.txt files

09 directory

• correlations.txt

tab-delimited dataframe containing performance of RONA, with the following columns
- spp
  - taxon or genetic group, one of:
  - jp (jack pine)
  - fdi (interior Douglas-fir)
  - fdc (coastal Douglas-fir)
  - fdc-combined (coastal predictions from the cross-variety RONA)
  - fdi-combined (interior predictions from the cross-variety RONA)
  - eastern-fdi (predictions for the eastern interior doug-fir genetic group from the interior-only RONA)
  - western-fdi (predictions for the western interior doug-fir genetic group from the interior-only RONA)
  - eastern-combined (predictions for the eastern interior doug-fir genetic group from the cross-variety RONA)
  - western-combined (predictions for the western interior doug-fir genetic group from the cross-variety RONA)
- garden
  - the common garden location (either vancouver, st christine, or fontbrune)
- method
  - marker GEA source (either baypass or wza)
- setname
  - the set of markers (real = real GEA candidates from the method column, random = random set of loci equal to the real set of loci for the method)
- pheno
  - phenotype used in performance calculation
- env
  - environmental variable that was used in calculating RONA
- spearman
  - Spearman's rho between RONA and phenotype
- pearson
  - Pearson's r between RONA and phenotype
- other columns
  - not used in manuscript

• efdict.pkl

pkl compressed file of a dictionary containing environmental variables used to calculate RONA

• rona.pkl

pkl compressed file of a dictionary containing RONA predictions
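
The .pkl files in this archive are Python pickle files. A minimal sketch for loading one is shown below; load_pickle is a hypothetical helper, and the nested dictionary used for the round trip is a stand-in whose keys and values are invented, since the internal structure of the archived dictionaries is not described here.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load a pickled Python object from disk (only use with trusted files)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a stand-in dictionary; the real files' layout is an assumption.
stand_in = {"jp": {"MAT": {"pop_1": 0.12}}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pkl"
    with open(path, "wb") as handle:
        pickle.dump(stand_in, handle)
    loaded = load_pickle(path)

# Inspecting the top-level keys is a reasonable first step with an unfamiliar pickle.
top_level_keys = sorted(loaded)
```

If the archived dictionaries hold objects from third-party libraries, those libraries would need to be installed for unpickling to succeed.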

11 directory

• *validation_results.txt

tab delimited dataframe containing performance scores from Gradient Forests
the asterisk designates taxon (jp = jack pine, combined = interior + coastal doug-fir, fdc = coastal doug-fir, fdi = interior doug-fir)
each file contains the following columns:
- garden
  - common garden name
- pheno
  - phenotype used to calculate performance
- dataset
  - GEA method (baypass or WZA) and whether the real GEA candidates were used or a random set of equal sample size
- full_pearson
  - Pearson's r between GF offset and phenotype
- full_spearman
  - Spearman's rho between GF offset and phenotype
- other columns
  - not used

• offsets.pkl

pkl compressed file containing predicted GF offsets to common gardens for each species

• validation_scores*.pkl

pkl compressed file with the original GF offset validation output contained in the *validation_results.txt files.

16 directory

• *efdict.pkl

pkl compressed file containing the environmental variables that differ significantly between current and future climates for RONA for all taxa
asterisk designates source

• future_rona.pkl

pkl compressed file containing the future offset predicted from RONA for future climate scenarios for all taxa

18 directory

• gf_offsets_to_future.pkl

pkl compressed file containing the future offset predicted from Gradient Forests for future climate scenarios for all taxa

training directory

• training_outfiles subdirectory

** - * - * - full_gradient_forest_training_importances.txt*
- tab-delimited file containing environmental importance values from GF models
- columns contain information for three importance metrics (see GF documentation)
- rows are for each environmental variable (see descriptions in 01 Directory)
- first asterisk designates taxon
- second asterisk designates GEA method (baypass or wza)
- third asterisk designates whether true GEA candidates were used or whether a random set of equal sample size was used for GF model training

• training_shfiles subdirectory

* - * - *-full.sh
- shell file submitted to computing cluster containing commands used to execute gradient forests
- first asterisk designates taxon,
- second asterisk designates GEA method (baypass or wza)
- third asterisk designates whether true GEA candidates were used or whether a random set of equal sample size was used for GF model training
* - * - * *-full_ **.out
- printed output from the *sh files of similar name.
- the last asterisk designates the slurm job ID that executed the shell script

Code/Software

Code is archived on Zenodo (Lind 2024), which mirrors the GitHub repository. Hyperlinks within the README on GitHub link to nbviewer.org for easy viewing of Jupyter notebooks.

References

AdaptWest-Project. (2021, February 5). Gridded current and projected climate data for North America at 1km resolution, generated using the ClimateNA software. http://adaptwest.databasin.org

Lind, B.M. (2021) GitHub.com/CoAdapTree/varscan_pipeline: Publication release (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.5083302

Lind, B.M. (2024). GitHub.com/brandonlind/offset_validation: Publication release (Version 1.1.0). Zenodo. https://doi.org/10.5281/zenodo.7641225

Lind, B.M., et al. (2024) How useful is genomic data for predicting maladaptation to future climate? Accepted to Global Change Biology. https://doi.org/10.1101/2023.02.10.528022

Wang, T., Hamann, A., Spittlehouse, D., & Carroll, C. (2016). Locally Downscaled and Spatially Customizable Climate Data for Historical and Future Periods for North America. PLoS ONE, 11(6), e0156720. https://doi.org/10.1371/journal.pone.0156720