Environmental data provide marginal benefit for predicting climate adaptation

Li, Forrest; Runcie, Daniel; Ross-Ibarra, Jeffrey

Published May 28, 2025 on Dryad. https://doi.org/10.5061/dryad.5hqbzkhhf

Data files (May 28, 2025 version, 423.21 MB total)

- environmental_data_marginal_benefit_dryad_data.zip (423.19 MB)
- README.md (12.26 KB)

Abstract

Natural and cultivated plant and animal populations will be affected by more extreme climate events, such as drought and flooding, in the future. We explore whether characterizing the environment-of-origin of each accession in a large sample of traditional maize germplasm can accelerate conservation and breeding efforts for adaptation.

We compare the utility of genotype and environmental data for predicting fitness of individuals in a number of common garden trials. We find that environment-of-origin data and genome scans for loci that associate with abiotic environmental variables provide surprisingly little benefit to prioritizing accessions for improvement, despite clear evidence of environmental adaptation in these accessions. These results provide important practical insight into the use of gene banks for climate adaptation.

Methods include prediction of environmental variables from genotyping-by-sequencing data, environmental GWAS (envGWAS) to identify loci associated with climatic gradients, and genomic prediction of yield component traits.

Data includes metadata associated with phenotypic field trials, transformed climate-of-origin data associated with accessions used for study, output from MegaLMM for genomic prediction of environment (GPoE) and covariance matrices used for envGWAS, results from envGWAS and phenotypic GWAS, models run for transfer plots, and genomic prediction results.

Additional raw data (via CIMMYT) and reproducible code (via GitHub) are listed under Access information and in the associated publication, "Environmental data provide marginal benefit for predicting climate adaptation."

Dataset DOI: 10.5061/dryad.5hqbzkhhf

Description of the data and file structure

One zip file including all analysis data is included to preserve structure.

envGWAS results
- Includes all results from JointGWAS for multivariate environmental GWAS (envGWAS), including effect sizes, F-values and p-values. Also includes significant SNPs genome-wide and clumped lead SNPs for top peaks.
GPoE
- Includes results for genomic prediction of environment analyses, including accessions in sampled folds and predicted environmental values. Includes both spatial and random 10-fold cross-validation (CV) analyses.
MegaLMM output
- Includes results from MegaLMM output for generating genetic (G_hat) and residual (R_hat) covariance matrices used in JointGWAS for envGWAS.
Phenotypic GWAS
- Includes results for GWAS for phenotypic traits, such as anthesis silking interval (ASI), bare cob weight, days to flowering, field weight, grain weight per hectare (corrected for moisture), and plant height. Includes effect sizes and p-values.
Phenotypic Prediction
- Includes model accuracies for phenotypic trait prediction for above traits. Also includes model prediction using SNPs obtained from envGWAS done without population structure correction.
Tables
- Includes metadata for phenotypic field trials, such as accessions tested, trial coordinates, and testers used. Also includes inverse-normal-transformed climate data for all accessions used in this dataset.

Files and variables

File: environmental_data_marginal_benefit_dryad_data.zip

Description:

Note: Effect sizes and climate variables are rank-based inverse-normal transformed (INT). The transformed values are unitless and are best interpreted as ranks mapped onto a normal distribution. The transformation places accessions' environmental data on comparable distributions.
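For readers unfamiliar with the transform described in the note above, the following is a minimal sketch of a rank-based inverse-normal transform using only the Python standard library. The function name `rank_inverse_normal`, the Blom offset of 0.375, and the example elevations are assumptions made for illustration; this README does not state which offset or tie-handling rule was used to produce the files.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.375):
    """Rank-based inverse-normal transform (INT) of a list of numbers.

    Tied values receive their average rank. offset=0.375 (Blom) is an
    assumption; the dataset description does not say which offset was used.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    standard_normal = NormalDist()
    # Map each rank to a quantile strictly between 0 and 1, then to a z-score.
    return [
        standard_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
        for r in ranks
    ]


# Made-up elevations in meters become unitless scores.
scores = rank_inverse_normal([2250.0, 15.0, 980.0, 3100.0, 980.0])
```

Because only ranks enter the calculation, variables measured in different units (degrees, millimeters, meters) end up on the same scale, which is what allows effect sizes to be compared across climate variables.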

envGWAS results

envGWAS_results_chr_NN.csv - results from JointGWAS for multivariate envGWAS, separated by chromosome (10 files corresponding to each chromosome NN)
- Chr: Chromosome number where the SNP is located
- pos: SNP position on the chromosome (v4 assembly).
- snp: SNP identifier
- maf: minor allele frequency, i.e., the frequency of the less common allele at this SNP
- MSE: mean squared error of the model fitted for this SNP
- X..Df: Degrees of freedom
- X..Fvalue: F-statistic value for envGWAS association to any climate variable for the SNP.
- X..Pvalue: p-value calculated from the envGWAS F-statistic
envGWAS_beta_hats_chr_NN.csv - beta hat effect sizes for individual ranked inverse-normal transformed (INT) climatic variables from multivariate envGWAS, separated by chromosome (10 files corresponding to each chromosome NN)
- tmin..X: effect size for INT-transformed minimum temperature in growing season
- tmax..X: effect size for INT-transformed maximum temperature in growing season
- trange..X: effect size for INT-transformed temperature range in growing season
- precipTot..X: effect size for INT-transformed total precipitation in growing season
- aridityMean..X: effect size for INT-transformed mean aridity in growing season
- rhMean..X: effect size for INT-transformed mean humidity in growing season
- elevation..X: effect size for INT-transformed elevation of collected maize
GEA_lead_SNPs_list_with_coord.csv - top 32 hits from envGWAS after LD clumping (custom script)
- CHR: Chromosome number where the SNP is located
- BP: SNP position on the chromosome (v4 assembly).
- SNP: SNP identifier
- STRAND: SNP strand orientation, either "+" (forward) or "-" (reverse).
- v4_SNP: SNP identifier for assembly v4
GEA_significant_SNPs_list_with_coord.csv - all envGWAS hits passing a significance threshold of p < 1e-5
- CHR: Chromosome number where the SNP is located
- BP: SNP position on the chromosome (v4 assembly).
- SNP: SNP identifier
- STRAND: SNP strand orientation, either "+" (forward) or "-" (reverse).
- v4_SNP: SNP identifier for assembly v4
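As an illustration of how the per-chromosome result files relate to the significant-SNP list, the sketch below filters rows of an envGWAS results table at the 1e-5 threshold. It is not the authors' script: the function name `significant_snps`, the strict less-than comparison, the assumption that every p-value parses as a number, and the example rows are all hypothetical. Only the column names are taken from the descriptions above.

```python
import csv
import io

P_THRESHOLD = 1e-5  # threshold stated for GEA_significant_SNPs_list_with_coord.csv


def significant_snps(csv_file, threshold=P_THRESHOLD):
    """Return rows whose envGWAS p-value is below the threshold.

    csv_file is any open text file (or file-like object) with the column
    headers described for envGWAS_results_chr_NN.csv.
    """
    hits = []
    for row in csv.DictReader(csv_file):
        p_value = float(row["X..Pvalue"])  # assumes no missing p-values
        if p_value < threshold:
            hits.append(
                {"Chr": row["Chr"], "pos": row["pos"], "snp": row["snp"], "p": p_value}
            )
    # Most significant associations first.
    return sorted(hits, key=lambda hit: hit["p"])


# Made-up rows standing in for one envGWAS_results_chr_NN.csv file.
example = io.StringIO(
    "Chr,pos,snp,maf,MSE,X..Df,X..Fvalue,X..Pvalue\n"
    "1,1500,snp_a,0.21,0.9,7,1.2,0.3\n"
    "1,98000,snp_b,0.08,0.8,7,6.9,2.5e-8\n"
    "1,120500,snp_c,0.33,0.9,7,5.1,4.0e-6\n"
)
hits = significant_snps(example)
```

To apply the same filter to a real file, pass an open file handle, for example `open(path, newline="")`, in place of the in-memory example.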

GPoE

dataset_clim-spatial_prediction-rep_02 - directory with results from spatially sampled leave-one-out 10-fold cross-validation
dataset_clim-spatial_prediction-rep_03 - directory with results from randomly sampled 10-fold cross-validation
Each of these directories includes:
- testing_folds_assignment.csv (1 file listing all accessions' fold assignment)
  - accession: ID of accession used in analysis
  - fold: training/testing fold assigned for accession
- U_hat-rep_XX-fold_NN.csv - GPoE predicted environmental variable results for fold NN in rep XX (02 for spatial sample or 03 for random sample), out of 10 total folds (10 files, one per testing fold NN)
  - tmin..X: predicted INT-transformed minimum temperature in growing season
  - tmax..X: predicted INT-transformed maximum temperature in growing season
  - trange..X: predicted INT-transformed temperature range
  - precipTot..X: predicted INT-transformed total precipitation in growing season
  - aridityMean..X: predicted INT-transformed mean aridity in growing season
  - rhMean..X: predicted INT-transformed mean humidity in growing season
  - elevation..X: predicted INT-transformed elevation of collected maize
- withheld-accessions-rep_XX-fold_NN.csv (10 files corresponding to each testing fold NN for each analysis XX, 02 for spatial sample or 03 for random sample)
  - accessions not used in training for fold NN (including buffer zone)
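One way to score a GPoE fold is to correlate the predicted values for the withheld accessions with their observed INT-transformed climate values. The sketch below shows that calculation for a single climate variable. It rests on several assumptions that this README does not confirm: that predictions can be keyed by accession ID, that Pearson's r is the accuracy measure of interest, and that the names `pearson_r` and `fold_accuracy` and all example values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


def fold_accuracy(predicted, observed, withheld):
    """Correlate predicted and observed values for withheld accessions only.

    predicted and observed map accession ID to a value for one climate
    variable; withheld lists the accession IDs excluded from training.
    """
    ids = [a for a in withheld if a in predicted and a in observed]
    return pearson_r([predicted[a] for a in ids], [observed[a] for a in ids])


# Made-up values for one climate variable in one testing fold.
predicted = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 5.0}
observed = {"A": -0.5, "B": 0.1, "C": 0.9, "D": -3.0}
withheld = ["A", "B", "C", "E"]  # "D" was used for training; "E" has no data
r = fold_accuracy(predicted, observed, withheld)
```

Restricting the calculation to withheld accessions matters because accessions used in training, or lying inside the buffer zone of a spatial fold, would otherwise inflate the apparent accuracy.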

MegaLMM output

dataset_clim-rep_02
- G_hat.csv - genetic covariance matrix between climate traits, used in JointGWAS for envGWAS
- P_hat.csv - full covariance matrix between climate traits. Generally unused for downstream analysis.
- R_hat.csv - residual covariance matrix between climate traits, used in JointGWAS for envGWAS
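Covariance matrices such as G_hat and R_hat are often easier to inspect after conversion to correlations, or as the share of each trait's variance attributed to the genetic component. The sketch below shows both conversions on plain nested lists. The function names and the 2 x 2 example matrices are made up, and the variance-share calculation assumes that total variance is the sum of the genetic and residual diagonals, which this README does not state explicitly.

```python
from math import sqrt


def covariance_to_correlation(cov):
    """Convert a square covariance matrix (nested lists) to correlations."""
    size = len(cov)
    sd = [sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)] for i in range(size)]


def genetic_variance_fraction(g_hat, r_hat):
    """Per-trait share of variance attributed to the genetic component.

    Assumes total variance = genetic variance + residual variance.
    """
    return [g_hat[i][i] / (g_hat[i][i] + r_hat[i][i]) for i in range(len(g_hat))]


# Made-up 2 x 2 matrices for two climate traits.
g_example = [[0.6, 0.3], [0.3, 0.4]]
r_example = [[0.4, 0.1], [0.1, 0.6]]

genetic_correlation = covariance_to_correlation(g_example)
fractions = genetic_variance_fraction(g_example, r_example)
```

In this toy example the first trait has 60% of its variance in the genetic component and the second has 40%.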

Phenotypic GWAS

GWAS results for phenotypic traits, including effect size for each trait in each trial, as well as F-values and p-values across trials for the following traits:

ASI
BareCobWeight
DaysToFlowering
FieldWeight
GrainWeightPerHectareCorrected
PlantHeight

Each trait has its own directory containing a single sub-directory, dataset_01_INT, which contains the following analysis files, separated by chromosome number.

GWAS_results_chr_NN.csv (10 files, corresponding to chromosome number NN)
- CHROM: Chromosome number where the SNP is located
- POS: SNP position on the chromosome (v4 assembly).
- ID: SNP identifier
- V4: SNP identifier for assembly v4
- MAF: minor allele frequency
- MSE: mean squared error of model
- Experimento.X..Df: degrees of freedom
- Experimento.X..Fvalue: F-value across all trials
- Experimento.X..Pvalue: p-value significance for F-value
GWAS_beta_hats_chr_NN.csv (10 files, corresponding to chromosome number NN)
- Rows correspond to SNP beta hat effect size of given trait in each trial, with each column representing a different trial.

Phenotypic Prediction

deregressed_blups_nostress_filterTrial - model accuracies for phenotype prediction with all models using deregressed BLUPs (best linear unbiased predictions), for non-stress trials.
unstructured_deregressed_blups_nostress_filterTrial - model accuracies for phenotype prediction, without including population structure, using deregressed BLUPs for non-stress trials.
The format for both of these analysis results is the same. Files in each analysis are separated by trait.
modelPrediction_nostress_NN_results_resid.csv (5 files, with each trait NN corresponding to the following:)
- BareCobWeight
- DaysToFlowering
- FieldWeight
- GrainWeightPerHectareCorrected
- PlantHeight
  Results in each file are described as follows, with each row corresponding to a trial and tester combination used for cross-validation. Model accuracies are Pearson's correlation coefficients (r), ranging from -1 to 1.
  - Experimento: trial these models were tested on
  - fold: tester hybrid line these models were tested on
  - n_train: number of training testcrosses
  - n_test: number of testing testcrosses
  - n_total: total number of testcrosses
  - lm_PC5: model accuracy of rrBLUP linear model using top 5 PCs
  - lm_PC5_env: model accuracy of rrBLUP linear model using top 5 PCs + environmental data
  - lm_all_SNPs: model accuracy of rrBLUP linear model using kinship as gBLUP model
  - lm_all_SNPs_env: model accuracy of rrBLUP linear model using kinship as gBLUP model + environmental data
  - lm_enriched_SNPs: model accuracy of rrBLUP linear model using top 5 PCs + top 32 envGWAS SNPs
  - lm_matching_SNPs: model accuracy of rrBLUP linear model using top 5 PCs + random matching 32 SNPs
  - lm_env: model accuracy of rrBLUP linear model using only environmental data
  - rf_env: model accuracy of random forest using only environmental data
  - rf_env_PC5: model accuracy of random forest using top 5 PCs + environmental data
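Since each genotype-based model has a counterpart that adds environmental data, the benefit of the environmental data can be summarized as the mean paired difference in accuracy across trial and tester combinations. The sketch below illustrates that comparison. The pairing in `MODEL_PAIRS`, the function name `environmental_gain`, the assumption that every accuracy parses as a number, and the example rows are all hypothetical; only the column names come from the list above.

```python
import csv
import io
from statistics import mean

# (baseline model, same model with environmental data added)
MODEL_PAIRS = [
    ("lm_PC5", "lm_PC5_env"),
    ("lm_all_SNPs", "lm_all_SNPs_env"),
]


def environmental_gain(csv_file):
    """Mean change in Pearson's r when environmental data are added.

    Differences are paired within each row, i.e. within each trial and
    tester combination, before averaging.
    """
    rows = list(csv.DictReader(csv_file))
    gains = {}
    for baseline, with_env in MODEL_PAIRS:
        differences = [float(row[with_env]) - float(row[baseline]) for row in rows]
        gains[with_env] = mean(differences)
    return gains


# Made-up rows standing in for one modelPrediction results file.
example = io.StringIO(
    "Experimento,fold,lm_PC5,lm_PC5_env,lm_all_SNPs,lm_all_SNPs_env\n"
    "trial_1,tester_1,0.20,0.22,0.30,0.31\n"
    "trial_2,tester_1,0.10,0.14,0.25,0.24\n"
)
gains = environmental_gain(example)
```

A gain near zero for a pair indicates that environmental data added little beyond what the genotype-based model already captured in those rows.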

Tables

accessions_per_trial_per_tester.csv - number of accessions tested in each trial (row) by elite tester (column) combination
accessions_per_trial_per_tester_original.csv - number of accessions tested in each trial (row) by elite tester (column) combination, but without initial filtering for low sample size, etc. See paper methods.
accessions_per_trial_per_trait.csv - number of accessions tested in each trial (row) by trait (column) combination
GEA-climate-invnormtransformed.csv - ranked inverse-normal transformed (INT) data for collection climate variables used in this analysis. Raw environmental variables are based on WorldClim 2, the Global Aridity Index and Potential Evapotranspiration Database, ERA5, and the SRTM 90m Digital Elevation Database v4.1, and 3-month growing season data is taken from the FAO (see paper methods).
- tmin: INT-transformed minimum temperature in growing season based on WorldClim 2
- tmax: INT-transformed maximum temperature in growing season based on WorldClim 2
- trange: INT-transformed temperature range in growing season as defined by untransformed tmax - untransformed tmin
- precipTot: INT-transformed total aggregate precipitation in growing season based on WorldClim 2
- aridityMean: INT-transformed mean aridity in growing season based on the Global Aridity Index and Potential Evapotranspiration Database
- rhMean: INT-transformed mean humidity in growing season from ERA5
- elevation: INT-transformed elevation of collected maize based on the SRTM 90m Digital Elevation Database v4.1
testers_in_trial.csv - number of elite testers used in each trial (row) for each trait (column)
trial_metadata.csv - metadata of each trial
- Trial ID: trial identifier, including year and location abbreviation
- Year: year trial was held
- Location: city near where the trial was held
- Elevation (m): elevation of the trial location in meters above sea level
- Trial latitude: latitude of the trial (WGS84)
- Trial longitude: longitude of the trial (WGS84)
trials_per_tester.csv - number of trials in which each tester (row) was used to test each trait (column)

Access information

The GitHub repository with analysis scripts is available at github.com/liforrest6/SeeD.

Data was derived from the following sources:

Imputed GBS SNP data used for genetic analysis is available at https://data.cimmyt.org/dataset.xhtml?persistentId=hdl:11529/8702394.
Curated information of field trial metadata and phenotypic BLUP data is available at https://data.cimmyt.org/dataset.xhtml?persistentId=hdl:11529/10548233
Base layer map tiles for Figures 1 and S13 are by Stamen Design (https://stamen.com/), under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). Data by OpenStreetMap (https://www.openstreetmap.org/), under ODbL (https://www.openstreetmap.org/copyright).