Improving genomic prediction for plant disease using environmental covariates
Data files
Aug 16, 2025 version files 219.94 MB
-
README.md
12.89 KB
-
URSN_GxE_dryad.zip
219.92 MB
Abstract
Fusarium head blight (FHB) is a devastating fungal disease affecting wheat and barley, with susceptibility influenced by genotype, environment, and genotype-by-environment interactions (GxE). This study investigates GxE in a multi-environment trial dataset spanning 30 years from a collaborative nursery established in 1995 to assess resistant genotypes from spring wheat breeding programs across the northern U.S.
Traditionally, GxE has been analyzed as a reaction norm over an environmental index. Here, we computed the environment index as a linear combination of environmental covariates specific to each environment, and we derived an environment relationship matrix. Three methods were compared, all aimed at predicting untested genotypes in untested environments: the widely used Finlay-Wilkinson regression (FW), the joint-genomic regression analysis (JGRA) method, and mixed models incorporating an environmental relationship matrix. These were benchmarked against a baseline genomic selection model (GS) without environmental covariates. Predictive abilities were assessed within and across environments.
The results revealed that the JGRA marker effect method was more accurate than GS in within- and across-environment predictions, although the differences were small. The predictive ability slightly decreased when the target environment was less related to the training environments. Mixed models performed similarly to JGRA within-environment, but JGRA outperformed the other methods for across-environment predictions. Additionally, JGRA identified significant genetic markers associated with baseline FHB resistance and environmental sensitivity.
These findings highlight the value of incorporating environmental covariates to increase predictive ability and improve the selection of resistant genotypes for diverse, untested environments.
Readme URSN GxE
author: Charlotte Brault
This Read Me file was generated on 2025-08-13 by Charlotte Brault
Author Information
Principal Investigator Contact Information
James A. Anderson
Department of Agronomy and Plant Genetics, University of Minnesota 411 Borlaug Hall, 1991 Buford Circle, St. Paul, MN 55108
Email: ander319@umn.edu{.email}
ORCID: 0000-0003-4655-6517
Associate or Co-investigator Contact Information
Jason D. Fiedler
USDA-ARS Cereal Crops Improvement Research Unit Edward T. Schafer Agricultural Research Center, Fargo, ND, USA
Email: jason.fiedler@usda.gov{.email}
ORCID: 0000-0001-7736-4484
Associate or Co-investigator Contact Information
Charlotte Brault
Department of Agronomy and Plant Genetics, University of Minnesota 411 Borlaug Hall, 1991 Buford Circle St. Paul, MN 55108
Email: cbrault@umn.edu{.email} / charlotte.brault@live.com{.email}
ORCID: 0000-0001-7892-4236
- Date of data collection (single date, range, approximate date): 1995-2024
- Geographic location of data collection (where was data collected?): Northern Great Plains-Midwest (MN, ND, SD, Manitoba)
- Information about funding sources that supported the collection of the data: U.S Wheat and Barley Scab Initiative, USDA-ARS.
Sharing/Access Information
Licenses/restrictions placed on the data: CC0
Data and file description
The dataset consists of a main zip folder: URSN_GxE_dryad.zip
Weather data
data/NASA_URSN_soiltypes_characteristics.tsv
Soil types' characteristics for the URSN trial locations were predicted bySoilType
R package.- Data/weather folder. This folder contains the downloaded weather data from NOAA for the weather stations closest to the URSN trials.
data/weather/weather_var_description.xlsx
Description of the weather variables used in the analysis.data/weather/StationID.csv
. Link between the weather station ID and the URSN trial locations.
Trial information
data/URSN_trial_info.csv
GPS coordinates of the URSN trial locations.data/URSN_computed_plant_harvest_dates.csv
Planting and harvest dates for the URSN trials.
Phenotypic data
data/URSN_cleaned_table_phenot_1995_2024.tsv
Curated phenotypic data for the URSN population for DIS and VSK traits.data/URSN_mean_phenotype_1995-2024.rds
Mean phenotypic data for the URSN population for DIS and VSK traits at the environment level.
Genotypic data
data/URSN_3K_1968-2024_formatted_imputed_005mafFilt_gdose.rds
matrix of markers, obtained using the 3K genotypic array.
For confidentiality reasons, some private lines' genotypes have been removed from that file; thus, the results obtained when rerunning the scripts will be different from those in the manuscript.data/URSN_genotype_info_1995-2024.tsv
Table of genomic information for the URSN population.
Enviromics data
The folder data/enviromics
contains environmental data that has been generated by downloading data from NOAA or NASA sources for all the environmental covariates (weather and soil).
Code
useful_functions.R
Script that wraps custom, useful functions for pre-processing, analyzing data, and producing results
Bash
This folder contains the bash scripts used to run some of the analyses on the cluster.
Rmd.sh
Script to launch various Rmarkdown scripts on a High-Performance Computing (HPC) infrastructure.
Analysis scripts
Prepare inputs
Extract weather and soil data
analysis/extract_nasa_envt_covariates_URSN.Rmd
. In this script, we extract the weather data from NASAnasapower
and the soil data from theSoilType
R package for the URSN trial locations.analysis/extract_noaa_envt_covariates_URSN.Rmd
. In this script, we extract the weather data from NOAA for the weather stations closest to the URSN trials.
The weather variables were filled with the data from NASA, generated inanalysis/extract_nasa_envt_covariates_URSN.Rmd a
script.analysis/envt_nasa_phenotype_analysis.Rmd
andanalysis/envt_noaa_phenotype_analysis.Rmd
. In these scripts, we analyze the relationship between the weather variables.
Format and explore data
analysis/format_data_GxE_URSN.Rmd
. In this script, we load the phenotypic, environmental, and genotypic data and format them for further analysis. Notably, genotypes with fewer than 10 observations in a given environment were discarded. This script generatesoutput/inputs_GxE_URSN_filtered.rds
andoutput/inputs_GxE_URSN_all.rds
which will be used in the subsequent scripts.analysis/explore_data_GxE_URSN.Rmd
. In this script, we load the imputation results for the URSN population from the 3K to the 90K array.analysis/explore_EC_variations.Rmd
. In this script, we explore the environmental covariates variation, through a principal component analysis, exploring the environmental structure, the distribution of the location-specific variables, etc.analysis/extract_envt_index_URSN.Rmd
. In this script, we load phenotypic data and environment covariates, and we apply a Partial Least Squares Regression to predict the environmental index based on environmental covariates. We measure the accuracy of prediction by cross-validation.analysis/phenot_variance_components.Rmd
. In this script, we load the phenotypic data and compute the variance components analysis for the DIS and VSK traits.analysis/envirotyping_explore_outputs.Rmd
. In this script, we load output matrices from the envirotyping pipeline, and we study in-depth the relatedness between covariables, phenotype, and computed matrices.
Predict GxE using environmental covariates
analysis/URSN_FWregression_ECs.Rmd
. In this script, we load phenotypic data and environmental covariates (ECs), and we apply Finlay-Wilkinson regression to predict reaction norm intercept, slope, and phenotypic values based on mean phenotype and ECs. We compare the results with genomic prediction.analysis/genom_pred_JGRA_envt_index.Rmd
. In this script, we explore the genomic and phenotypic data for the UMN breeding program and the URSN population. It outputs plots of the genetic structure (PCA) and the genetic correlation and distribution between populations.analysis/genom_pred_URSN_envirotyping_scenarios_models.Rmd
. In this script, we load phenotypic data and ECs, and we apply mixed models to predict genetic (G), environmental (E), and genotype-by-environment (GxE) effects.
Analyze outputs
analysis/compare_JGRA_environment_index.Rmd
. In this script, we load the results from JGRA and compare various environment indices.analysis/analyze_mixedModels_methods_outputs.Rmd
. In this script, we load results from mixed models and we compare the predictive ability within- and across-environments for various prediction methods.analysis/analyze_envirotyping_outputs.Rmd
. In this script, we load results from mixed models, FW, and JGRA, and we compare the predictive ability within- and across-environments.analysis/GWAS_JGRA_envt_index.Rmd
. In this script, we load the results from JGRA and we estimate the significance of markers for G and GxE, as with a genome-wide association study (GWAS).analysis/predict_best_geno_loc_JGRA.Rmd
. In this script, we load the results from JGRA and we predict the best genotypes for each location.
Other
index.Rmd
. This script is the index of the project; it contains the description of the project and the links to the different scripts.about.Rmd
. This script contains the information about the project.license.Rmd
. This script contains the license information.
Output
Environment index
URSN_PLS_GDDstages_soilVars_LOO_environmentIndex.tsv
contains the environment index computed with all environmental covariates for all the stages, including or not including soil variables. For each environment, a single-value environment index is provided, corresponding to the leave-one-environment out cross-validation.
Columns:
- Trait: Phenotypic trait analyzed (DIS or VSK).
- Env. variables: Growth stage or variable group used in prediction (e.g., "V35", "all", "location
- soil.vars: Whether location-specific environmental covariates were included (TRUE/FALSE).
- env_code: Unique environment identifier (location_year).
- meanY: Observed mean phenotype per environment.
- envIdx_PLSpred: Environment index predicted by PLSR in leave-one-environment-out cross-validation.
URSN_3ECvars_sum_environmentIndex_2traits.tsv
Contains the environment index computed by a multiple linear regression computed with 3 environmental covariates around anthesis time.URSN_3ECvars_lm_environmentIndex_2traits.tsv
contains the environment index computed as a sum computed with 3 environmental covariates around anthesis time.
FW
URSN_FWregression_interceptSlope_table_envMean.tsv
estimation of intercept and slope parameters for all genotypes and the two traits.
Columns:
- Trait: trait name: DIS or VSK
- scenario: prediction scenario (always unknown environment): "knLoc.knYr" when location and year from the predicted environment were included in the training set, or "nLoc.nYr" were locations and years from the environment to be predicted are excluded from the training set.
- env_code: environment coded as LOCATION_YEAR
- PA.intercept: predictive ability for the intercept
- PA.slope: predictive ability for the slope
- PA.idx: predictive ability for phenotype prediction, based on the environment index (coming from environmental covariates)
- PA.meanEnv: predictive ability for phenotype prediction, based on the mean phenotype per environment
FWreg_GSpred_intercept_slope_yhat_accuracy_PLSR_best.tsv
predictive ability for predicting the intercept and the slope by cross-validation using the FW method.
- Columns:
- line: genotype name
- scenario: prediction scenario (always unknown environment): "knLoc.knYr" when location and year from the predicted environment were included in the training set, or "nLoc.nYr" were locations and years from the environment to be predicted are excluded from the training set.
- env_code: environment coded as LOCATION_YEAR
- trait: trait name: DIS or VSK
- obs: observed phenotype value
- pred.int: predicted intercept
- pred. slope: predicted slope
- pred. envtindex: predicted phenotypic value using the formula: intercept + slope x environment. index
FWreg_GSpred_obspred_envtindex_PLSR_best.tsv
observed vs. predicted intercept and slope values for all genotypes and environments predicted with the FW method.
Bayesian Mixed models
NOAA_GP_results_envt_all_scenarios_2traits_URSN_4M_cmon.envs_BayesB_redef.tsv
: predictive ability by environment, prediction scenario, and trait for the four models (M1:G, M2:G+E, M3:G+GxE, M4:G+E+GxE) for the BayesB method. Similar files were provided for the other methods tested (BayesA, BayesC, BL, BRR, and RKHS).NOAA_GP_obspred_envt_all_scenarios_2traits_URSN_4M_cmon.envs_BayesB_redef.tsv
: observed vs. predicted values for all environments and genotypes by prediction scenario and trait for the four models (M1:G, M2:G+E, M3:G+GxE, M4:G+E+GxE) for the BayesB method. Similar files were provided for the other methods tested (BayesA, BayesC, BL, BRR, and RKHS).
JGRA
JGRA.marker_2traits_URSN_rrBLUP_PLSR_best_resPA.tsv
: predictive ability values for JGRA reaction norm and JGRA marker for all environments.JGRA.marker_GP_2traits_rrBLUP_PLSR_best_obspred.tsv
: observed vs. predicted values for every genotype and environment.JGRA.marker_envIdxSoil_GEBV_per_loc_DIS.tsv
, andJGRA.marker_envIdxSoil_GEBV_per_loc_VSK.tsv
are the genomic estimated breeding values for the JGRA marker method and the two traits, using only location-specific environmental covariables to generate the environment index? These files were used as input in theanalysis/predict_best_geno_loc_JGRA.Rmd
script.JGRA.marker_2traits_URSN_meanPheno_summaryMrkLm_alldata.rds
Is the marker effect estimated with the JGRA marker using the mean phenotype as an environment index?
All
resPA_3methods_5scenarios_2traits_URSN.tsv
is the combined result for the 3 types of models implemented (FW, JGRA, and mixed models)
Other folders
- output/figures folder. This folder contains the figures generated in all scripts.
- Output/enviromics folder. This folder contains the outputs of the description of environmental data.
- Output/temp folder. This folder contains the outputs of the BGLR function as temporary files.
The docs folder contains the rendered Rmd scripts in HTML format.
Software and versions
- R version v4.5.0 and RStudio v2025.08.0