Improving genomic prediction for plant disease using environmental covariates

Brault, Charlotte 1 ; Conley, Emily1; Read, Andrew2; Green, Andrew3; Glover, Karl4; Cook, Jason5; Gill, Harsimardeep1; Anderson, James1; Fiedler, Jason6

Published Aug 16, 2025 on Dryad. https://doi.org/10.5061/dryad.wstqjq2z0

Data files

Aug 16, 2025 version files 219.94 MB

README.md
12.89 KB
URSN_GxE_dryad.zip
219.92 MB

Abstract

Fusarium head blight (FHB) is a devastating fungal disease affecting wheat and barley, with susceptibility influenced by genotype, environment, and genotype-by-environment interactions (GxE). This study investigates GxE in a multi-environment trial dataset spanning 30 years from a collaborative nursery established in 1995 to assess resistant genotypes from spring wheat breeding programs across the northern U.S.

Traditionally, GxE has been analyzed as a reaction norm over an environmental index. Here, we computed the environment index as a linear combination of environmental covariates specific to each environment, and we derived an environment relationship matrix. Three methods were compared, all aimed at predicting untested genotypes in untested environments: the widely used Finlay-Wilkinson regression (FW), the joint-genomic regression analysis (JGRA) method, and mixed models incorporating an environmental relationship matrix. These were benchmarked against a baseline genomic selection model (GS) without environmental covariates. Predictive abilities were assessed within and across environments.

The results revealed that the JGRA marker effect method was more accurate than GS in within- and across-environment predictions, although the differences were small. The predictive ability slightly decreased when the target environment was less related to the training environments. Mixed models performed similarly to JGRA within-environment, but JGRA outperformed the other methods for across-environment predictions. Additionally, JGRA identified significant genetic markers associated with baseline FHB resistance and environmental sensitivity.

These findings highlight the value of incorporating environmental covariates to increase predictive ability and improve the selection of resistant genotypes for diverse, untested environments.

Readme URSN GxE
author: Charlotte Brault

This Read Me file was generated on 2025-08-13 by Charlotte Brault

Author Information

Principal Investigator Contact Information

James A. Anderson

Department of Agronomy and Plant Genetics, University of Minnesota 411 Borlaug Hall, 1991 Buford Circle, St. Paul, MN 55108

Email: ander319@umn.edu{.email}

ORCID: 0000-0003-4655-6517

Associate or Co-investigator Contact Information

Jason D. Fiedler

USDA-ARS Cereal Crops Improvement Research Unit Edward T. Schafer Agricultural Research Center, Fargo, ND, USA

Email: jason.fiedler@usda.gov{.email}

ORCID: 0000-0001-7736-4484

Associate or Co-investigator Contact Information

Charlotte Brault

Department of Agronomy and Plant Genetics, University of Minnesota 411 Borlaug Hall, 1991 Buford Circle St. Paul, MN 55108

Email: cbrault@umn.edu{.email} / charlotte.brault@live.com{.email}

ORCID: 0000-0001-7892-4236

Date of data collection (single date, range, approximate date): 1995-2024
Geographic location of data collection (where was data collected?): Northern Great Plains-Midwest (MN, ND, SD, Manitoba)
Information about funding sources that supported the collection of the data: U.S Wheat and Barley Scab Initiative, USDA-ARS.

Sharing/Access Information

Licenses/restrictions placed on the data: CC0

Data and file description

The dataset consists of a main zip folder: URSN_GxE_dryad.zip

Weather data

data/NASA_URSN_soiltypes_characteristics.tsv Soil types' characteristics for the URSN trial locations were predicted by SoilType R package.
Data/weather folder. This folder contains the downloaded weather data from NOAA for the weather stations closest to the URSN trials.
data/weather/weather_var_description.xlsx Description of the weather variables used in the analysis.
data/weather/StationID.csv. Link between the weather station ID and the URSN trial locations.

Trial information

data/URSN_trial_info.csv GPS coordinates of the URSN trial locations.
data/URSN_computed_plant_harvest_dates.csv Planting and harvest dates for the URSN trials.

Phenotypic data

data/URSN_cleaned_table_phenot_1995_2024.tsv Curated phenotypic data for the URSN population for DIS and VSK traits.
data/URSN_mean_phenotype_1995-2024.rds Mean phenotypic data for the URSN population for DIS and VSK traits at the environment level.

Genotypic data

data/URSN_3K_1968-2024_formatted_imputed_005mafFilt_gdose.rds matrix of markers, obtained using the 3K genotypic array.
For confidentiality reasons, some private lines' genotypes have been removed from that file; thus, the results obtained when rerunning the scripts will be different from those in the manuscript.
data/URSN_genotype_info_1995-2024.tsv Table of genomic information for the URSN population.

Enviromics data

The folder data/enviromics contains environmental data that has been generated by downloading data from NOAA or NASA sources for all the environmental covariates (weather and soil).

Code

useful_functions.R Script that wraps custom, useful functions for pre-processing, analyzing data, and producing results

Bash

This folder contains the bash scripts used to run some of the analyses on the cluster.

Rmd.sh Script to launch various Rmarkdown scripts on a High-Performance Computing (HPC) infrastructure.

Analysis scripts

Prepare inputs

Extract weather and soil data

analysis/extract_nasa_envt_covariates_URSN.Rmd. In this script, we extract the weather data from NASA nasapower and the soil data from the SoilType R package for the URSN trial locations.
analysis/extract_noaa_envt_covariates_URSN.Rmd. In this script, we extract the weather data from NOAA for the weather stations closest to the URSN trials.
The weather variables were filled with the data from NASA, generated in analysis/extract_nasa_envt_covariates_URSN.Rmd a script.
analysis/envt_nasa_phenotype_analysis.Rmd and analysis/envt_noaa_phenotype_analysis.Rmd. In these scripts, we analyze the relationship between the weather variables.

Format and explore data

analysis/format_data_GxE_URSN.Rmd. In this script, we load the phenotypic, environmental, and genotypic data and format them for further analysis. Notably, genotypes with fewer than 10 observations in a given environment were discarded. This script generates output/inputs_GxE_URSN_filtered.rds and output/inputs_GxE_URSN_all.rds which will be used in the subsequent scripts.
analysis/explore_data_GxE_URSN.Rmd. In this script, we load the imputation results for the URSN population from the 3K to the 90K array.
analysis/explore_EC_variations.Rmd. In this script, we explore the environmental covariates variation, through a principal component analysis, exploring the environmental structure, the distribution of the location-specific variables, etc.
analysis/extract_envt_index_URSN.Rmd. In this script, we load phenotypic data and environment covariates, and we apply a Partial Least Squares Regression to predict the environmental index based on environmental covariates. We measure the accuracy of prediction by cross-validation.
analysis/phenot_variance_components.Rmd. In this script, we load the phenotypic data and compute the variance components analysis for the DIS and VSK traits.
analysis/envirotyping_explore_outputs.Rmd. In this script, we load output matrices from the envirotyping pipeline, and we study in-depth the relatedness between covariables, phenotype, and computed matrices.

Predict GxE using environmental covariates

analysis/URSN_FWregression_ECs.Rmd. In this script, we load phenotypic data and environmental covariates (ECs), and we apply Finlay-Wilkinson regression to predict reaction norm intercept, slope, and phenotypic values based on mean phenotype and ECs. We compare the results with genomic prediction.
analysis/genom_pred_JGRA_envt_index.Rmd. In this script, we explore the genomic and phenotypic data for the UMN breeding program and the URSN population. It outputs plots of the genetic structure (PCA) and the genetic correlation and distribution between populations.
analysis/genom_pred_URSN_envirotyping_scenarios_models.Rmd. In this script, we load phenotypic data and ECs, and we apply mixed models to predict genetic (G), environmental (E), and genotype-by-environment (GxE) effects.

Analyze outputs

analysis/compare_JGRA_environment_index.Rmd. In this script, we load the results from JGRA and compare various environment indices.
analysis/analyze_mixedModels_methods_outputs.Rmd. In this script, we load results from mixed models and we compare the predictive ability within- and across-environments for various prediction methods.
analysis/analyze_envirotyping_outputs.Rmd. In this script, we load results from mixed models, FW, and JGRA, and we compare the predictive ability within- and across-environments.
analysis/GWAS_JGRA_envt_index.Rmd. In this script, we load the results from JGRA and we estimate the significance of markers for G and GxE, as with a genome-wide association study (GWAS).
analysis/predict_best_geno_loc_JGRA.Rmd. In this script, we load the results from JGRA and we predict the best genotypes for each location.

Other

index.Rmd. This script is the index of the project; it contains the description of the project and the links to the different scripts.
about.Rmd. This script contains the information about the project.
license.Rmd. This script contains the license information.

Output

Environment index

URSN_PLS_GDDstages_soilVars_LOO_environmentIndex.tsv contains the environment index computed with all environmental covariates for all the stages, including or not including soil variables. For each environment, a single-value environment index is provided, corresponding to the leave-one-environment out cross-validation.

Columns:

Trait: Phenotypic trait analyzed (DIS or VSK).
Env. variables: Growth stage or variable group used in prediction (e.g., "V35", "all", "location
soil.vars: Whether location-specific environmental covariates were included (TRUE/FALSE).
env_code: Unique environment identifier (location_year).
meanY: Observed mean phenotype per environment.
envIdx_PLSpred: Environment index predicted by PLSR in leave-one-environment-out cross-validation.

URSN_3ECvars_sum_environmentIndex_2traits.tsv Contains the environment index computed by a multiple linear regression computed with 3 environmental covariates around anthesis time.
URSN_3ECvars_lm_environmentIndex_2traits.tsv contains the environment index computed as a sum computed with 3 environmental covariates around anthesis time.

URSN_FWregression_interceptSlope_table_envMean.tsv estimation of intercept and slope parameters for all genotypes and the two traits.

Columns:

Trait: trait name: DIS or VSK
scenario: prediction scenario (always unknown environment): "knLoc.knYr" when location and year from the predicted environment were included in the training set, or "nLoc.nYr" were locations and years from the environment to be predicted are excluded from the training set.
env_code: environment coded as LOCATION_YEAR
PA.intercept: predictive ability for the intercept
PA.slope: predictive ability for the slope
PA.idx: predictive ability for phenotype prediction, based on the environment index (coming from environmental covariates)
PA.meanEnv: predictive ability for phenotype prediction, based on the mean phenotype per environment

FWreg_GSpred_intercept_slope_yhat_accuracy_PLSR_best.tsv predictive ability for predicting the intercept and the slope by cross-validation using the FW method.

Columns:
- line: genotype name
- scenario: prediction scenario (always unknown environment): "knLoc.knYr" when location and year from the predicted environment were included in the training set, or "nLoc.nYr" were locations and years from the environment to be predicted are excluded from the training set.
- env_code: environment coded as LOCATION_YEAR
- trait: trait name: DIS or VSK
- obs: observed phenotype value
- pred.int: predicted intercept
- pred. slope: predicted slope
- pred. envtindex: predicted phenotypic value using the formula: intercept + slope x environment. index

FWreg_GSpred_obspred_envtindex_PLSR_best.tsv observed vs. predicted intercept and slope values for all genotypes and environments predicted with the FW method.

Bayesian Mixed models

NOAA_GP_results_envt_all_scenarios_2traits_URSN_4M_cmon.envs_BayesB_redef.tsv: predictive ability by environment, prediction scenario, and trait for the four models (M1:G, M2:G+E, M3:G+GxE, M4:G+E+GxE) for the BayesB method. Similar files were provided for the other methods tested (BayesA, BayesC, BL, BRR, and RKHS).
NOAA_GP_obspred_envt_all_scenarios_2traits_URSN_4M_cmon.envs_BayesB_redef.tsv: observed vs. predicted values for all environments and genotypes by prediction scenario and trait for the four models (M1:G, M2:G+E, M3:G+GxE, M4:G+E+GxE) for the BayesB method. Similar files were provided for the other methods tested (BayesA, BayesC, BL, BRR, and RKHS).

JGRA

JGRA.marker_2traits_URSN_rrBLUP_PLSR_best_resPA.tsv: predictive ability values for JGRA reaction norm and JGRA marker for all environments.
JGRA.marker_GP_2traits_rrBLUP_PLSR_best_obspred.tsv: observed vs. predicted values for every genotype and environment.
JGRA.marker_envIdxSoil_GEBV_per_loc_DIS.tsv, and JGRA.marker_envIdxSoil_GEBV_per_loc_VSK.tsv are the genomic estimated breeding values for the JGRA marker method and the two traits, using only location-specific environmental covariables to generate the environment index? These files were used as input in the analysis/predict_best_geno_loc_JGRA.Rmd script.
JGRA.marker_2traits_URSN_meanPheno_summaryMrkLm_alldata.rds Is the marker effect estimated with the JGRA marker using the mean phenotype as an environment index?

All

resPA_3methods_5scenarios_2traits_URSN.tsv is the combined result for the 3 types of models implemented (FW, JGRA, and mixed models)

Other folders

output/figures folder. This folder contains the figures generated in all scripts.
Output/enviromics folder. This folder contains the outputs of the description of environmental data.
Output/temp folder. This folder contains the outputs of the BGLR function as temporary files.

The docs folder contains the rendered Rmd scripts in HTML format.

Software and versions

R version v4.5.0 and RStudio v2025.08.0

Improving genomic prediction for plant disease using environmental covariates

Data files

Abstract

README: Improving genomic prediction for plant disease using environmental covariates