Data from: Leveraging historical trials to predict Fusarium head blight resistance in spring wheat breeding programs

Brault, Charlotte¹; Conley, Emily¹; Green, Andrew²; Glover, Karl³; Cook, Jason⁴; Gill, Harsimardeep¹; Read, Andrew⁵; Fiedler, Jason⁵; Anderson, James¹

Published Jan 07, 2025 on Dryad. https://doi.org/10.5061/dryad.fj6q5743r

Data files

Nov 21, 2024 version files (408.92 MB total)

README.md (12.04 KB)
URSN_GS.zip (408.91 MB)

Abstract

Fusarium head blight (FHB) is a fungal disease that poses a major threat to wheat production. Plant breeding that leverages genotyping is an effective way to improve the genetic resistance of cultivars. Started in 1995, the Uniform Regional Scab Nursery (URSN) consists of germplasm from several public breeding programs in the northern U.S. region. Its main objective is to showcase new sources of resistance and enable germplasm exchange among the cooperators; however, the data from the URSN have not previously been studied. Phenotypic and genotypic data were gathered from this nursery and from two current breeding programs in the U.S. Midwest. Genomic prediction was applied to eight FHB-related and agronomic traits, and the effects of statistical method, marker density, training set size, genetic structure, and genetic architecture of the trait were studied. Within the URSN population, reproducing kernel Hilbert space (RKHS) regression was the best method in various prediction settings, with an average accuracy of 0.63; marker density could be as low as 500 markers without decreasing prediction accuracy; and training set optimization was useful for two traits. Furthermore, genotypic values were predicted in the breeding programs using the URSN population as a training set under various prediction scenarios. Predicting unrelated populations led to a significant decrease in accuracy, but values remained encouraging for some traits and populations. Finally, as the number of lines from the breeding populations in the training set was progressively decreased, the advantage of adding the URSN population became more pronounced, with an increase in accuracy of up to 0.19.

This Read Me file was generated on 2024-11-15 by Charlotte Brault

Title of Dataset

Supporting data for Leveraging historical trials to predict Fusarium head blight resistance in wheat breeding programs.

Author Information

Principal Investigator Contact Information

James A. Anderson

Department of Agronomy and Plant Genetics, University of Minnesota, 411 Borlaug Hall, 1991 Buford Circle, St. Paul, MN 55108

Email: ander319@umn.edu

ORCID: 0000-0003-4655-6517

Associate or Co-investigator Contact Information

Jason D. Fiedler

USDA-ARS Cereal Crops Research Unit, Edward T. Schafer Agricultural Research Center, Fargo, ND, USA

Email: jason.fiedler@usda.gov

ORCID: 0000-0001-7736-4484

Associate or Co-investigator Contact Information

Charlotte Brault

Department of Agronomy and Plant Genetics, University of Minnesota, 411 Borlaug Hall, 1991 Buford Circle, St. Paul, MN 55108

Email: cbrault@umn.edu / charlotte.brault@live.com

ORCID: 0000-0001-7892-4236

Date of data collection (single date, range, approximate date): 1995-2023
Geographic location of data collection (where was data collected?): Northern Great Plains-Midwest (MN, ND, SD, Manitoba)
Information about funding sources that supported the collection of the data: U.S. Wheat and Barley Scab Initiative, USDA-ARS.
Overview of the data (abstract): Genomic prediction was used as a tool to improve the genetic resistance of spring wheat to Fusarium head blight, using data from a historical nursery and from two current breeding programs. Parameters related to genomic predictive ability were thoroughly studied. Predictive ability within the historical nursery was medium to high for all eight traits studied, and all methods gave similar results. Unlike training set size, marker density did not affect predictive ability. Training set optimization had mixed results depending on the trait. We tested several prediction scenarios useful in a breeding context by including the historical dataset in the training set for predicting breeding lines. While the lack of genetic relatedness decreased the accuracy of genomic prediction, we showed that breeding programs could benefit from this historical data by incorporating it into training models, thus reducing the phenotyping effort.

Sharing/Access Information

Licenses/restrictions placed on the data: CC0

Data and file description

Data

URSN and URN populations

Phenotypic data

URN_cleaned_table_phenot_1992_2023.tsv: Curated phenotypic data for the URN population phenotyped for FHB-related traits.
URSN_cleaned_table_phenot_1995_2023.tsv: Curated phenotypic data for the URSN population.

Genotypic data

URN_URSN_90K_Filt_geno_formatted_imputed_mafFilt_gdose.rds: Input genotypic file for the URSN population for the 90K genotypic array.
URN_URSN_3K_geno_formatted_imputed_mafFilt_gdose.rds: Input genotypic file for the 3K genotypic array.
genotype_info_URN_URSN.tsv: Table of genomic information for the URN and URSN populations.

NDSU breeding program

NDSU_URSN_across-pop_list_inputs_v2.rds: Combined phenotypic and genomic data from NDSU and URSN populations.

UMN breeding program

geno_combined_URSN_UMN-BP_imputed_Beagle_v2.rds: Genotypic data from the UMN breeding program and the URSN population.
pheno_combined_URSN_UMN-BP_5traits_v2.rds: Phenotypic data from the UMN breeding program and the URSN population.
2022 PY Entry list.xlsx: List of PY entries for the UMN breeding program for 2022 phenotypic data, with the genotype name, locations, and pedigree.
2023 PY Entry list.xlsx: List of PY entries for the UMN breeding program for 2023 phenotypic data, with the genotype name, locations, and pedigree.
UMN_parent_compo.csv: Table of the composition of the genotypes used from the UMN breeding program.

Code

useful_functions.R: Script that gathers custom functions for pre-processing, analyzing data, and producing results.
subsetMrkCV.R: Script to subset markers multiple times for testing the effect of marker density.
subsetIndsCV.R: Script to subset individuals multiple times to estimate the imputation accuracy.
imputationQuality.R: Script to estimate the imputation accuracy based on predicted genotypes.
beagle.01Mar24.d36.jar: Beagle software for imputation.
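
As a rough illustration of the repeated marker subsetting described for subsetMrkCV.R, the Python sketch below draws several random marker subsets at each target density. It is not the authors' R code: the function name, the densities, the number of repeats, and the toy marker names are all assumptions chosen for illustration.

```python
import random


def subset_markers(marker_names, densities, n_repeats, seed=1):
    """Draw n_repeats random marker subsets for each target density.

    Hypothetical helper: returns {density: [subset_1, ..., subset_n]}.
    Densities larger than the number of available markers are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    subsets = {}
    for density in densities:
        if density > len(marker_names):
            continue
        subsets[density] = [
            sorted(rng.sample(marker_names, density)) for _ in range(n_repeats)
        ]
    return subsets


# Toy marker names (assumed); real names would come from the genotype matrix.
markers = [f"mrk_{i:04d}" for i in range(3000)]
subsets = subset_markers(markers, densities=[100, 500, 1000], n_repeats=5)
for density, repeats in subsets.items():
    print(density, len(repeats), len(repeats[0]))
```

Each subset would then be used to restrict the genotype matrix before running genomic prediction, so that accuracy can be compared across densities.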

Bash

This folder contains the bash scripts used to run some of the analyses on the cluster.

launch_array.sh: Script to launch the array job for marker selection (with subsetMrkCV.R) and genomic prediction within the URSN population.
launch_Rmdscript.sh: Script to launch various Rmd scripts for genomic prediction on the cluster.
loop_imputation_3k_to_90k.sh: Script to launch the genomic imputation from the 3K to the 90K genotypic array in the URN-URSN population.

Analysis scripts

Prepare inputs

Phenotypic data

extract_blups_URSN_data.Rmd. In this script, we extract the BLUPs from the phenotypic data for the URSN population, using the URSN_cleaned_table_phenot_1995_2023.tsv table. It outputs a table of BLUPs, URSN_geno_blups_lme4-values.tsv, and a table of fitting information, URSN_lme4_fit-info_all_traits.tsv.
extract_blups_URN_FHB.Rmd. In this script, we extract the BLUPs from the phenotypic data for the URN population, using the URN_cleaned_table_phenot_1992_2023.tsv table. It outputs a table of BLUPs, URN_geno_blups_lme4-values_FHBtraits.tsv, and a table of fitting information, URN_lme4_fit-info_all_traits_FHBtraits.tsv.
analyze_BLUPs_URN_URSN.Rmd. In this script, we analyze the BLUPs from the URSN and URN populations. It outputs plots of phenotypic structure and genetic correlations among traits.

Genomic data URSN

explo_geno_URN_URSN.Rmd. In this script, we explore the genomic data for the URSN population. It outputs plots of the genetic structure, additive relatedness and linkage disequilibrium.
URN_URSN_measure_imput_quality.Rmd. In this script, we load the imputation results for the URSN population from the 3K to the 90K array.
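
The metric used to score imputation from the 3K to the 90K array is not stated in this README. As a hedged sketch only, the Python example below computes two commonly used measures on masked genotypes: the proportion of identical calls (concordance) and the squared Pearson correlation between true and imputed allele dosages. The function names and the toy dosages are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one sequence is constant
    return sxy / math.sqrt(sxx * syy)


def imputation_accuracy(true_dosages, imputed_dosages):
    """Return (concordance, r2) for masked genotypes coded as 0/1/2 dosages."""
    if len(true_dosages) != len(imputed_dosages):
        raise ValueError("true and imputed dosages must have the same length")
    matches = sum(1 for t, i in zip(true_dosages, imputed_dosages) if t == i)
    concordance = matches / len(true_dosages)
    r2 = pearson(true_dosages, imputed_dosages) ** 2
    return concordance, r2


# Toy data (assumed): one of eight masked genotypes is imputed incorrectly.
true_dosages = [0, 1, 2, 2, 0, 1, 0, 2]
imputed_dosages = [0, 1, 2, 1, 0, 1, 0, 2]
concordance, r2 = imputation_accuracy(true_dosages, imputed_dosages)
print(concordance, round(r2, 3))
```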

URSN and UMN data

explore_UMN_BP_URSN.Rmd. In this script, we explore the genomic and phenotypic data for the UMN breeding program and the URSN population. It outputs plots of the genetic structure (PCA) and the genetic correlation and distribution between populations.

Genomic prediction within URSN

genom_pred_URN_URSN.Rmd. In this script, we perform genomic prediction within the URSN population, using various inputs.
genom_pred_URN_URSN_exclude_orga.Rmd. In this script, we perform genomic prediction by excluding in turn each organization.
genom_pred_URN_URSN_subsetGeno.Rmd. In this script, we perform genomic prediction by subsetting genotypes from the 3K genotypic array to match the number of individuals in the 90K genotypic array.
genom_pred_URSN_TSoptSFSI.Rmd. In this script, we perform genomic prediction within the URSN population, using training set optimization with sparse selection index.
analyze_genom_pred_results_URSN.Rmd. In this script, we combine and analyze the results of the genomic prediction within the URSN population. All results were combined in a table: URSN_genomic_prediction_all_combined_results.tsv.
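
Excluding each organization in turn amounts to a leave-one-group-out validation scheme. The Python sketch below only builds the training and validation splits; it does not fit any prediction model and is not the authors' implementation. The function name and the line and organization labels are made up for illustration.

```python
from collections import defaultdict


def leave_one_organization_out(line_to_org):
    """Yield (held_out_org, training_lines, validation_lines) for each organization."""
    lines_by_org = defaultdict(list)
    for line, org in line_to_org.items():
        lines_by_org[org].append(line)
    for org in sorted(lines_by_org):
        validation = sorted(lines_by_org[org])
        training = sorted(l for l, o in line_to_org.items() if o != org)
        yield org, training, validation


# Toy labels (assumed); real ones would come from the genotype information table.
line_to_org = {
    "line_A": "org1",
    "line_B": "org1",
    "line_C": "org2",
    "line_D": "org3",
    "line_E": "org3",
}
folds = list(leave_one_organization_out(line_to_org))
for org, training, validation in folds:
    print(org, len(training), len(validation))
```

A model would be trained on `training` and evaluated on `validation` for each fold, which shows how well lines from an organization are predicted when none of its own lines inform the model.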

Genomic prediction between URSN and breeding programs

genom_pred_acrosspop_URN_URSN.Rmd. In this script, we perform genomic prediction between the URSN population and the URN population.
genom_pred_across-pop_UMN-BP_URSN.Rmd. In this script, we perform genomic prediction between the URSN population and the UMN breeding program.
genom_pred_across-pop_UMN-BP_subset_UMN.Rmd. In this script, we perform genomic prediction between the URSN population and a subset of the UMN lines.
genom_pred_across-pop_NDSU_URSN.Rmd. In this script, we perform genomic prediction between the URSN population and the NDSU breeding program.
genom_pred_across-pop_NDSU_URSN_subset_NDSU.Rmd. In this script, we perform genomic prediction between the URSN population and a subset of the NDSU lines.
analyze_across_pop_GP.Rmd. In this script, we gather and analyze the results of the genomic prediction between the URSN population and the URN/UMN/NDSU population. All results were combined in a table: URN_UMN_NDSU_genomic_prediction_all_combined_results.tsv.
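
The scenarios that combine the URSN population with a subset of breeding lines can be pictured as a series of training sets in which the number of breeding lines shrinks. The Python sketch below builds such nested training sets, with and without the historical lines. It is an assumed design for illustration, not the authors' code: the function name, the set sizes, and the line labels are hypothetical.

```python
import random


def nested_training_sets(historical_lines, breeding_lines, sizes, seed=1):
    """Build training sets with a shrinking number of breeding lines.

    For each size, the same random subset of breeding lines is used alone
    and together with all historical lines (hypothetical design).
    """
    rng = random.Random(seed)
    shuffled = list(breeding_lines)
    rng.shuffle(shuffled)
    scenarios = {}
    for size in sizes:
        # Taking a prefix of one shuffle makes smaller sets subsets of larger ones.
        kept = sorted(shuffled[:size])
        scenarios[size] = {
            "breeding_only": kept,
            "breeding_plus_historical": kept + sorted(historical_lines),
        }
    return scenarios


# Toy labels (assumed).
historical = [f"URSN_{i:03d}" for i in range(20)]
breeding = [f"BP_{i:03d}" for i in range(10)]
scenarios = nested_training_sets(historical, breeding, sizes=[10, 5, 2, 0])
for size, sets in scenarios.items():
    print(size, len(sets["breeding_only"]), len(sets["breeding_plus_historical"]))
```

In practice, the lines being predicted would have to be kept out of every training set; that step is omitted here.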

Genomic prediction with training set optimization

genom_pred_acrosspop_TSopt_URN_URSN.Rmd. In this script, we perform genomic prediction between the URSN population and the URN population, using training set optimization.
genom_pred_across-pop_TSopt_UMN-BP_URSN.Rmd. In this script, we perform genomic prediction between the URSN population and the UMN breeding program, using training set optimization.
genom_pred_across-pop_TSopt_NDSU_URSN.Rmd. In this script, we perform genomic prediction between the URSN population and the NDSU breeding program, using training set optimization.

Other

plot_figures.Rmd. In this script, we generate the figures for the manuscript. Specifically, we load the PCA genetic structure plots for the NDSU and UMN breeding programs and the URSN population and plot them together.
index.Rmd. This script is the index of the project. It contains the description of the project and the links to the different scripts.
about.Rmd. This script contains the information about the project.
license.Rmd. This script contains the license information.

Output

genotype_info_URN_URSN.tsv: Table of genomic information for the URN and URSN populations.
URSN_geno_blups_lme4-values_95-23.tsv: Table of BLUPs for the URSN population.
URSN_df_deregressed_blups_8traits_95-23.tsv: Table of deregressed BLUPs for the URSN population.
URSN_geno_blups_var_lme4-values_95-23.Rdata: Rdata file of the variance of the BLUPs for the URSN population.
URSN_lme4_fit-info_all_traits_95-23.tsv: Table of fitting information for the URSN population.
URN_geno_blups_lme4-values_FHBtraits.tsv: Table of BLUPs for the URN population.
URN_geno_blups_var_lme4-values_FHBtraits.Rdata: Rdata file of the variance of the BLUPs for the URN population.
URN_lme4_fit-info_all_traits_FHBtraits.tsv: Table of fitting information for the URN population.
res_GP_TSopt_within_across-pop_URN_URSN_7traits.tsv: Table of results for the training set optimization genomic prediction within the URN population and between the URSN and URN populations.
URN_UMN_NDSU_genomic_prediction_all_combined_results.tsv: Table of results for the genomic prediction between the URSN population and the URN/UMN/NDSU populations, generated in analyze_across_pop_GP.Rmd.
URSN_genomic_prediction_all_combined_results.tsv: Table of results for the genomic prediction within the URSN population.
fig_repro_URSN_UMN_BP_PCA_inds_geno_orga.rds: PCA outputs of the genetic structure of the UMN breeding program and the URSN population. It is generated in explore_UMN_BP_URSN.Rmd and used in plot_figures.Rmd.
output/figures folder. This folder contains the figures generated in all scripts.
output/genom_pred folder. This folder contains the outputs of genomic predictions.
output/lme4 folder. This folder contains the predicted BLUPs and the fitting information.
output/imputation folder. This folder contains the outputs of the imputation quality assessment.

The docs folder contains the rendered Rmd scripts in html format.

Software and versions

R v4.4.0 and RStudio v2024.04.01+748
vcftools v0.1.17
Beagle v5.4

Version changes

November 2024

Update following manuscript revision:

Update genotype names with released cultivar names
Add genotype random subset in URSN genomic selection
Add genomic imputation accuracy assessment:
- Bash script and R-scripts to subset individuals and markers
- Imputation for each chromosome, save imputed genotypes
- Measure of imputation accuracy
- Gather all imputation results
