Data from: Leveraging historical trials to predict Fusarium head blight resistance in spring wheat breeding programs
Data files
Nov 21, 2024 version files (408.92 MB)
- README.md (12.04 KB)
- URSN_GS.zip (408.91 MB)
Abstract
Fusarium head blight (FHB) is a fungal disease posing a major threat to wheat production. Plant breeding that leverages genotyping is an effective method to improve the genetic resistance of cultivars. Started in 1995, the uniform regional scab nursery (URSN) consists of germplasm from several public breeding programs in the Northern U.S. region. Its main objective is to showcase new sources of resistance and enable germplasm exchange among the cooperators; however, the data from the URSN have not been studied. Phenotypic and genotypic data were gathered from this nursery, as well as from two current breeding programs in the U.S. Midwest. Genomic prediction was applied to eight FHB-related and agronomic traits, and the effects of statistical method, marker density, training set size, genetic structure, and genetic architecture of the trait were studied. Using the URSN population, reproducing kernel Hilbert space (RKHS) regression was the best method in various prediction settings, with an average accuracy of 0.63; marker density could be as low as 500 markers without decreasing the prediction accuracy; and training set optimization was useful for two traits. Furthermore, genotypic values were predicted in breeding programs using the URSN population as a training set under various prediction scenarios. Predicting unrelated populations led to a significant decrease in accuracy, but with encouraging values for some traits and populations. Finally, as the number of lines from breeding populations in the training set was progressively decreased, the advantage of adding the URSN population became more pronounced, with an increase in accuracy of up to 0.19.
README: General information
This README file was generated on 2024-11-15 by Charlotte Brault
Title of Dataset
Supporting data for Leveraging historical trials to predict Fusarium head blight resistance in spring wheat breeding programs.
Author Information
Principal Investigator Contact Information
James A. Anderson
Department of Agronomy and Plant Genetics, University of Minnesota, 411 Borlaug Hall, 1991 Buford Circle, St. Paul, MN 55108
Email: ander319@umn.edu
ORCID: 0000-0003-4655-6517
Associate or Co-investigator Contact Information
Jason D. Fiedler
USDA-ARS Cereal Crops Research Unit Edward T. Schafer Agricultural Research Center, Fargo, ND, USA
Email: jason.fiedler@usda.gov
ORCID: 0000-0001-7736-4484
Associate or Co-investigator Contact Information
Charlotte Brault
Department of Agronomy and Plant Genetics, University of Minnesota, 411 Borlaug Hall, 1991 Buford Circle, St. Paul, MN 55108
Email: cbrault@umn.edu / charlotte.brault@live.com
ORCID: 0000-0001-7892-4236
- Date of data collection (single date, range, approximate date): 1995-2023
- Geographic location of data collection (where was data collected?): Northern Great Plains-Midwest (MN, ND, SD, Manitoba)
- Information about funding sources that supported the collection of the data: U.S. Wheat and Barley Scab Initiative, USDA-ARS.
- Overview of the data (abstract): Genomic prediction was used as a tool to improve the genetic resistance of spring wheat to Fusarium head blight, using data from a historical nursery and from two current breeding programs. Parameters related to genomic predictive ability were thoroughly studied. Predictive ability within the historical nursery was medium to high for all eight traits studied, and all methods gave similar results. Marker density did not affect predictive ability, whereas training set size did. Training set optimization had mixed results depending on the trait. We tested several prediction scenarios useful in a breeding context by including the historical dataset in the training set for predicting breeding lines. While the lack of genetic relatedness decreased the accuracy of genomic prediction, we showed that breeding programs could benefit from these historical data by incorporating them into training models, thus reducing the phenotyping effort.
Sharing/Access Information
Licenses/restrictions placed on the data: CC0
Data and file description
Data
URSN and URN populations
Phenotypic data
URN_cleaned_table_phenot_1992_2023.tsv
Curated phenotypic data for the URN population, phenotyped for FHB-related traits.
URSN_cleaned_table_phenot_1995_2023.tsv
Curated phenotypic data for the URSN population.
Genotypic data
URN_URSN_90K_Filt_geno_formatted_imputed_mafFilt_gdose.rds
Input genotypic file for the URSN population for the 90K array.
URN_URSN_3K_geno_formatted_imputed_mafFilt_gdose.rds
Input genotypic file for the 3K genotypic array.
genotype_info_URN_URSN.tsv
Table of genomic information for the URN and URSN populations.
NDSU breeding program
NDSU_URSN_across-pop_list_inputs_v2.rds
Combined phenotypic and genomic data from NDSU and URSN populations.
UMN breeding program
geno_combined_URSN_UMN-BP_imputed_Beagle_v2.rds
Genotypic data from the UMN breeding program and the URSN population.
pheno_combined_URSN_UMN-BP_5traits_v2.rds
Phenotypic data from the UMN breeding program and the URSN population.
2022 PY Entry list.xlsx
List of PY entries for the UMN breeding program for 2022 phenotypic data, with the genotype name, locations, and pedigree.
2023 PY Entry list.xlsx
List of PY entries for the UMN breeding program for 2023 phenotypic data, with the genotype name, locations, and pedigree.
UMN_parent_compo.csv
Table of the composition of the genotypes used from the UMN breeding program.
Code
useful_functions.R
Script that wraps custom functions for pre-processing data, analyzing data, and producing results.
subsetMrkCV.R
Script to subset markers multiple times for testing the effect of marker density.
subsetIndsCV.R
Script to subset individuals multiple times to estimate the imputation accuracy.
imputationQuality.R
Script to estimate the imputation accuracy based on predicted genotypes.
beagle.01Mar24.d36.jar
Beagle software for imputation.
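As an illustration of the repeated marker subsetting that subsetMrkCV.R is described as performing, the sketch below draws several random marker subsets of a fixed size. It is written in Python for illustration only (the deposited scripts are in R) and is not the authors' code: the function name, the marker names, the subset size, and the number of repeats are all assumptions chosen for the example.

```python
import random


def repeated_marker_subsets(marker_ids, n_markers, n_repeats, seed=0):
    """Draw n_repeats random subsets of n_markers marker IDs.

    Sampling is without replacement within a subset; different subsets
    may overlap. A fixed seed makes the draws reproducible.
    """
    if n_markers > len(marker_ids):
        raise ValueError("n_markers exceeds the number of available markers")
    rng = random.Random(seed)
    return [sorted(rng.sample(marker_ids, n_markers)) for _ in range(n_repeats)]


# Hypothetical marker names standing in for a genotyping array
markers = [f"mrk_{i:04d}" for i in range(3000)]

# Ten independent subsets of 500 markers each (both values are assumptions)
subsets = repeated_marker_subsets(markers, n_markers=500, n_repeats=10, seed=42)
print(len(subsets), len(subsets[0]))
```

Each subset could then be passed to the same prediction pipeline, so that the spread of accuracies across repeats reflects the sampling of markers at a given density.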
Bash
This folder contains the bash scripts used to run some of the analyses on the cluster.
launch_array.sh
Script to launch the array job for marker selection (with subset_MrkCV.R) and genomic prediction within the URSN population.
launch_Rmdscript.sh
Script to launch various Rmd scripts for genomic prediction on the cluster.
loop_imputation_3k_to_90k.sh
Script to launch the genomic imputation from the 3K to the 90K genotypic array in the URN-URSN population.
Analysis scripts
Prepare inputs
Phenotypic data
extract_blups_URSN_data.Rmd
In this script, we extract the BLUPs from the phenotypic data for the URSN population, from the URSN_cleaned_table_phenot_1995_2023.tsv table. It outputs a table of BLUPs, URSN_geno_blups_lme4-values.tsv, and a table of fitting information, URSN_lme4_fit-info_all_traits.tsv.
extract_blups_URN_FHB.Rmd
In this script, we extract the BLUPs from the phenotypic data for the URN population, from the URN_cleaned_table_phenot_1992_2023.tsv table. It outputs a table of BLUPs, URN_geno_blups_lme4-values_FHBtraits.tsv, and a table of fitting information, URN_lme4_fit-info_all_traits_FHBtraits.tsv.
analyze_BLUPs_URN_URSN.Rmd
In this script, we analyze the BLUPs from the URSN and URN populations. It outputs plots of phenotypic structure and genetic correlations among traits.
Genomic data URSN
explo_geno_URN_URSN.Rmd
In this script, we explore the genomic data for the URSN population. It outputs plots of the genetic structure, additive relatedness, and linkage disequilibrium.
URN_URSN_measure_imput_quality.Rmd
In this script, we load the imputation results for the URSN population from the 3K to the 90K array.
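Imputation quality is commonly assessed by hiding known genotypes, imputing them, and comparing the imputed values with the hidden truth. The sketch below shows that comparison in Python, for illustration only (the deposited scripts are in R). The metrics used here, squared Pearson correlation and rounded-dosage concordance, are assumptions and may differ from what imputationQuality.R computes. All names and the toy dosages are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined without variation
    return sxy / sqrt(sxx * syy)


def imputation_accuracy(true_dosage, imputed_dosage, masked):
    """Compare imputed and true dosages at the masked (individual, marker) cells."""
    truth = [true_dosage[i][j] for i, j in masked]
    imputed = [imputed_dosage[i][j] for i, j in masked]
    r = pearson(truth, imputed)
    # Concordance: share of masked cells whose rounded dosage equals the truth
    concordance = sum(round(p) == t for t, p in zip(truth, imputed)) / len(truth)
    return {"r2": r * r, "concordance": concordance}


# Toy dosage matrices (rows = individuals, columns = markers), values in 0..2
true_dosage = [[0, 1, 2], [2, 2, 0], [1, 0, 1]]
imputed_dosage = [[0, 1, 2], [2, 1.8, 0], [1, 0.2, 1]]
masked = [(0, 0), (0, 2), (1, 1), (2, 1)]  # cells hidden before imputation

result = imputation_accuracy(true_dosage, imputed_dosage, masked)
print(result)
```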
URSN and UMN data
explore_UMN_BP_URSN.Rmd
In this script, we explore the genomic and phenotypic data for the UMN breeding program and the URSN population. It outputs plots of the genetic structure (PCA) and the genetic correlation and distribution between populations.
Genomic prediction within URSN
genom_pred_URN_URSN.Rmd
In this script, we perform genomic prediction within the URSN population, using various inputs.
genom_pred_URN_URSN_exclude_orga.Rmd
In this script, we perform genomic prediction by excluding each organization in turn.
genom_pred_URN_URSN_subsetGeno.Rmd
In this script, we perform genomic prediction by subsetting genotypes from the 3K genotypic array to match the number of individuals in the 90K genotypic array.
genom_pred_URSN_TSoptSFSI.Rmd
In this script, we perform genomic prediction within the URSN population, using training set optimization with sparse selection index.
analyze_genom_pred_results_URSN.Rmd
In this script, we combine and analyze the results of the genomic prediction within the URSN population. All results were combined in a table: URSN_genomic_prediction_all_combined_results.tsv.
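The scenario of excluding each organization in turn can be read as a leave-one-group-out cross-validation. The sketch below illustrates that splitting logic in Python, for illustration only (the deposited scripts are in R). It scores each fold with the Pearson correlation between predicted and observed values, which is a common definition of predictive ability but an assumption here. The organization labels, the toy data, and the single-marker regression used as a stand-in predictor are all hypothetical; the genomic models themselves are not implemented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each distinct group."""
    for g in sorted(set(groups)):
        test = [i for i, label in enumerate(groups) if label == g]
        train = [i for i, label in enumerate(groups) if label != g]
        yield g, train, test


def fit_predict(x, y, train, test):
    """Placeholder predictor: least-squares regression on a single marker."""
    mx = sum(x[i] for i in train) / len(train)
    my = sum(y[i] for i in train) / len(train)
    sxx = sum((x[i] - mx) ** 2 for i in train)
    slope = sum((x[i] - mx) * (y[i] - my) for i in train) / sxx
    return [my + slope * (x[i] - mx) for i in test]


# Toy data: one marker dosage and one trait value per line (hypothetical)
dosage = [0, 1, 2, 0, 1, 2, 0, 1, 2]
trait = [1.0, 2.1, 2.9, 0.8, 2.0, 3.2, 1.1, 1.9, 3.0]
orga = ["OrgA"] * 3 + ["OrgB"] * 3 + ["OrgC"] * 3

ability = {}
for group, train, test in leave_one_group_out(orga):
    predicted = fit_predict(dosage, trait, train, test)
    ability[group] = pearson(predicted, [trait[i] for i in test])
print(ability)
```

Because the held-out organization contributes no lines to the training set, each fold measures how well a model transfers to material from a program it has not seen.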
Genomic prediction between URSN and breeding programs
genom_pred_acrosspop_URN_URSN.Rmd
In this script, we perform genomic prediction between the URSN population and the URN population.
genom_pred_across-pop_UMN-BP_URSN.Rmd
In this script, we perform genomic prediction between the URSN population and the UMN breeding program.
genom_pred_across-pop_UMN-BP_subset_UMN.Rmd
In this script, we perform genomic prediction between the URSN population and a subset of the UMN lines.
genom_pred_across-pop_NDSU_URSN.Rmd
In this script, we perform genomic prediction between the URSN population and the NDSU breeding program.
genom_pred_across-pop_NDSU_URSN_subset_NDSU.Rmd
In this script, we perform genomic prediction between the URSN population and a subset of the NDSU lines.
analyze_across_pop_GP.Rmd
In this script, we gather and analyze the results of the genomic prediction between the URSN population and the URN/UMN/NDSU populations. All results were combined in a table: URN_UMN_NDSU_genomic_prediction_all_combined_results.tsv.
Genomic prediction with training set optimization
genom_pred_acrosspop_TSopt_URN_URSN.Rmd
In this script, we perform genomic prediction between the URSN population and the URN population, using training set optimization.
genom_pred_across-pop_TSopt_UMN-BP_URSN.Rmd
In this script, we perform genomic prediction between the URSN population and the UMN breeding program, using training set optimization.
genom_pred_across-pop_TSopt_NDSU_URSN.Rmd
In this script, we perform genomic prediction between the URSN population and the NDSU breeding program, using training set optimization.
Other
plot_figures.Rmd
In this script, we generate the figures for the manuscript. Specifically, we load the PCA genetic structure plots for the NDSU and UMN breeding programs and the URSN population and plot them together.
index.Rmd
This script is the index of the project. It contains the description of the project and the links to the different scripts.
about.Rmd
This script contains the information about the project.
license.Rmd
This script contains the license information.
Output
genotype_info_URN_URSN.tsv
Table of genomic information for the URN and URSN populations.
URSN_geno_blups_lme4-values_95-23.tsv
Table of BLUPs for the URSN population.
URSN_df_deregressed_blups_8traits_95-23.tsv
Table of deregressed BLUPs for the URSN population.
URSN_geno_blups_var_lme4-values_95-23.Rdata
Rdata file of the variance of BLUPs for the URSN population.
URSN_lme4_fit-info_all_traits_95-23.tsv
Table of fitting information for the URSN population.
URN_geno_blups_lme4-values_FHBtraits.tsv
Table of BLUPs for the URN population.
URN_geno_blups_var_lme4-values_FHBtraits.Rdata
Rdata file of the variance of BLUPs for the URN population.
URN_lme4_fit-info_all_traits_FHBtraits.tsv
Table of fitting information for the URN population.
res_GP_TSopt_within_across-pop_URN_URSN_7traits.tsv
Table of results for genomic prediction with training set optimization within the URN population and between the URSN and URN populations.
URN_UMN_NDSU_genomic_prediction_all_combined_results.tsv
Table of results for the genomic prediction between the URSN population and the URN/UMN/NDSU populations, generated in analyze_across_pop_GP.Rmd.
URSN_genomic_prediction_all_combined_results.tsv
Table of results for the genomic prediction within the URSN population.
fig_repro_URSN_UMN_BP_PCA_inds_geno_orga.rds
This file contains the PCA outputs of the genetic structure of the UMN breeding program and the URSN population. It was generated in explore_UMN_BP_URSN.Rmd and is used in plot_figures.Rmd.
- output/figures folder. This folder contains the figures generated in all scripts.
- output/genom_pred folder. This folder contains the outputs of genomic predictions.
- output/lme4 folder. This folder contains the BLUP predictions and fitting information.
- output/imputation folder. This folder contains the outputs of the imputation quality assessment.
The docs folder contains the rendered Rmd scripts in html format.
Software and versions
- R version v4.4.0 and RStudio v2024.04.01+748
- vcftools 0.1.17
- Beagle v5.4
Version changes
November 2024
Update following manuscript revision:
- Update genotype names with released cultivar names
- Add genotype random subset in URSN genomic selection
- Add genomic imputation accuracy assessment:
  - Bash script and R scripts to subset individuals and markers
  - Imputation for each chromosome, save imputed genotypes
  - Measure of imputation accuracy
  - Gather all imputation results