Dataset to evaluate the impact of environmental kernels in genomic prediction models
Data files
Feb 10, 2026 version, 7.05 MB in total
- impact-ec-main.zip (7.04 MB)
- README.md (7.86 KB)
Abstract
Integrating genomic and environmental information holds the potential for enhancing the predictive power of genomic prediction models when accounting for the genotype-by-environment interactions. Hence, incorporating environmental covariates (EC) into these models can significantly influence their predictive accuracy. In this study, we utilized 1379 genotypes from the SoyNAM dataset, evaluated across four environments and genotyped with 4611 single-nucleotide polymorphism markers, to compare models incorporating genotype-by-environment and genotype-by-environmental covariate interactions using different covariance matrices. We evaluated four approaches: summarizing EC by averaging (AVG), filtering ECs based on a coefficient of determination criterion (FILT), segmenting ECs by crop phenology (STG), and a naïve approach that utilized all available information (ALL). Predictive ability was assessed as the Pearson correlation between the genomic estimated breeding values and the adjusted phenotypes, considering 10 replicates of three cross-validation scenarios (CV2: predicting tested genotypes in observed environments; CV1: untested genotypes in observed environments; CV0: tested genotypes in novel environments). Incorporating EC information into the models increased average predictive ability from 0.42 to 0.56 for CV1 and CV2. In these cases, the predictive ability was lower when EC information was averaged to compute the environmental kinship matrix, with slight differences observed with respect to the other approaches. Regarding the CV0 scheme, the model incorporating only genotype-by-environment information performed better (0.33). The naïve method, which utilized all available EC information (ALL), proved to be a promising approach, as it effectively improved the results in these scenarios while eliminating the need for additional steps in selecting variables.
Dataset DOI: 10.5061/dryad.2fqz6133v
Description of the data and file structure
Material developed for the manuscript: Impact of environmental covariates summarization on predictive ability in genomic selection.
This dataset supports the evaluation of how different approaches to obtaining environmental kernels affect predictive ability in genomic selection models. It comprises 1,379 genotypes evaluated in four environments across the United States and genotyped with 4,611 SNP markers. The dataset is a subset of SoyNAM (Soybean Nested Association Mapping) and was initially loaded using the R package "SoyNAM" (Xavier et al., 2022). The archive also includes all the R code used for the data analyses, divided into scripts for extracting and handling environmental covariates, building the design matrices for fitting the models, assigning the phenotypic information to cross-validation folds, fitting the models, and extracting the variance components.
Files and variables
File: impact-ec-main.zip
Description: Evaluating how different approaches to obtain environmental kernels can impact predictive ability.
data folder
W.csv or WP.csv: Contains the daily environmental covariates for the whole growing season. The number in each variable name is the day of the year on which the data were collected.
- Location: The name of the environment where the phenotypic data were collected.
- T2M_MAX: The daily maximum temperature at 2 meters in degrees Celsius.
- T2M_MIN: The daily minimum temperature at 2 meters in degrees Celsius.
- RH2M: The ratio of actual partial pressure of water vapor to the partial pressure at saturation in percent.
- T2M: The average air temperature at 2 meters above the surface of the earth in degrees Celsius.
- PRECTOTCORR: The total precipitation at the surface of the earth in water mass, in millimeters.
- WS2M: The average wind speed at 2 meters above the surface of the earth in meters per second.
- QV2M: The ratio of the mass of water vapor to the total mass of air at 2 meters in percent.
WCONV.csv: Contains the environmental covariates summarized as averages over the growing season.
- T2M_MAX: The daily maximum temperature at 2 meters in degrees Celsius.
- T2M_MIN: The daily minimum temperature at 2 meters in degrees Celsius.
- RH2M: The ratio of actual partial pressure of water vapor to the partial pressure at saturation in percent.
- T2M: The average air temperature at 2 meters above the surface of the earth in degrees Celsius.
- PRECTOTCORR: The total precipitation at the surface of the earth in water mass, in millimeters.
- WS2M: The average wind speed at 2 meters above the surface of the earth in meters per second.
- QV2M: The ratio of the mass of water vapor to the total mass of air at 2 meters in percent.
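One common way to turn a table of season-averaged covariates such as WCONV.csv into an environmental relationship (kernel) matrix is to standardize each covariate and take the scaled cross-product. The sketch below uses invented values for four hypothetical environments and three of the covariates listed above. The object names (`W`, `env_kernel`, `Omega`) and the linear-kernel formula are illustrative assumptions and may differ from what the scripts in the pipeline folder do.

```r
# Toy season-averaged covariates: 4 hypothetical environments x 3 covariates.
# All values are invented for illustration only.
W <- matrix(
  c(28.1, 16.3, 3.2,
    29.4, 17.0, 2.1,
    26.8, 15.2, 4.0,
    30.2, 18.1, 1.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Env", 1:4),
                  c("T2M_MAX", "T2M_MIN", "PRECTOTCORR"))
)

# Linear environmental kernel (an assumed form): center and scale each
# covariate, then average the cross-products over covariates.
env_kernel <- function(W) {
  Ws <- scale(W, center = TRUE, scale = TRUE)
  tcrossprod(Ws) / ncol(Ws)
}

Omega <- env_kernel(W)
print(round(Omega, 2))
```

The result is a symmetric environments-by-environments matrix. Because each covariate is standardized first, covariates measured on different scales (degrees Celsius, millimeters) contribute equally.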
WFILT.csv: Contains the environmental covariates filtered using a coefficient of determination criterion. The number in each variable name is the day of the year on which the data were collected.
- T2M_MAX: The daily maximum temperature at 2 meters in degrees Celsius.
- T2M_MIN: The daily minimum temperature at 2 meters in degrees Celsius.
- RH2M: The ratio of actual partial pressure of water vapor to the partial pressure at saturation in percent.
- T2M: The average air temperature at 2 meters above the surface of the earth in degrees Celsius.
- PRECTOTCORR: The total precipitation at the surface of the earth in water mass, in millimeters.
- WS2M: The average wind speed at 2 meters above the surface of the earth in meters per second.
- QV2M: The ratio of the mass of water vapor to the total mass of air at 2 meters in percent.
WSTG.csv and W_stages.csv: Contain the environmental covariates organized by soybean phenological stage. The number in each variable name identifies the interval over which the data were summarized.
- T2M_MAX: The daily maximum temperature at 2 meters in degrees Celsius.
- T2M_MIN: The daily minimum temperature at 2 meters in degrees Celsius.
- RH2M: The ratio of actual partial pressure of water vapor to the partial pressure at saturation in percent.
- T2M: The average air temperature at 2 meters above the surface of the earth in degrees Celsius.
- PRECTOTCORR: The total precipitation at the surface of the earth in water mass, in millimeters.
- WS2M: The average wind speed at 2 meters above the surface of the earth in meters per second.
- QV2M: The ratio of the mass of water vapor to the total mass of air at 2 meters in percent.
X.csv: Contains the molecular marker (single-nucleotide polymorphism) information for the evaluated genotypes. Each marker column corresponds to one molecular marker, coded as 0, 1, or 2.
- Genotype: The name of the evaluated genotype.
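A marker matrix coded 0, 1, or 2, like the one in X.csv, is typically converted into a genomic relationship matrix before model fitting. The sketch below shows one widely used construction (centering each marker by twice its allele frequency and scaling by the expected variance) on a small invented matrix. The toy genotypes, the function name `genomic_kernel`, and the choice of this particular formula are assumptions for illustration; the source does not state which covariance structure the scripts use.

```r
# Toy marker matrix: 5 hypothetical genotypes x 6 SNPs, coded 0/1/2.
# Values are invented for illustration only.
X <- matrix(
  c(0, 2, 1, 0, 2, 2,
    2, 2, 0, 0, 1, 0,
    0, 0, 2, 2, 2, 0,
    2, 0, 1, 2, 0, 2,
    1, 2, 0, 0, 2, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("G", 1:5), paste0("SNP", 1:6))
)

# Genomic relationship matrix (assumed form):
# center each marker by 2p and scale by 2 * sum(p * (1 - p)).
genomic_kernel <- function(X) {
  p <- colMeans(X) / 2                      # frequency of the counted allele
  keep <- p > 0 & p < 1                     # drop monomorphic markers
  Z <- sweep(X[, keep, drop = FALSE], 2, 2 * p[keep])  # subtract 2p per column
  tcrossprod(Z) / (2 * sum(p[keep] * (1 - p[keep])))
}

G <- genomic_kernel(X)
print(round(G, 2))
```

The output is a genotypes-by-genotypes matrix whose row and column order follows the rows of the marker matrix, so it can be aligned with the genotype names before being combined with an environmental kernel.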
Y.csv: Contains the phenotypic information of the genotypes evaluated across environments.
- year: The year in which the phenotypic data were collected.
- location: The location where the phenotypic data were collected.
- environ: The combination of year × location.
- strain: The name of the genotype that was evaluated.
- yield: The grain yield in kg per hectare.
env.cov.csv: Contains the raw environmental covariate data, before any summarization strategy was applied.
- YearFilt: The year in which the weather data were collected.
- ENV: The name of the environment (combination of location × year).
- YEAR: The year in which the weather data were collected.
- MM: The month in which the weather data were collected.
- DD: The day on which the weather data were collected.
- DOY: The day of the year corresponding to the weather data.
- YYYYMMDD: The complete date corresponding to the weather data.
- T2M_MAX: The daily maximum temperature at 2 meters in degrees Celsius.
- T2M_MIN: The daily minimum temperature at 2 meters in degrees Celsius.
- RH2M: The ratio of actual partial pressure of water vapor to the partial pressure at saturation in percent.
- T2M: The average air temperature at 2 meters above the surface of the earth in degrees Celsius.
- PRECTOTCORR: The total precipitation at the surface of the earth in water mass, in millimeters.
- WS2M: The average wind speed at 2 meters above the surface of the earth in meters per second.
- QV2M: The ratio of the mass of water vapor to the total mass of air at 2 meters in percent.
pipeline
- 01.extracting_ec_data.R:
R code for extracting the environmental covariates of the locations.
- 02.AdjustingW.R:
R code for applying the different strategies used to summarize the environmental covariate information.
- 03.DesignMatrices.R:
R code for building the design matrices used to fit the genomic selection models.
- 04.1.FittingFullDataModels.R:
R code for fitting the different models to the full data set in order to estimate variance components.
- 04.FittingModelsParallel.R:
R code for running the cross-validation in parallel to reduce running time.
- 05.VC.R:
R code for extracting variance components from the full-data models.
- 3.1.AssigningCV.R:
R code for assigning observations to folds for cross-validation.
- .gitignore: File generated automatically when the GitHub repository was created. Not needed for running the analyses.
- impact-ec.Rproj: File for opening the R project.
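The three cross-validation scenarios described in the abstract differ in what is held out: all records of a genotype (CV1), individual genotype-by-environment records (CV2), or a whole environment (CV0). The sketch below illustrates one way such folds could be assigned, together with predictive ability computed as a Pearson correlation. It is not the code in 3.1.AssigningCV.R: the toy data, the fold count `k = 5`, and the names `fold_cv1`, `fold_cv2`, `fold_cv0`, and `predictive_ability` are all assumptions made for illustration.

```r
set.seed(123)  # reproducible toy example

# Toy phenotype table using the Y.csv column names described above.
# Genotype names, environment labels, and yields are invented.
pheno <- expand.grid(strain = paste0("G", 1:20),
                     environ = paste0("E", 1:4),
                     stringsAsFactors = FALSE)
pheno$yield <- rnorm(nrow(pheno), mean = 3500, sd = 400)

k <- 5  # number of folds: an assumption, not taken from the source

# CV1: every record of a genotype goes to the same fold, so held-out
# genotypes are unobserved in all environments.
strains <- unique(pheno$strain)
fold_by_strain <- setNames(sample(rep_len(1:k, length(strains))), strains)
pheno$fold_cv1 <- unname(fold_by_strain[pheno$strain])

# CV2: individual records are assigned to folds at random, so a held-out
# record's genotype is usually still observed in other environments.
pheno$fold_cv2 <- sample(rep_len(1:k, nrow(pheno)))

# CV0: one environment is held out at a time.
pheno$fold_cv0 <- match(pheno$environ, unique(pheno$environ))

# Predictive ability for a held-out set: Pearson correlation between
# predicted and observed values.
predictive_ability <- function(predicted, observed) {
  cor(predicted, observed, method = "pearson")
}
```

In a full analysis, the records of one fold would be masked, the model refitted, and `predictive_ability()` applied to the masked records. Repeating the random assignment with different seeds gives replicates of each scenario.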
Code/software
The data files can be viewed in Microsoft Excel, and the code was written in the R language. The original code can be accessed at: https://github.com/vsagae/impact-ec
Access information
Data were derived from the following sources:
- The dataset is a subset of SoyNAM (Soybean Nested Association Mapping) and was initially loaded using the R package "SoyNAM" (Xavier et al., 2022).
