A method for identifying environmental stimuli and genes responsible for genotype-by-environment interactions from a large-scale multi-environment data set
Data files
Dec 22, 2021 version files 461.89 MB
Abstract
It is not yet fully understood which environmental stimuli cause genotype-by-environment (G × E) interactions in real fields, when these stimuli act, and which genes respond to them. Large-scale multi-environment data sets are attractive data sources for addressing these questions because the trials they record were potentially exposed to a wide range of environmental conditions. In this study, we developed a data-driven approach termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to identify environmental stimuli and genes responsible for G × E interactions from large-scale multi-environment data sets. ECGC was applied to a soybean (Glycine max) data set that consisted of 25,158 records collected in 52 environments. ECGC showed which meteorological factors shaped the G × E interactions in six traits, including yield, flowering time, and protein content, and when those factors were involved. For example, it showed that precipitation around the sowing date and hours of sunshine just before maturity were relevant to the interactions observed for yield. Moreover, genome-wide association mapping of the sensitivities to the identified stimuli discovered both candidate and known genes responsible for the G × E interactions. Our results demonstrate the capability of data-driven approaches to provide novel insights into the G × E interactions observed in fields. This dataset provides the data used in this study and the supplementary tables cited in the manuscript.
Usage notes
Supplementary Tables (Tables S1-S21) are provided in a single Excel file (SupplementaryTables.xlsx). Supplementary Data (Supplementary Data 1-4) are provided as four separate CSV files (SupplementaryData1-4.csv). An example R script to execute ECGC is also provided; this script uses Table S5 and Table S11. Captions for the Supplementary Tables are included in the Excel file. Captions for the Supplementary Data are provided below.
Supplementary Data 1
Record IDs, phenotypic values, and meteorological measurements are included. The meteorological measurements span the period from the sowing date (day 0) to the 245th day after sowing. The column names of the meteorological measurements are composed of the abbreviation of a meteorological factor and the number of days after sowing. The abbreviations are:
T, mean temperature
Tmax, maximum temperature
Tmin, minimum temperature
Pr, precipitation
e, vapour pressure
VPD, vapour pressure deficit
RH, relative humidity
RHmin, minimum relative humidity
u, wind speed
u10max, maximum wind speed
N, hours of sunshine
Sd, solar radiation
EP, potential evapotranspiration
Ph, photoperiod.
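The naming scheme above can be decoded programmatically. The following Python sketch is illustrative only: it assumes that each meteorological column name is an abbreviation immediately followed by the day number, optionally separated by "_" or "." (for example "Tmax12" or "Tmax_12"). The exact format is not stated in this caption, so check the header of Supplementary Data 1 and adjust the pattern if needed. The function name parse_met_column is hypothetical.

```python
import re

# Abbreviations as listed in the caption of Supplementary Data 1.
FACTORS = {
    "T": "mean temperature",
    "Tmax": "maximum temperature",
    "Tmin": "minimum temperature",
    "Pr": "precipitation",
    "e": "vapour pressure",
    "VPD": "vapour pressure deficit",
    "RH": "relative humidity",
    "RHmin": "minimum relative humidity",
    "u": "wind speed",
    "u10max": "maximum wind speed",
    "N": "hours of sunshine",
    "Sd": "solar radiation",
    "EP": "potential evapotranspiration",
    "Ph": "photoperiod",
}

# Try longer abbreviations first so that "Tmax" is not read as "T"
# and "u10max" is not read as "u" followed by a day number.
_ALTERNATION = "|".join(
    sorted((re.escape(abbr) for abbr in FACTORS), key=len, reverse=True)
)

# ASSUMPTION: abbreviation + optional "_" or "." + day number.
_PATTERN = re.compile(rf"^({_ALTERNATION})[_.]?(\d+)$")


def parse_met_column(name):
    """Return (abbreviation, day) for a meteorological column, else None."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    day = int(match.group(2))
    # The caption states that measurements cover day 0 to day 245.
    if not 0 <= day <= 245:
        return None
    return match.group(1), day


print(parse_met_column("Tmax12"))   # ('Tmax', 12)
print(parse_met_column("u10max0"))  # ('u10max', 0)
```

Columns that do not match, such as record IDs or phenotypic values, return None, so the same function can be used to separate meteorological columns from the rest of the header.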
Supplementary Data 2
The genomic relationship matrix used in the mixed models is included. The column and row names are the variety IDs.
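Before using the matrix in a mixed model, it can be worth confirming that it was read correctly. The Python sketch below is a minimal example, not part of the provided R script. It assumes that the first row and first column of the CSV file hold the variety IDs and that all remaining cells are numeric; the function names and the tiny in-memory example are hypothetical.

```python
import csv
import io


def read_labeled_matrix(handle):
    """Read a matrix whose first row and first column hold variety IDs."""
    rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    col_ids = rows[0][1:]
    row_ids = [row[0] for row in rows[1:]]
    values = [[float(cell) for cell in row[1:]] for row in rows[1:]]
    return row_ids, col_ids, values


def check_relationship_matrix(row_ids, col_ids, values, tol=1e-8):
    """Raise ValueError unless the matrix is square, consistently
    labeled, and symmetric within the tolerance `tol`."""
    if row_ids != col_ids:
        raise ValueError("row and column variety IDs differ")
    n = len(row_ids)
    if any(len(row) != n for row in values):
        raise ValueError("matrix is not square")
    for i in range(n):
        for j in range(i + 1, n):
            if abs(values[i][j] - values[j][i]) > tol:
                raise ValueError(f"asymmetric entry at ({i}, {j})")
    return True


# Made-up 2 x 2 example standing in for the real file.
example = io.StringIO(",V1,V2\nV1,1.0,0.2\nV2,0.2,1.1\n")
ids, cols, matrix = read_labeled_matrix(example)
check_relationship_matrix(ids, cols, matrix)
```

For the real data, the in-memory example would be replaced by an open file handle, for instance open("SupplementaryData2.csv", newline=""), assuming that file name follows the SupplementaryData1-4.csv pattern given in the usage notes.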
Supplementary Data 3
Adjusted phenotypic values used for estimating genetic correlations are included. Phenotypic values were adjusted for fixed effects within each environment. The adjusted phenotypic values from each environment were then matched to those from each of the other environments. The column names are composed of environment IDs and traits.
Supplementary Data 4
The record IDs for the values in Supplementary Data 3 are included. The record IDs indicate the positions of the records in Supplementary Data 1.
Trait abbreviations
In these files, traits are often referred to by abbreviations: DTF, days to flowering; DTM, days to maturity; SL, stem length (cm); PR, protein content of seeds (%); YI, yield (kg/a); SW, seed weight (g/100 seeds).
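The trait abbreviations can be kept in a small lookup table when processing the files. The Python sketch below also shows one way to split a column name of Supplementary Data 3 into an environment ID and a trait. It assumes that the trait abbreviation comes last in the column name, optionally preceded by a separator such as "_"; the actual layout is not specified in the captions and should be checked against the file. The column name "E01_YI" and the function name split_env_trait are hypothetical.

```python
# Trait abbreviations and units as given in this description.
# None means that no unit is stated for that trait.
TRAITS = {
    "DTF": ("days to flowering", None),
    "DTM": ("days to maturity", None),
    "SL": ("stem length", "cm"),
    "PR": ("protein content of seeds", "%"),
    "YI": ("yield", "kg/a"),
    "SW": ("seed weight", "g/100 seeds"),
}


def split_env_trait(column_name):
    """Split a column name into (environment ID, trait abbreviation).

    ASSUMPTION: the name ends with the trait abbreviation, optionally
    preceded by a "_", ".", "-", or space separator.
    Returns None if no known trait abbreviation is found at the end.
    """
    for abbr in TRAITS:
        if column_name.endswith(abbr):
            env_id = column_name[: -len(abbr)].rstrip("_.- ")
            if env_id:
                return env_id, abbr
    return None


env_id, trait = split_env_trait("E01_YI")  # hypothetical column name
description, unit = TRAITS[trait]
print(env_id, trait, description, unit)  # E01 YI yield kg/a
```

If the real column names place the trait first, the same table can be reused with startswith in place of endswith.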