Supporting data for: Multi-modal deep learning improves grain yield prediction in wheat breeding by fusing genomics and phenomics
Data files
May 18, 2023 version files 99.45 GB
- 2018_CIMMYT_EYT.zip (18.11 GB)
- 2018_CIMMYT_YT_1.zip (3.59 GB)
- 2018_CIMMYT_YT_2.zip (19.84 GB)
- 2018_CIMMYT_YT_3.zip (21.96 GB)
- 2018_CIMMYT_YT_4.zip (30.70 GB)
- Genotypes.zip (5.26 GB)
- README.md (8.36 KB)
- Table_Headers_Info_v2.0.xlsx (17.04 KB)
Abstract
Identifying and growing new crop varieties with the highest yield is of utmost importance to ensure robust and sustainable food supplies for the global population. Plant breeding programs benefit from increasing technological support but still rely on a full growth cycle and manual yield measurement, which slows development. While methods to predict yield have been proposed, satisfactory levels of performance have yet to be reached. In this study, we propose a new machine learning model that simultaneously leverages both genotype and phenotype measurements by fusing multiple sources of input data: longitudinal multispectral and thermal images and digital elevation models collected by unmanned aerial systems, along with single nucleotide polymorphism (SNP) measurements. To handle the varying number of observations for each sample, we use a deep multiple instance learning framework with an attention mechanism. The attention weights also show how much importance the trained model gives to each data input during prediction, which enhances interpretability.
Our model reaches 0.754 ± 0.024 Pearson correlation coefficient when predicting yield in similar environmental conditions, which represents a 34.8% improvement over the genotype-only linear baseline (0.559 ± 0.050). Moreover, we achieve transfer to a new, unseen environment where we obtain 0.386 ± 0.010 (0.407 for ensemble performance) Pearson correlation coefficient when predicting yield on new lines when using genotypes alone, a 13.5% improvement over the linear baseline. We show that our multi-modal deep learning architecture efficiently accounts for plant health and environment, thereby distilling out the genetic contribution and providing excellent predictions. Yield prediction algorithms leveraging phenotypic observations during training are therefore expected to improve plant breeding programs with more accurate selections of lines, speeding up delivery of improved varieties.
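The abstract mentions an attention mechanism that pools a varying number of observations per sample. The following is only a minimal sketch of generic attention-based multiple instance pooling, not the authors' architecture: the function name, the projection parameters `V` and `w`, and the feature sizes are all assumptions introduced for illustration, and the inputs are random numbers rather than data from this dataset.

```python
import numpy as np


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance embeddings into a single vector.

    instances: array of shape (K, D), K observations of one sample
    V:         array of shape (L, D), hypothetical projection matrix
    w:         array of shape (L,),   hypothetical scoring vector
    Returns the pooled (D,) vector and the (K,) attention weights.
    """
    scores = np.tanh(instances @ V.T) @ w      # one score per instance
    scores = scores - scores.max()             # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ instances               # weighted average of instances
    return pooled, weights


rng = np.random.default_rng(0)
D, L = 8, 4                                    # illustrative sizes only
V = rng.normal(size=(L, D))
w = rng.normal(size=L)

# Bags with different numbers of observations map to the same output size.
results = {}
for k in (3, 7):
    bag = rng.normal(size=(k, D))
    pooled, weights = attention_mil_pool(bag, V, w)
    results[k] = (pooled, weights)
    print(k, pooled.shape, weights.round(3))
```

Because the weights sum to one over the observations of a sample, they can be read as the relative importance of each observation, which is the kind of interpretability the abstract refers to.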
Methods
Plant Material and Field Layout
Spring wheat (Triticum aestivum L.) breeding lines from two experiments, named YT (Yield Trials, 27°22'57.6'' N, 109°55'34.7'' W) and EYT (Elite Yield Trials, 27°23'0.1'' N, 109°55'7.9'' W), were selected from the International Maize and Wheat Improvement Center (CIMMYT) wheat breeding program. All trials were planted in November 2017 at the Norman E. Borlaug Experiment Station in Ciudad Obregon, Sonora, Mexico, during the 2017–18 season. The YT experiment consisted of 1800 unique spring wheat entries, while the EYT consisted of 1710 unique entries. Both experiments were arranged in an alpha lattice design, with two blocks in YT and three blocks in EYT. The YT plots served as experimental units and were 1.7 m × 3.4 m in size, planted on two raised beds spaced 0.8 m apart, with paired rows on each bed at 0.15 m spacing. The EYT plots were sown on the flat and were 1.3 m × 4 m in size, with six rows per plot.
UAS, Sensors, and Image Acquisition
The UAS used for image acquisition was a DJI Matrice 100 (DJI, Shenzhen, China). The flight plans were created using the Litchi Android app (VC Technology Ltd., UK) and the CSIRO mission planner application (https://uavmissionplanner.netlify.app/) for the DJI Matrice 100. The flight speed, the flight elevation above the ground, and the width between two parallel flight paths were adjusted based on the overlap rate and the camera field of view. Both cameras were automatically triggered by the onboard GNSS unit at a constant interval of distance traveled. A summary of flight settings is listed in the Supplement (Table 1).
To collect thermal images of the spring wheat nurseries, a FLIR VUE Pro R thermal camera (FLIR Systems, USA) was carried by the DJI Matrice 100. All data collections were conducted between 11 AM and 1 PM during the grain filling stage. The aerial image overlap rate between two geospatially adjacent images was set to 80% both sequentially and laterally to ensure optimal orthomosaic photo stitching quality. To preserve the image pixel information, the FLIR camera was set to capture Radiometric JPEG (R-JPEG) images.
A MicaSense RedEdge-M multispectral camera (MicaSense Inc., USA) was used to collect spring wheat canopy images in both the YT and EYT experiments. All UAS flights were conducted between 11 AM and 2 PM. The aerial image overlap rate between two geospatially adjacent images was set to 80% both sequentially and laterally to ensure optimal orthomosaic photo stitching quality. To preserve the image pixel intensity, the MicaSense RedEdge-M camera was set to capture uncompressed TIFF images.
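The relation between flight altitude, camera field of view, overlap rate, and the spacing between parallel flight paths can be illustrated with simple geometry. The sketch below assumes a nadir-pointing camera over flat ground. The field-of-view value is an illustrative assumption and is not taken from this dataset; only the 35 m altitude and the 80% overlap come from the flight settings described here.

```python
import math


def ground_footprint(altitude_m: float, fov_deg: float) -> float:
    """Width on the ground covered by one image, for a nadir-pointing camera."""
    return 2.0 * altitude_m * math.tan(math.radians(fov_deg) / 2.0)


def spacing_for_overlap(footprint_m: float, overlap: float) -> float:
    """Distance between adjacent images (or flight lines) for a given overlap."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return footprint_m * (1.0 - overlap)


ASSUMED_FOV_DEG = 47.2   # assumption for illustration, not from the source
altitude_m = 35.0        # multispectral flight altitude (m AGL)
overlap = 0.80           # overlap rate used for both cameras

footprint = ground_footprint(altitude_m, ASSUMED_FOV_DEG)
line_spacing = spacing_for_overlap(footprint, overlap)
print(f"footprint: {footprint:.2f} m, flight line spacing: {line_spacing:.2f} m")
```

The same function applies along the flight direction if the along-track field of view is substituted, which gives the distance interval at which the camera would need to be triggered.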
To improve the geospatial accuracy of orthomosaic and orthorectified images, ground control points (GCPs) consisting of bright white/reflective square markers were uniformly distributed across the field experiment before image acquisition. All GCPs were surveyed to cm-level resolution using a Trimble R4 RTK (Trimble Inc., Sunnyvale, California, USA) Global Positioning System (GPS).
Plot-level Traits Extraction
Plot-level phenotypic trait values used for learning include multiple vegetation indices (VIs), the canopy height from the digital elevation models (DEMs), and the canopy temperature. Extraction of plot-level phenotypic values from orthomosaic and orthorectified images followed the methodology of Wang et al. (2020).
Wang, Xu, Paula Silva, Nora Bello, Daljit Singh, Byron Evers, Suchismita Mondal, Francisco Pinto, Ravi Prakash Singh, and Jesse Poland. "Improved accuracy of high-throughput phenotyping from Unmanned Aerial Systems by extracting traits directly from orthorectified images." Frontiers in Plant Science 11 (2020): 1616.
| Setting | MicaSense RedEdge-M, YT | MicaSense RedEdge-M, EYT | FLIR VUE Pro R, YT | FLIR VUE Pro R, EYT |
|---|---|---|---|---|
| Flight Speed | 14 km/h | 14 km/h | 18 km/h | 18 km/h |
| Flight Date | 01/18/2018, 02/26/2018, 03/07/2018, 03/15/2018 | 01/19/2018, 02/23/2018, 03/02/2018, 03/07/2018, 03/21/2018 | 03/08/2018, 03/18/2018 | 02/23/2018, 03/02/2018 |
| Flight Altitude | 35 m AGL | 35 m AGL | 60 m AGL | 60 m AGL |
| Ground Sample Distance of Orthomosaic | 2.05 cm/pixel | 2.05 cm/pixel | 8.20 cm/pixel | 8.20 cm/pixel |
Usage notes
The dataset includes six archived files: 2018_CIMMYT_YT_1.zip, 2018_CIMMYT_YT_2.zip, 2018_CIMMYT_YT_3.zip, 2018_CIMMYT_YT_4.zip, 2018_CIMMYT_EYT.zip, and Genotypes.zip. The first five zipped files (YT_1 to YT_4 and EYT) contain all the phenomics data of the YT and EYT trials, while the last zipped file contains the genomics data.
The "Table_Headers_Info.xlsx" file explains the headers in each tabular data file.
2018 YT dataset
There are four sub-folders and a CSV file in this dataset.
The “DEM” folder stores the Digital Elevation Model (DEM) images of each breeding plot in Geo-Tiff format, cropped from the DEM of the entire YT field trial. The image file name indicates the plot ID.
The “Multispec” folder stores the reflectance orthorectified images in Geo-Tiff format, cropped from orthorectified raw images captured by the MicaSense RedEdge-M camera. Because of their large scale, the YT field trials are separated into three sections, named “1-60”, “61-141”, and “142-320”. Images of different sections may have been acquired on different dates. The section and the acquisition date are given in the folder name.
Image files saved in each folder store reflectance information in five raster bands: “B” (blue band), “G” (green band), “R” (red band), “RE” (red-edge band), and “NIR” (near infrared band). The first five hyphen-separated segments of the file name indicate the plot ID, while the next four segments indicate the image acquisition date, time, and the raw image sequence number.
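The file naming convention described above can be split programmatically. The following is a minimal sketch: the function name is hypothetical, and the example file name uses placeholder tokens rather than a real name from the archive, since the exact format of each segment is not specified here.

```python
from pathlib import Path


def split_image_name(file_name: str) -> dict:
    """Split a hyphen-separated image file name into plot ID and acquisition parts.

    Assumes the first five segments form the plot ID and the next four
    describe the acquisition (date, time, raw image sequence number).
    Any remaining segments are returned untouched.
    """
    segments = Path(file_name).stem.split("-")
    if len(segments) < 9:
        raise ValueError(f"expected at least 9 segments, got {len(segments)}")
    return {
        "plot_id": "-".join(segments[:5]),
        "acquisition": segments[5:9],
        "extra": segments[9:],
    }


# Placeholder name for illustration only; P* and A* are not real values.
example = "P1-P2-P3-P4-P5-A1-A2-A3-A4.tif"
parsed = split_image_name(example)
print(parsed["plot_id"])      # P1-P2-P3-P4-P5
print(parsed["acquisition"])  # ['A1', 'A2', 'A3', 'A4']
```

Grouping images by the returned plot ID collects all observations of one plot across acquisition dates.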
The “Plot-level_VI” folder stores the numerical values of plot-level traits extracted from the images saved in the “Multispec” folder. In addition to the reflectance values of the five spectral bands, three vegetation indices (“GNDVI”, “NDVI”, and “NDRE”) are also extracted.
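The three vegetation indices are conventionally defined as normalized differences between the near infrared band and the green, red, or red-edge band. The exact formulas and aggregation used to produce the stored values are not stated here, so the sketch below assumes the standard definitions and a simple per-plot mean; the function names and the toy reflectance values are illustrative only.

```python
import numpy as np


def normalized_difference(a, b):
    """Return (a - b) / (a + b) per pixel, with NaN where a + b is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.full_like(denom, np.nan)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out


def vegetation_indices(bands):
    """Compute per-pixel NDVI, GNDVI, and NDRE from a dict of band arrays."""
    nir = bands["NIR"]
    return {
        "NDVI": normalized_difference(nir, bands["R"]),
        "GNDVI": normalized_difference(nir, bands["G"]),
        "NDRE": normalized_difference(nir, bands["RE"]),
    }


# Toy 2 x 2 reflectance rasters, not values from the dataset.
bands = {
    "G": np.array([[0.08, 0.10], [0.09, 0.00]]),
    "R": np.array([[0.05, 0.10], [0.06, 0.00]]),
    "RE": np.array([[0.20, 0.25], [0.22, 0.00]]),
    "NIR": np.array([[0.45, 0.50], [0.48, 0.00]]),
}
indices = vegetation_indices(bands)

# Plot-level value as the mean over valid pixels (an assumed aggregation).
plot_level = {name: float(np.nanmean(img)) for name, img in indices.items()}
print(plot_level)
```

Pixels where both bands are zero yield NaN and are skipped by `np.nanmean`, which avoids division warnings at masked or empty pixels.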
The “Thermal” folder stores the thermal orthorectified images in Geo-Tiff format, cropped from orthorectified raw images captured by the FLIR VUE Pro R camera, as well as the numerical values of plot-level traits extracted from those images.
The CSV file is a key file that links the phenotype metadata with the genotype metadata.
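A key file of this kind is typically used as a lookup table between plot-level records and genotype identifiers. The sketch below shows one way to do that with the standard library. The column names (`plot_id`, `gid`) and all values are hypothetical; the actual headers are documented in the Table_Headers_Info spreadsheet.

```python
import csv
import io


def load_key(handle, plot_col, gid_col):
    """Read a key CSV into a {plot ID: genotype ID} dictionary."""
    return {row[plot_col]: row[gid_col] for row in csv.DictReader(handle)}


def attach_genotype_id(trait_rows, key, plot_col):
    """Yield trait rows extended with a genotype ID, skipping unmatched plots."""
    for row in trait_rows:
        gid = key.get(row[plot_col])
        if gid is not None:
            yield {**row, "genotype_id": gid}


# Toy stand-ins for the key file and a plot-level trait table.
key_csv = io.StringIO("plot_id,gid\nplot-001,G1\nplot-002,G2\n")
traits = [
    {"plot_id": "plot-001", "NDVI": "0.71"},
    {"plot_id": "plot-999", "NDVI": "0.65"},  # no entry in the key
]

key = load_key(key_csv, "plot_id", "gid")
linked = list(attach_genotype_id(traits, key, "plot_id"))
print(linked)
```

Plots without a matching entry are dropped here; keeping them with an empty genotype ID is an equally reasonable choice, depending on the analysis.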
2018 CIMMYT EYT dataset
There are five sub-folders in this dataset, containing processed images and extracted trait values for five different dates. Each sub-folder contains a “DEM_crop” folder, an “Or_crop” folder, and a “Traits_csv” folder.
The “DEM_crop” folder stores the Digital Elevation Model (DEM) images of each breeding plot in Geo-Tiff format cropped from the DEM of the entire EYT field trial. The image file name indicates the plot ID.
The “Or_crop” folder stores the reflectance orthorectified images in Geo-Tiff format, cropped from orthorectified raw images captured by the MicaSense RedEdge-M camera. Image file names ending with “B” (blue band), “G” (green band), “R” (red band), “RE” (red-edge band), and “NIR” (near infrared band) record single-band reflectance information. Image file names ending with “GNDVI”, “NDVI”, and “NDRE” record the three vegetation indices. The first five hyphen-separated segments of the file name indicate the plot ID, while the next four segments indicate the image acquisition date, time, and the raw image sequence number.
The “Traits_csv” folder stores the numerical values of plot-level traits extracted from the images in the “DEM_crop” and “Or_crop” folders.
Genotype dataset
There are two zipped files (.gz files) and two tabular files in this dataset. The two zipped files store the SNP data of all the wheat breeding lines planted from 2013 to 2020 at CIMMYT, Mexico. The first tabular file (Eyt1718_Gmat_982lines.xlsx) stores the genetic matrix information of the 2018 EYT field trials, while the second (Fieldbook_18-OBR-EYTBW-B5I.csv) includes metadata for each EYT plot.
Table_Headers_Info.xlsx
The Excel spreadsheet explains the column names in each tabular data file. "Header" gives the header name. "Notes" explains what the header name stands for, including units where necessary.