Dryad

Supporting data for: Multi-modal deep learning improves grain yield prediction in wheat breeding by fusing genomics and phenomics

Data files

May 18, 2023 version files 99.45 GB

Abstract

Identifying and growing new crop varieties with the highest yield is essential to ensuring robust and sustainable food supplies for the global population. Plant breeding programs benefit from increasing technological support but still rely on full growth cycles and manual yield measurements, which slows development. While methods to predict yield have been proposed, satisfactory levels of performance have yet to be reached. In this study, we propose a new machine learning model that leverages both genotype and phenotype measurements by fusing multiple sources of input data: longitudinal multispectral and thermal images and digital elevation models collected by unmanned aerial systems, along with single nucleotide polymorphism (SNP) measurements. To handle the varying number of observations for each sample, we use a deep multiple instance learning framework with an attention mechanism, which also sheds light on the importance the trained model gives to each data input during prediction, enhancing interpretability.
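To illustrate how attention-based multiple instance learning can handle a varying number of observations per sample, the following is a minimal NumPy sketch of attention pooling over a "bag" of instance embeddings. It uses a common tanh-attention formulation and is not the architecture from this study; the function name `attention_mil_pool`, the parameters `V` and `w`, and all dimensions are assumptions chosen for illustration, and the parameters are random rather than trained.

```python
import numpy as np


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance embeddings of shape (n, d) into one (d,) vector.

    Hypothetical sketch: V (d, h) and w (h,) stand in for learned attention
    parameters. Returns the pooled vector and the per-instance weights.
    """
    # One unnormalized attention score per instance: w^T tanh(V^T h_k)
    scores = np.tanh(instances @ V) @ w
    # Numerically stable softmax over the instances in the bag
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    # Weighted average of the instance embeddings
    pooled = weights @ instances
    return pooled, weights


rng = np.random.default_rng(0)
d, h = 8, 4                      # embedding and attention sizes (assumed)
V = rng.normal(size=(d, h))
w = rng.normal(size=h)

# Two samples with different numbers of observations (e.g. image dates)
bag_small = rng.normal(size=(3, d))
bag_large = rng.normal(size=(7, d))

pooled_small, weights_small = attention_mil_pool(bag_small, V, w)
pooled_large, weights_large = attention_mil_pool(bag_large, V, w)
```

Both bags map to a fixed-size vector regardless of how many observations they contain, and the attention weights sum to one within each bag, so they can be read as the relative importance assigned to each observation.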

Our model reaches a Pearson correlation coefficient of 0.754 ± 0.024 when predicting yield in similar environmental conditions, a 34.8% improvement over the genotype-only linear baseline (0.559 ± 0.050). Moreover, the model transfers to a new, unseen environment: when predicting yield for new lines from genotypes alone, it obtains a Pearson correlation coefficient of 0.386 ± 0.010 (0.407 for ensemble performance), a 13.5% improvement over the linear baseline. We show that our multi-modal deep learning architecture efficiently accounts for plant health and environment, thereby distilling out the genetic contribution and providing excellent predictions. Yield prediction algorithms that leverage phenotypic observations during training are therefore expected to improve plant breeding programs through more accurate selection of lines, speeding up the delivery of improved varieties.
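As a small worked example of the metrics quoted above, the sketch below computes a Pearson correlation coefficient with NumPy and the relative improvement of one score over a baseline. The helper names and the toy yield values are hypothetical. Only the two correlation figures passed to `relative_improvement` come from the abstract; because they are rounded, the result may differ in the last digit from the reported percentage.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def relative_improvement(score, baseline):
    """Fractional improvement of `score` over `baseline`."""
    return (score - baseline) / baseline


# Toy data (hypothetical): a prediction that is an exact linear function of
# the observed yield has a correlation of 1.
observed = np.array([4.1, 5.0, 5.6, 6.3, 7.2])
predicted = 2.0 * observed + 1.0
r_toy = pearson_r(observed, predicted)

# Rounded correlation figures quoted in the abstract
gain = relative_improvement(0.754, 0.559)
print(f"toy r = {r_toy:.3f}, relative improvement = {gain:.1%}")
```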