Data from: Multi-generation genomic prediction of maize yield using parametric and non-parametric sparse selection indices

Cite this dataset

Lopez-Cruz, Marco et al. (2021). Data from: Multi-generation genomic prediction of maize yield using parametric and non-parametric sparse selection indices [Dataset]. Dryad.


Genomic prediction models are often calibrated using multi-generation data. Over time, as data accumulate, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This leads to the question of whether all available data or a subset of it should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not consider the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic-BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that when predicting grain yield the KBLUP outperformed the GBLUP, and that using SSI with additive relationships (GSSI) led to 5-17% increases in accuracy relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.
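The two kinds of relationship matrices contrasted above (additive genomic relationships for GBLUP/GSSI and a Gaussian kernel for KBLUP) can be illustrated with a minimal sketch. It assumes a numeric lines-by-markers matrix; the scaling of the additive matrix, the bandwidth `h`, and the normalization of distances by their mean are illustrative assumptions, not settings taken from this dataset's description. The sketch does not implement the sparse selection index itself.

```python
import numpy as np

def additive_relationship(X):
    """Additive-type relationship matrix from an n x p marker matrix.

    Assumption: markers are column-centered and the matrix is scaled so
    that its average diagonal equals 1 (one common convention).
    """
    Z = X - X.mean(axis=0)            # center each marker column
    G = Z @ Z.T
    return G / np.mean(np.diag(G))    # average diagonal becomes 1

def gaussian_kernel(X, h=1.0):
    """Gaussian kernel K = exp(-h * D2 / mean(D2)).

    D2 holds squared Euclidean distances between lines; h is a
    hypothetical bandwidth parameter chosen by the user.
    """
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    D2 = np.maximum(D2, 0.0)          # guard against tiny negative values
    np.fill_diagonal(D2, 0.0)
    return np.exp(-h * D2 / D2.mean())

# Toy presence/absence data (simulated, not the real markers)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(6, 50)).astype(float)
G_toy = additive_relationship(X_toy)
K_toy = gaussian_kernel(X_toy, h=1.0)
```

Either matrix could then be supplied as the covariance structure of a BLUP-type model; larger values of `h` make the kernel decay faster with genetic distance.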


Data consist of 3722 Doubled-Haploid (DH) lines derived from biparental families developed at CIMMYT’s Maize DH facility at the Agricultural & Livestock Research Organization (KALRO) in Kiboko, Kenya. The biparental families were obtained by crossing elite inbred lines with drought-tolerant lines. The DH lines were selected from a larger population (based on the results of evaluating germination, good stand, plant type, low ear placement, and well-filled ears) for stage I multi-location yield trials conducted from 2017 to 2020.

Each year, the selected DH lines were crossed with a single-cross tester from the complementary heterotic group to generate three-way hybrids that were evaluated under well-watered (denoted as optimal) and drought conditions. Trials were planted in an alpha-lattice design with two replications and evaluated in two well-watered locations and one managed drought stress location during the 2017, 2018, 2019, and 2020 growing seasons. Grain yield (GY, tons/ha), anthesis date (AD, days), and plant height (PH, cm) were recorded. Plots were manually harvested and GY was corrected to a moisture of 12.5%. AD was measured as the number of days from planting until 50% of the plants shed pollen, and PH was measured from the soil surface to the flag leaf collar on five representative plants in each plot.

DNA samples from leaves were sent to the Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA, for genotyping with repetitive sequences (rAmpSeq). A segregation distortion analysis was performed on a total of 5465 dominant markers coded as 0 (absence) and 1 (presence), from which 61 markers were discarded at a 5% FDR. The remaining markers were filtered by minor allele frequency (markers with MAF < 0.05 were removed), leaving 4612 markers that were used for analyses.
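The minor allele frequency filter described above can be sketched as follows. This is a minimal illustration, assuming a 0/1 lines-by-markers matrix and treating the frequency of presence calls across lines as the allele frequency; the function name and the toy data are hypothetical, and the segregation distortion test is not reproduced here.

```python
import numpy as np

def filter_by_maf(X, threshold=0.05):
    """Drop markers whose minor allele frequency is below `threshold`.

    X is an n x p matrix of 0 (absence) / 1 (presence) calls.
    Returns the filtered matrix and the boolean mask of kept columns.
    """
    freq = X.mean(axis=0)                  # frequency of presence calls
    maf = np.minimum(freq, 1.0 - freq)     # frequency of the rarer state
    keep = maf >= threshold
    return X[:, keep], keep

# Toy example with 40 lines and 4 markers (simulated, not the real data)
X_toy = np.zeros((40, 4))
X_toy[:20, 0] = 1   # presence frequency 0.50  -> kept
                    # marker 1 is absent everywhere (MAF 0) -> dropped
X_toy[:1, 2] = 1    # presence frequency 0.025 -> dropped
X_toy[:30, 3] = 1   # presence frequency 0.75 (MAF 0.25) -> kept
X_filtered, kept = filter_by_maf(X_toy, threshold=0.05)
```

The deposited 'Geno_data.csv' file already contains only the 4612 markers that passed filtering, so this step would matter only when starting from unfiltered calls.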

The adjusted means of GY, AD and PH were obtained using mixed-effects models fitted separately for each trait-environmental-condition-year combination. The Best Linear Unbiased Estimates (BLUE) of genotypes for the optimal experiments were estimated within year across the two locations by fitting models including Genotype, Location, Replicate, Block, and Genotype-by-Location interaction. Likewise, within each year, the BLUE for each trait for the single-location drought experiment was obtained through a linear model including Genotype, Replicate, and Block only.
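As a rough illustration of how adjusted genotype means arise from a model with Genotype, Replicate, and Block terms (the single-location drought case), the sketch below fits all terms as fixed effects by least squares and averages over the replicate and block effects. This is a simplification under stated assumptions: the description above mentions mixed-effects models without saying which terms were random, blocks are assumed to be nested within replicates, and the function names and toy data are hypothetical.

```python
import numpy as np

def one_hot(labels):
    """Return the sorted levels and an indicator matrix for `labels`."""
    levels = sorted(set(labels))
    col = {lv: j for j, lv in enumerate(levels)}
    M = np.zeros((len(labels), len(levels)))
    for i, lab in enumerate(labels):
        M[i, col[lab]] = 1.0
    return levels, M

def genotype_adjusted_means(y, genotype, replicate, block):
    """Least-squares genotype means adjusted for replicate and block.

    Assumption: every term is fixed and blocks are nested in replicates.
    """
    cell = list(zip(replicate, block))      # unique block within replicate
    g_levels, G = one_hot(genotype)
    _, R = one_hot(replicate)
    cells, B = one_hot(cell)
    X = np.hstack([G, R, B])                # rank-deficient design matrix
    # lstsq returns the minimum-norm solution, which is sufficient for
    # estimable functions such as the adjusted means computed below
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    n_g = len(g_levels)
    first_row = [cell.index(c) for c in cells]          # one row per block
    nuisance = np.hstack([R, B])[first_row].mean(axis=0)
    means = beta[:n_g] + nuisance @ beta[n_g:]  # average over all blocks
    return dict(zip(g_levels, means))

# Toy incomplete-block layout: 4 genotypes, 2 replicates, blocks of size 2
genotype = ["A", "B", "C", "D", "A", "C", "B", "D"]
replicate = [1, 1, 1, 1, 2, 2, 2, 2]
block = [1, 1, 2, 2, 1, 1, 2, 2]
true_g = {"A": 5.0, "B": 6.0, "C": 7.0, "D": 8.0}
block_eff = {(1, 1): 0.5, (1, 2): -0.5, (2, 1): 1.0, (2, 2): -1.0}
y = [true_g[g] + block_eff[(r, b)]
     for g, r, b in zip(genotype, replicate, block)]
adjusted = genotype_adjusted_means(y, genotype, replicate, block)
```

In this noise-free toy layout the block effects average to zero, so the adjusted means recover the simulated genotype values, whereas the raw genotype averages would be shifted by whichever blocks each genotype happened to fall in. The across-location model for the optimal experiments would additionally need Location and Genotype-by-Location terms, which this sketch does not cover.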

A total of n = 3527 lines with both marker and phenotypic information remained after quality control. The final numbers of lines in 2017, 2018, 2019, and 2020 are n = 901, n = 1418, n = 722, and n = 486, respectively.

Data from the 2017 and 2018 cycles have been previously described and analyzed by Beyene et al. (2019) and Atanda et al. (2020).

Usage notes

The data contain phenotypic observations on 3528 genotypes, of which 901, 1419, 722, and 486 are from 2017, 2018, 2019, and 2020, respectively.

Missing values: In 2018, genotype with GID '973680' was not observed in Optimal experiments and genotype '976304' was not observed in Drought experiments. Genotype '1132699' from 2019 was not recorded in Drought experiments.

Phenotypic data: File 'Pheno_data.csv' is a matrix containing the adjusted means for all 3528 genotypes (in rows) for each trait-environmental-condition combination (in columns). Column 'GID' contains the Genotype ID and column 'Year' contains the cycle to which each genotype belongs.

Genotypic data: File 'Geno_data.csv' contains, for each genotype (in rows), presence-absence marker information on 4612 markers (in columns). Column 'GID' contains the Genotype ID and matches the 'GID' column in the 'Pheno_data.csv' file.
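A minimal sketch of how the two files might be matched on 'GID' with pandas is shown below. Only the 'GID' and 'Year' column names come from the description above; the trait and marker column names in the toy data (such as 'GY_Optimal' and 'M1') are hypothetical placeholders, and the helper `align_by_gid` is illustrative only.

```python
import io
import pandas as pd

def align_by_gid(pheno, geno):
    """Keep genotypes present in both tables, in the same row order."""
    pheno = pheno.assign(GID=pheno["GID"].astype(str)).set_index("GID")
    geno = geno.assign(GID=geno["GID"].astype(str)).set_index("GID")
    shared = pheno.index.intersection(geno.index)
    return pheno.loc[shared], geno.loc[shared]

# With the real files one would read them directly, for example:
#   pheno = pd.read_csv("Pheno_data.csv")
#   geno = pd.read_csv("Geno_data.csv")
# Toy stand-ins (hypothetical trait and marker column names):
pheno = pd.read_csv(io.StringIO(
    "GID,Year,GY_Optimal,GY_Drought\n"
    "101,2017,6.1,2.3\n"
    "102,2018,,2.9\n"
    "103,2018,5.7,\n"
))
geno = pd.read_csv(io.StringIO(
    "GID,M1,M2\n"
    "102,1,0\n"
    "103,0,1\n"
    "104,1,1\n"
))

pheno_a, geno_a = align_by_gid(pheno, geno)
lines_per_year = pheno_a.groupby("Year").size()   # genotypes per cycle
missing_per_column = pheno_a.isna().sum()         # unobserved values
```

Counting missing values per column is a quick way to confirm the unobserved genotype-condition combinations listed under "Missing values" before fitting any model.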


Bill & Melinda Gates Foundation

Monsanto Beachell-Borlaug International Scholar Program

National Institute of Food and Agriculture, Award: 2021-67015-33413