Data for: Machine learning for genomic and pedigree prediction in sugarcane
Data files
This dataset is embargoed and will be released when the associated article is published. Contact help@datadryad.org to notify us of article publication.
Lists of files and downloads will become available to the public when released.
Abstract
Sugarcane (Saccharum spp.) plays a crucial role in global sugar production; however, the efficiency of breeding programs has been hindered by its heterozygous polyploid genome. Accounting for non-additive genetic effects is essential in genomic prediction (GP) models for crops with highly heterozygous polyploid genomes. This study incorporates non-additive genetic effects and pedigree information using machine learning methods to track sugarcane breeding lines and improve prediction by assessing the degree of association between genotypes. We measured the stem biomass and sugar content of 297 clones from 87 families within a breeding population used in the Japanese sugarcane breeding program, and conducted analyses based on the marker genotypes of 33,149 single-nucleotide polymorphisms. To validate GP in this population, we first evaluated the prediction accuracy of best linear unbiased prediction (BLUP) based on a genomic relationship matrix. Prediction accuracy was assessed using two cross-validation methods: repeated 10-fold cross-validation and leave-one-family-out cross-validation. The accuracy of GP ranged from 0.36 to 0.74 with the first method and from 0.15 to 0.63 with the second. Next, we compared the prediction accuracy of BLUP with that of two machine learning methods: random forests and simulated annealing ensemble (SAE), a newly developed machine learning method that explicitly models interactions between variables. Both pedigree and genomic information were used as input to these methods. In repeated 10-fold cross-validation, the accuracy of the machine learning methods surpassed that of BLUP in most cases. In leave-one-family-out cross-validation, SAE achieved the highest accuracy among the methods. These results underscore the effectiveness of GP in Japanese sugarcane breeding and highlight the significant potential of machine learning methods.
https://doi.org/10.5061/dryad.0rxwdbs8p
This dataset was used in the paper to compare different methods for building genomic prediction models. Specifically, it includes phenotypic data (pheno.csv), marker genotype data (geno.csv), a genomic relationship matrix (grm.csv), a numerator relationship matrix (prm.csv), and family IDs (fam.csv) for lines evaluated in the sugarcane breeding program in Japan. The paper used these data to compare a new machine learning method called Simulated Annealing Ensemble (SAE) and Random Forest with commonly used genomic prediction methods such as genomic BLUP (GBLUP) and pedigree-based BLUP (PBLUP).
Description of the data and file structure
pheno.csv: Estimated genotypic values calculated from multi-environment data using a mixture model. These are the response (y) values for genomic prediction.
Columns: traits, rows: genotypes
geno.csv: Scored marker genotype data for all genotypes. Each score is either 0 or 1.
Columns: markers, rows: genotypes
grm.csv: Genomic relationship matrix. A square matrix of the number of genotypes ✕ number of genotypes.
prm.csv: Numerator relationship matrix (Relationship matrix computed from the pedigree). A square matrix of the number of genotypes ✕ number of genotypes.
fam.csv: Family ID of each genotype. Rows: genotypes.
Missing values are expressed as NA.
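As an illustration of the file layout described above, the following is a minimal sketch of loading the five files with pandas and aligning them on a common set of genotypes. It assumes that each file has a header row and that its first column holds the genotype IDs; neither detail is stated in this description, so check the files before relying on it. The function name load_dataset is hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_dataset(data_dir):
    """Load the five CSV files and align them on shared genotype IDs.

    Assumption (not stated in the dataset description): every file has a
    header row, and its first column contains the genotype IDs.
    """
    data_dir = Path(data_dir)
    names = ["pheno", "geno", "grm", "prm", "fam"]
    tables = {}
    for name in names:
        # pandas treats the string "NA" as a missing value by default.
        table = pd.read_csv(data_dir / f"{name}.csv", index_col=0)
        table.index = table.index.astype(str)
        tables[name] = table

    # Keep genotypes present in every file, in the order used by pheno.csv.
    common = [
        g for g in tables["pheno"].index
        if all(g in t.index for t in tables.values())
    ]
    for name in ("pheno", "geno", "fam"):
        tables[name] = tables[name].loc[common]
    # Relationship matrices are square: reorder rows and columns together.
    for name in ("grm", "prm"):
        tables[name].columns = tables[name].columns.astype(str)
        tables[name] = tables[name].loc[common, common]
    return tables
```

After loading, tables["grm"] and tables["prm"] should both be square with one row and one column per genotype, and tables["pheno"].isna() shows which trait values are missing.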
Code/Software
The Python library for the newly developed SAE method is accessible at https://github.com/m-inamori/SAEnsemble.
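The SAE library's interface is not reproduced here. As a library-independent illustration of the leave-one-family-out cross-validation mentioned in the abstract, the following is a minimal NumPy sketch of a GBLUP-style baseline. It is not the authors' implementation: the variance ratio lam is a hypothetical fixed value (in practice it would be estimated, for example by REML), pooling predictions across folds before computing the correlation is an assumption, and the function names are illustrative only.

```python
import numpy as np


def gblup_predict(K, y, train, test, lam=1.0):
    """Predict y[test] from y[train] using a relationship matrix K.

    lam is an assumed ratio of residual to genetic variance (hypothetical
    fixed value; it is not estimated from the data in this sketch).
    """
    mu = y[train].mean()
    K_train = K[np.ix_(train, train)] + lam * np.eye(len(train))
    alpha = np.linalg.solve(K_train, y[train] - mu)
    return mu + K[np.ix_(test, train)] @ alpha


def leave_one_family_out(K, y, families, lam=1.0):
    """Hold out each family in turn and return the correlation between
    observed and predicted values, pooled over all held-out genotypes."""
    K = np.asarray(K, dtype=float)
    y = np.asarray(y, dtype=float)
    families = np.asarray(families)
    observed = ~np.isnan(y)  # skip genotypes whose trait value is NA
    pred = np.full(y.shape, np.nan)
    for fam in np.unique(families):
        test = np.where((families == fam) & observed)[0]
        train = np.where((families != fam) & observed)[0]
        if len(test) == 0 or len(train) == 0:
            continue
        pred[test] = gblup_predict(K, y, train, test, lam)
    ok = ~np.isnan(pred)
    return np.corrcoef(y[ok], pred[ok])[0, 1]
```

With the files loaded as arrays, K would be the contents of grm.csv (or prm.csv for a pedigree-based baseline), y one trait column of pheno.csv, and families the IDs from fam.csv, all in the same genotype order.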
