Supplementary data files for: Genomics of iron and zinc concentration, iron bioavailability, and yield in common bean: multi-locus GWAS, structural equation modelling, candidate gene prioritization and genomic selection.
Data files
Mar 05, 2026 version files (45.75 KB)
- README.md (16.58 KB)
- Supplementary_Table_1_5.zip (29.18 KB)
Abstract
This submission contains supplementary datasets supporting the genetic and genomic analyses of agronomic, nutritional, and quality traits in a panel of 427 common bean (Phaseolus vulgaris L.) genotypes.
Supplementary Table 1 provides Best Linear Unbiased Predictors (BLUPs) and Best Linear Unbiased Estimates (BLUEs) for all evaluated genotypes across the studied traits.
Supplementary Table 2 presents results from single-trial analysis of variance (ANOVA) for the measured phenotypes.
Supplementary Table 3 summarizes quantitative trait nucleotide (QTN) discovery results, including trait identity, detection method, chromosomal location, marker position, allelic effects, LOD scores, significance probabilities, proportion of phenotypic variance explained, minor allele frequency, and the number of methods supporting each SNP.
Supplementary Table 4 contains functional annotation of prioritized candidate genes identified within ±131,550 base pairs of significant SNP markers, including marker support across models, pathway descriptions, and gene ontology terms for cooking time, soaking ability, polyphenols, flavonoids, seed coat darkening, iron, zinc, and yield.
Supplementary Table 5 provides genomic prediction performance metrics, including Pearson correlation coefficient (PCC), mean absolute error (MAE), and the combined index for correlation and error (CICE). Collectively, these supplementary files provide detailed phenotypic summaries, association mapping outputs, candidate gene prioritization, and prediction accuracy metrics to support reproducibility and further research.
Dataset DOI: 10.5061/dryad.dz08kpsbz
Description of the data and file structure
The data were collected from multi-trait phenotypic evaluations of 427 diverse common bean (Phaseolus vulgaris L.) genotypes conducted under field and laboratory conditions to quantify agronomic performance, seed nutritional quality, and traits related to iron bioavailability. The experimental efforts focused on measuring grain yield, seed iron (Fe) and zinc (Zn) concentrations, and traits associated with bioavailable iron (BioFe), including cooking time, seed coat darkening, total polyphenols, flavonoids, and phytates. The study aimed to uncover the genetic associations conditioning iron, zinc, yield, and BioFe-related traits through genomic analyses, and to inform the optimization of genomic selection models for these traits.
Files and variables
File: SupplementaryTable1_5.zip
Description: Five supplementary tables are included in the zip folder. Missing values are indicated as NA.
Supplementary Table 1: Best Linear Unbiased Predictors (BLUPs) and Best Linear Unbiased Estimates (BLUEs) for evaluated genotypes.
Genotype_ID: A unique identifier assigned to each genotype and linked to the DNA marker dataset
GenotypeName: The official or commonly used name of the genotype corresponding to the Genotype_ID, used for reporting and identification in trials.
BLUPs: Best Linear Unbiased Predictors for evaluated genotypes
BLUEs: Best Linear Unbiased Estimates for evaluated genotypes
Yield (kg/ha): Yield estimated in kilograms per hectare from six environments
Seed Iron (ppm): Iron content in uncooked dry common bean seeds measured in parts per million (from six environments)
Seed Zinc (ppm): Zinc content in uncooked dry common bean seeds measured in parts per million (from six environments)
Cooking Time (minutes): Determined in presoaked beans using Matson Cookers (from one environment)
Hydration Coefficient: Water absorption capacity of common beans soaked for 12 hours (from one environment)
Total Polyphenol (mg/g): Measured in dry common bean seeds (from one environment)
Flavonoid (mg/g): Measured in dry common bean seeds (from one environment)
Phytate (mg/g): Measured in dry common bean seeds (from one environment)
Condensed Tannins (mg/g): Measured in dry common bean seeds (from one environment)
p < 0.05 (*): Significant at the 5% level; if there were no true effect, a result at least this extreme would be expected less than 5% of the time.
p < 0.01 (**): Significant at the 1% level; such a result would be expected less than 1% of the time if there were no true effect.
p < 0.001 (***): Significant at the 0.1% level; such a result would be expected less than 0.1% of the time if there were no true effect.
L* value: Represents lightness in the CIELAB (L*a*b*) color space; L* = 0 is black (completely dark) and L* = 100 is white (completely bright).
Supplementary Table 2: Single-trial analysis of variance (ANOVA) for the evaluated phenotypes of 427 common bean genotypes.
Environment: The specific testing location in different rainy seasons where phenotypic data were collected.
Statistic: The calculated values summarizing trait performance or genetic parameters within a given environment
DF (Days to 50% flowering): Number of days from planting to the point when 50% of the plants in a plot have at least one open flower
DPM (Days to 50% physiological maturity): Number of days from planting to the stage when 50% of the plants in a plot reach physiological maturity, typically indicated by pod color change and seed filling completion
YDHA: Yield (kg/ha)
FESEED: Seed iron content (ppm, parts per million)
ZNSEED: Seed zinc content (ppm)
COOKT: Cooking time (minutes) of presoaked beans measured using Matson Cookers
HC: Hydration coefficient (water uptake or soaking ability) of dry beans soaked for 12 hours in distilled water
LSD: Least significant difference
CV(%): Coefficient of variation
Kag18A: Kagera site, first rainy season of 2018
Kag18B: Kagera site, second rainy season of 2018
Kaw20A: Kawanda site, first rainy season of 2020
Rwe20A: Rwebitaba site, first rainy season of 2020
ANOVA: Analysis of variance
Broad-sense heritability (H²) is the proportion of total observed variation in a trait that is due to genetic differences among individuals.
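As a hedged illustration of the heritability definition above, the sketch below computes H² on an entry-mean basis from variance components. The formula is an assumption (a common convention for replicated trials); this README does not state which formula was used, and the variance values are invented.

```python
def broad_sense_heritability(var_g, var_e, n_reps, var_ge=0.0, n_envs=1):
    """Entry-mean broad-sense heritability (assumed convention).

    H2 = var_g / (var_g + var_ge / n_envs + var_e / (n_envs * n_reps))
    With the defaults (one environment, no GxE term) this reduces to
    var_g / (var_g + var_e / n_reps).
    """
    return var_g / (var_g + var_ge / n_envs + var_e / (n_envs * n_reps))

# Hypothetical variance components for a single trial with two replications.
print(broad_sense_heritability(var_g=4.0, var_e=8.0, n_reps=2))
```

A larger genotypic variance relative to the error variance moves H² towards 1; more replications reduce the error contribution to the denominator.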
Supplementary Table 3: Trait, method of quantitative trait nucleotide (QTN) identification, chromosome, position, QTN effect, LOD score, probability, proportion of phenotypic variation explained, minor allele frequency, alleles, and number of methods that identified the same single nucleotide polymorphism (SNP) for nine traits.
QTN effect: The estimated effect size of the Quantitative Trait Nucleotide (QTN) on the trait, indicating the magnitude and direction of its contribution to phenotypic variation.
LOD Score: The log10 of the odds ratio indicating the strength of evidence for a QTL at a given genomic position. Higher LOD values indicate stronger evidence of association.
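p. value < 0.05: Significant at the 5% level; if there were no true association, a result at least this extreme would be expected less than 5% of the time.
p. value < 0.01: Significant at the 1% level; such a result would be expected less than 1% of the time if there were no true association.
p. value < 0.001: Significant at the 0.1% level; such a result would be expected less than 0.1% of the time if there were no true association.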
R² (Proportion of Phenotypic Variation Explained): The proportion of total phenotypic variance in a trait that is explained by a specific QTL (marker). It indicates the relative contribution of the identified genetic factor to variation observed in the population.
MAF (Minor Allele Frequency): The frequency at which the less common allele occurs in a given population for a specific genetic marker or locus. It was used to assess allele distribution and filter rare variants.
SNP: Single nucleotide polymorphisms
mrMLM: Multi-locus random-SNP-effect mixed linear model
FASTmrMLM: Fast multi-locus random-SNP-effect mixed linear model
FASTmrEMMA: Fast multi-locus random-SNP-effect efficient mixed model association
ISIS EM-BLASSO: Iterative sure independence screening EM Bayesian LASSO
pKWmEB: Polygenic Kruskal-Wallis test with empirical Bayes
pLARmEB: Polygenic least angle regression with empirical Bayes
L* value: Represents lightness in the CIELAB (L*a*b*) color space; L* = 0 is black (completely dark) and L* = 100 is white (completely bright).
YDHA: Yield (kg/ha)
FESEED: Seed iron content (ppm, parts per million)
ZNSEED: Seed zinc content (ppm)
COOKT: Cooking time (minutes) of presoaked beans measured using Matson Cookers
HC: Hydration coefficient (water uptake or absorption capacity) of bean seeds soaked for 12 hours in distilled water
Total Polyphenol (mg/g): Measured in dry common bean seeds
Flavonoid (mg/g): Measured in dry common bean seeds
Phytate (mg/g): Measured in dry common bean seeds
Condensed Tannins (mg/g): Measured in dry common bean seeds
Method: The statistical method that provided the strongest evidence for the detected QTL, and whose parameters (e.g., effect size, LOD score, p-value, R²) are reported.
No. of Methods: The total number of different QTL detection methods (e.g., mrMLM and ISIS EM-BLASSO) that detected a statistically significant QTL at a given genomic region.
Chromosome: The specific chromosome carrying the genomic region (QTL or marker) associated with the trait under study.
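The LOD score, probability, and MAF columns defined above are related by simple arithmetic, sketched below. The LOD-to-probability conversion assumes a likelihood-ratio test with one degree of freedom, which is a common convention for multi-locus GWAS output; whether the tabulated probabilities were derived exactly this way is not stated here. The 0/1/2 genotype coding and the function names are illustrative assumptions.

```python
import math

def lod_to_p(lod):
    """Convert a LOD score to a p-value with a 1-df chi-square test.

    LRT = 2 * ln(10) * LOD; for 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    lrt = 2.0 * math.log(10.0) * lod
    return math.erfc(math.sqrt(lrt / 2.0))

def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1, or 2 copies of one allele.

    Missing calls are given as None and ignored.
    """
    calls = [g for g in genotypes if g is not None]
    freq = sum(calls) / (2 * len(calls))
    return min(freq, 1.0 - freq)

# A LOD score of 3 corresponds to a p-value of roughly 2e-4 under this convention.
print(lod_to_p(3.0))
# Hypothetical genotype calls at one SNP.
print(minor_allele_frequency([0, 0, 1, 2, None]))
```

Higher LOD scores map to smaller probabilities, which is why the two columns rank QTNs in the same order for a given method.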
Supplementary Table 4: Name, number of models that identified the SNP marker, pathway description and ontology of prioritized genes for cooking time (COOKT), soaking ability (HC), total polyphenols, flavonoids, seed coat darkening (L*), iron (Fe), zinc (Zn) and yield (YDHA) identified within ±131550 base pairs of the significant SNP markers.
Gene Name: The official name of the gene associated with the identified trait or genomic region.
Trait Name: The phenotypic trait linked to the gene, QTL, or genomic marker based on association or functional annotation.
No. of Methods: Number of statistical methods that identified a significant Quantitative Trait Locus (QTL) for the trait under study.
Pathway Description / Gene Ontology (GO) Description: Functional annotation describing the biological pathway, molecular function, or biological process in which the gene is involved, based on pathway databases or Gene Ontology classification.
SNP: Single nucleotide polymorphic marker
YDHA (kg/ha): Yield estimated in kilograms per hectare
FESEED (ppm): Iron content in uncooked dry common bean seeds measured in parts per million
ZNSEED (ppm): Zinc content in uncooked dry common bean seeds measured in parts per million
COOKT: Cooking time (minutes) determined in presoaked beans using Matson Cookers
Hydration Coefficient: Soaking ability or water absorption capacity of common beans soaked for 12 hours
Total Polyphenol (mg/g): Measured in dry common bean seeds
Flavonoid (mg/g): Measured in dry common bean seeds
Phytate (mg/g): Measured in dry common bean seeds
Condensed Tannins (mg/g): Measured in dry common bean seeds
L* value: Represents lightness in the CIELAB (L*a*b*) color space; L* = 0 is black (completely dark) and L* = 100 is white (completely bright).
Supplementary Table 5: Pearson Correlation Coefficient (PCC), Mean Absolute Error (MAE) and combined index for correlation and error (CICE).
Notes: Describes the data structure and approach used for genomic prediction. Within-environment prediction treated some phenotypic records within an environment as missing and imputed them using the data available from that same environment. Whole-environment prediction treated entire environments as missing and predicted their performance using information from other environments. Across-environment prediction was conducted using environment means, with random partitioning of the data into training and testing sets to evaluate predictive performance.
Environment: Testing location
Model: The statistical models used for genomic prediction
Kaw: Kawanda
Kag: Kagera
BMORS: Bayesian multi-output regressor stacking
BRR: Bayesian ridge regression
GBLUP: Genomic best linear unbiased prediction
ABLUP: Pedigree best linear unbiased prediction
HBLUP: Hybrid best linear unbiased prediction
RKHS: Reproducing Kernel Hilbert Space
Meso: Mesoamerican gene pool
majorQTLfixedfactor: Major QTL fitted as fixed factors in the model
YDHA (kg/ha): Yield estimated in kilograms per hectare
FESEED (ppm): Iron content in uncooked dry common bean seeds measured in parts per million
ZNSEED (ppm): Zinc content in uncooked dry common bean seeds measured in parts per million
Cooking time (minutes): Determined in presoaked beans using Matson Cookers
Hydration Coefficient: Soaking ability or water absorption capacity of common beans soaked for 12 hours
Total Polyphenol (mg/g): Measured in dry common bean seeds
Flavonoid (mg/g): Measured in dry common bean seeds
Phytate (mg/g): Measured in dry common bean seeds
Condensed Tannins (mg/g): Measured in dry common bean seeds
PCC: Pearson Correlation Coefficient
MAE: Mean Absolute Error
CICE: Combined index for correlation and error
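The whole-environment scheme described in the Notes for this table can be pictured as a leave-one-environment-out split. The sketch below is only an illustration under assumptions: the record layout and field names (`env`, `genotype`, `value`) are hypothetical, the values are invented, and the actual analyses were run with the R packages described under Code/software.

```python
def whole_environment_splits(records):
    """Yield (env, train, test), holding out one whole environment at a time."""
    for env in sorted({r["env"] for r in records}):
        train = [r for r in records if r["env"] != env]
        test = [r for r in records if r["env"] == env]
        yield env, train, test

# Hypothetical records; environment codes follow Supplementary Table 2.
records = [
    {"env": "Kag18A", "genotype": "G1", "value": 61.0},
    {"env": "Kag18A", "genotype": "G2", "value": 58.5},
    {"env": "Kaw20A", "genotype": "G1", "value": 64.2},
    {"env": "Kaw20A", "genotype": "G2", "value": 60.1},
    {"env": "Rwe20A", "genotype": "G1", "value": 59.8},
]
for env, train, test in whole_environment_splits(records):
    print(env, len(train), len(test))
```

Within-environment prediction differs in that only some records inside an environment are masked, so the training set still contains data from the environment being predicted.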
Code/software
Phenotype data
Single-environment data were analyzed in Breeding View software v1.8.0.52 to clean the data and assess within-trial variability (VSN, 2017). Data were cleaned by both the raw data and residual methods. The raw data method flags observations that exceed 1.5 times the interquartile range, and the residual method reports large standardized residuals identified by the mixed model analysis (VSN, 2017). Cleaned trial data were then subjected to a single-step combined analysis of variance (ANOVA) in META-R v6.0; this was not feasible in Breeding View, which performs a two-stage analysis. The following linear model was used:

$$Y_{ijkl} = GM + E_i + R_j + B_k + G_l + (G \times E)_{li} + \varepsilon_{ijkl}$$

where $Y_{ijkl}$ = observed value, $GM$ = grand mean, $E_i$ = environment effect, $R_j$ = replication effect, $B_k$ = block effect, $G_l$ = genotype effect, $(G \times E)_{li}$ = genotype × environment effect, and $\varepsilon_{ijkl}$ = error (Alvarado et al., 2020).

Cooking time, hydration coefficient, biochemical compounds, and seed coat color lightness (L*) were evaluated in a single environment and analyzed using the linear model:

$$Y_{ijk} = GM + R_i + R_i(b_j) + G_k + e_{ijk}$$

where $Y_{ijk}$ = observed value, $GM$ = grand mean, $R_i$ = replication effect, $R_i(b_j)$ = effect of the incomplete block within a replication, $G_k$ = genotype effect, and $e_{ijk}$ = error. Data for L* were analyzed separately for each market class. Pearson correlation coefficients (r) were determined in R software.
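The raw data screening rule can be illustrated with a minimal sketch. It assumes the common Tukey convention, in which values more than 1.5 times the interquartile range below the first quartile or above the third quartile are flagged. The exact quartile method used by Breeding View is not stated here, and the function name and sample values are hypothetical.

```python
import statistics

def flag_iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical plot yields (kg/ha); the last value is far above the rest.
plot_yields = [1500, 1620, 1580, 1710, 1490, 1650, 5200]
print(flag_iqr_outliers(plot_yields))  # -> [6]
```

Flagged observations are candidates for review rather than automatic removal; the residual method described above provides a second, model-based check.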
Genome wide association study
The QTL associated with the target traits were determined using multi-locus models for single traits. To discriminate between pleiotropic and single-trait SNPs, multi-trait multi-locus structural equation modeling (SEM) was also performed. Multi-locus methods incorporate regularization techniques or Bayesian frameworks that control the overall false positive rate (Wang et al., 2015), which allows more QTL to be detected (Zhang et al., 2020). These models are also more efficient for complex traits controlled by several major loci and multiple loci with small effects (Wang et al., 2015).
Candidate gene identification around QTNs and prioritization
Genes located around the QTL and within the same LD block were considered candidate genes (CGs), based on the Phaseolus vulgaris v2.1 reference genome in Phytozome v13 (https://phytozome-next.jgi.doe.gov/). Within an LD block, the correlation between genetic variants remains high, so the variants tend to be inherited together. The CG prioritization steps included:
- Keyword search
- Ortholog and co-expression analyses
- Identification of conserved domains within the protein sequences
- Final keyword search
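Before these prioritization steps, candidate genes have to be collected around each significant SNP. The sketch below shows one way to do that with the ±131,550 bp window reported for Supplementary Table 4. It assumes that any gene overlapping the window counts as a candidate (the overlap rule is not stated here), and the gene names and coordinates are invented for illustration.

```python
from typing import NamedTuple

WINDOW_BP = 131_550  # half-width of the window reported for Supplementary Table 4

class Gene(NamedTuple):
    name: str
    chromosome: str
    start: int
    end: int

def genes_near_snp(genes, chromosome, position, window=WINDOW_BP):
    """Return genes on the same chromosome that overlap position +/- window."""
    low, high = position - window, position + window
    return [g for g in genes
            if g.chromosome == chromosome and g.start <= high and g.end >= low]

# Hypothetical gene models and SNP position.
genes = [
    Gene("gene_A", "Chr01", 1_000_000, 1_004_000),
    Gene("gene_B", "Chr01", 1_200_000, 1_203_000),
    Gene("gene_C", "Chr02", 1_000_000, 1_002_000),
]
hits = genes_near_snp(genes, "Chr01", 1_050_000)
print([g.name for g in hits])  # -> ['gene_A']
```

The resulting gene list would then feed the keyword, ortholog, co-expression, and conserved-domain steps listed above.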
Genomic prediction models
Two R packages, Bayesian multi-trait and multi-environment (BMTME) and Bayesian generalized linear regression (BGLR), were used in the analyses (Montesinos-Lopez et al., 2019; R Core Team, 2021; Perez-Rodriguez and Campos, 2022). These packages provide robust frameworks for modeling complex genetic traits across multiple traits and environments. Two statistical models were employed to evaluate the influence of various factors on the traits under study: a parametric method, Bayesian Ridge Regression (BRR), and a non-parametric method, Reproducing Kernel Hilbert Space (RKHS) regression. The BRR model assumes a normal prior distribution on marker effects and induces homogeneous shrinkage across markers (Perez et al., 2010). The RKHS model captures complex, non-linear relationships (Campos et al., 2013). Phenotypic data were analyzed together with SNP marker (genomic) data, pedigree information, or a combination of both. Genomic data were used to construct a Genomic Relationship Matrix (GRM), and pedigree information was used to derive the Numerator Relationship Matrix (NRM). The GRM and NRM were also combined to produce a hybrid matrix.
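To make the GRM step concrete, the sketch below builds a genomic relationship matrix from SNP genotypes coded 0/1/2. It assumes VanRaden's first method, a widely used choice; the formula actually used for this dataset is not stated here, and the small marker matrix is invented. The pedigree-based NRM and the hybrid matrix are not sketched.

```python
def vanraden_grm(markers):
    """Genomic relationship matrix from an individuals x markers 0/1/2 matrix.

    Assumed formula (VanRaden method 1):
    G = Z Z' / (2 * sum_j p_j * (1 - p_j)), where Z = M - 2p.
    Requires at least one polymorphic marker so the scale is non-zero.
    """
    n, m = len(markers), len(markers[0])
    p = [sum(row[j] for row in markers) / (2 * n) for j in range(m)]
    z = [[row[j] - 2 * p[j] for j in range(m)] for row in markers]
    scale = 2 * sum(pj * (1 - pj) for pj in p)
    return [[sum(z[a][j] * z[b][j] for j in range(m)) / scale
             for b in range(n)] for a in range(n)]

# Hypothetical genotypes: three individuals scored at three SNPs.
markers = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]
for row in vanraden_grm(markers):
    print([round(v, 3) for v in row])
```

The matrix is symmetric, and because the genotypes are centered on allele frequencies estimated from the same individuals, each row sums to zero.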
Model performance was assessed through random cross-validation with 10 partitions (NPartitions = 10), in which the dataset was repeatedly and randomly split into training (80%) and testing (20%) sets. For each partition, models were fitted on the training data and validated on the testing data. Sampling with replacement allowed some genotypes to appear in multiple partitions, reflecting breeding scenarios where lines are evaluated in some, but not all, environments (Montesinos-Lopez et al., 2019). Prediction accuracy was quantified using Pearson’s correlation coefficient (PCC), which measures how well the predicted genomic estimated breeding values track the observed phenotypic values in the testing set for each replicate, reflecting their linear relationship. Mean Absolute Error (MAE) quantified the average magnitude of the prediction errors, that is, the numerical deviation between predicted and observed values. To evaluate model performance on correlation and error jointly, the Combined Index for Correlation and Error (CICE) proposed by Pan et al. (2024) was used.
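The cross-validation loop and the two accuracy metrics can be sketched as follows. This is a simplified illustration, not the BMTME or BGLR workflow: a one-predictor least-squares line stands in for the BRR and RKHS models, the data are simulated, and all function names are hypothetical. CICE is omitted because its formula is not given here.

```python
import math
import random

def random_partitions(n, n_partitions=10, test_fraction=0.2, seed=1):
    """Yield (train, test) index lists for independent random splits.

    Each partition is drawn separately, so a genotype can fall in the
    testing set of more than one partition.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(n_partitions):
        order = rng.sample(range(n), n)
        yield order[n_test:], order[:n_test]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_absolute_error(obs, pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def fit_line(x, y):
    """Least-squares intercept and slope; a placeholder for BRR or RKHS."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

# Simulated data: one hypothetical genomic score per genotype plus noise.
rng = random.Random(42)
score = [rng.gauss(0, 1) for _ in range(100)]
pheno = [50 + 5 * s + rng.gauss(0, 2) for s in score]

results = []
for train, test in random_partitions(len(pheno)):
    a, b = fit_line([score[i] for i in train], [pheno[i] for i in train])
    pred = [a + b * score[i] for i in test]
    obs = [pheno[i] for i in test]
    results.append((pearson(obs, pred), mean_absolute_error(obs, pred)))

mean_pcc = sum(r[0] for r in results) / len(results)
mean_mae = sum(r[1] for r in results) / len(results)
print(f"PCC = {mean_pcc:.2f}, MAE = {mean_mae:.2f}")
```

PCC rewards predictions that rank genotypes correctly, while MAE penalizes deviations in the original units, so a model can score well on one and poorly on the other; this is the motivation for reporting a combined index alongside them.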
Access information
Other publicly accessible locations of the data:
- Mean Data for phenotype-based clustering, and diversity of common bean genotypes in seed iron concentration and cooking time: https://doi.org/10.7910/dvn/izspmi
