GBS data and morphological nut trait data from two populations of hazelnut (Corylus spp.)

Published Dec 12, 2025 on Dryad. https://doi.org/10.5061/dryad.ghx3ffc0z

Abstract

The native, perennial shrub American hazelnut (Corylus americana) is cultivated in the Midwestern U.S. for its significant ecological benefits, as well as its high-value nut crop. Developing improved varieties for the Upper Midwest requires validated quantitative genetic approaches to utilizing genomic data in both American hazelnut and interspecific hybrids typical of breeding programs. In addition, high-throughput phenotyping methods are essential to the efficient and accurate screening of large breeding populations. This study reports novel advances in both of these domains. Two populations of hazelnuts, one composed of C. americana and one composed of C. americana × C. avellana hybrids, were phenotyped over the course of two years in two locations using a digital imagery-based method for quantifying morphological nut and kernel traits. These data were used to perform both composite interval mapping (CIM), using a recently released genetic map, and genomic prediction, using a newly available chromosome-scale reference genome for C. americana. Multiple QTL were detected for all traits analyzed, with an average total R² of 52%. Marker-assisted genomic selection exhibited high prediction accuracy, with an average correlation coefficient between genotypic values and phenotypic observations of 0.78 across both environments. These results suggest that genomic prediction is a tenable method for improving genetic gain for highly polygenic traits in hazelnut breeding programs.

1. Study description

Developing improved hazelnut varieties for the Upper Midwest requires validated quantitative genetic approaches to utilizing genomic data in both American hazelnut (Corylus americana) and interspecific hybrids typical of breeding programs. High-throughput phenotyping methods are also essential for efficient and accurate screening of large breeding populations.

This study reports advances in both domains. Two populations of hazelnuts—one composed of C. americana and one composed of C. americana × C. avellana hybrids—were phenotyped over two years at two locations using a digital imagery–based method for quantifying morphological nut and kernel traits. These data were used to perform:

Composite interval mapping (CIM), using a recently released genetic map.
Genomic prediction, using a newly available chromosome-scale reference genome for C. americana.

Multiple QTL were detected for all traits analyzed (mean total R² ≈ 52%). Marker-assisted genomic selection exhibited high prediction accuracy (mean correlation ≈ 0.78 between genotypic values and phenotypic observations across environments). These results support genomic prediction as a viable method for improving genetic gain for highly polygenic traits in hazelnut breeding programs.

2. Dataset contents (top-level files)

All of the following files are located in the root of the Dryad dataset.

2.1 Genetic map R objects (linkage groups 1–11)

Gzipped R objects containing ordered and phased genotype data for each linkage group (LG), constructed with onemap and formatted for use with fullsibQTL:

chr1.rda.gz
chr2.rda.gz
chr3.rda.gz
chr4.rda.gz
chr5.rda.gz
chr6.rda.gz
chr7.rda.gz
chr8.rda.gz
chr9.rda.gz
chr10.rda.gz
chr11.rda.gz

Description:

Each file is a gzipped .rda (R data) object for a single linkage group (LG1–LG11). After gunzipping (for example, gunzip chr1.rda.gz), the object can be loaded in R with:

load("chr1.rda")  # after gunzipping

Each object contains:

Marker genotypes ordered along the linkage group.
Phased haplotypes as estimated by onemap.
Metadata required by fullsibQTL for QTL mapping.
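
The decompression step described above can also be scripted. The following is a minimal Python sketch, not part of the dataset: the function names `gunzip_file` and `gunzip_linkage_groups` are hypothetical, and it assumes the chr*.rda.gz files have been downloaded into a single directory.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src: Path) -> Path:
    """Decompress src (e.g. chr1.rda.gz) next to itself and return the new path."""
    dest = src.with_suffix("")  # chr1.rda.gz -> chr1.rda
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def gunzip_linkage_groups(directory: Path) -> list[Path]:
    """Decompress chr1.rda.gz ... chr11.rda.gz if they are present in directory."""
    outputs = []
    for lg in range(1, 12):
        src = directory / f"chr{lg}.rda.gz"
        if src.exists():
            outputs.append(gunzip_file(src))
    return outputs
```

After running `gunzip_linkage_groups(Path("."))`, each resulting chrN.rda file can be loaded in R with load() as shown above.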

2.2 QTL mapping input files (`fullsibQTL_input_files.zip`)

fullsibQTL_input_files.zip

Description:

Archive containing text and R input files required to run the fullsibQTL analyses described in the associated manuscript.

Contents (from ls -R fullsibQTL_input_files):

Eric-Jeff_map.txt
Text representation of the genetic map for the Eric × Jeff full-sib family, in a format suitable for fullsibQTL.
Eric-Jeff.vcf.gz
VCF file containing SNP genotypes for individuals in the Eric × Jeff population.
fullsibQTL_Eric-Jeff_phenos.csv
Phenotypic data for individuals in the mapping population.
- Rows: individuals.
- Columns: identifiers and trait values (nut and kernel morphological traits and any other traits used in QTL analyses).
  Column names correspond to trait abbreviations used in the manuscript.
onemap_Eric-Jeff.raw
Raw onemap input file used to build the genetic map and phased genotypes that underlie the chr*.rda objects.

These files, together with the chr*.rda.gz objects, allow reproduction of the fullsib QTL analyses.
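
To inspect the archive before unpacking it, a short Python sketch such as the one below may help. The helper names are hypothetical, and whether member names carry a leading folder prefix inside the zip is an assumption that should be checked against the actual listing.

```python
import zipfile
from pathlib import Path


def list_zip_contents(archive: Path) -> list[str]:
    """Return the member names stored in a zip archive without extracting it."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(zf.namelist())


def extract_member(archive: Path, member: str, dest: Path) -> Path:
    """Extract a single member into dest and return the path that was written."""
    with zipfile.ZipFile(archive) as zf:
        return Path(zf.extract(member, path=dest))
```

For example, `list_zip_contents(Path("fullsibQTL_input_files.zip"))` returns the stored names, any of which can then be passed to `extract_member`.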

2.3 Genomic prediction input files (`StageWise_input_files.zip`)

StageWise_input_files.zip

Description:

Archive containing CSV files used as inputs to the StageWise R package for multi-environment genomic prediction.

Contents (from ls -R StageWise_input_files):

gs_geno_mn.csv
Genomic marker data for the Minnesota (MN) environment.
- Rows: genotypes (individuals or entries).
- Columns: marker loci (and possibly ID columns).
gs_geno_wi.csv
Genomic marker data for the Wisconsin (WI) environment, with analogous structure to gs_geno_mn.csv.
mn_stage1.csv
Stage 1 input file for the MN environment.
- Typical columns: plot/individual identifier, genotype code, replication or block identifiers (if used), and trait values for MN.
wi_stage1.csv
Stage 1 input file for the WI environment, analogous to mn_stage1.csv.

Together, these CSVs provide the genotype and phenotype data required to reproduce the StageWise genomic prediction analyses.
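
Since the two genotype files are described as having analogous structure, a quick consistency check is to compare their headers. This is a minimal Python sketch under the assumption, taken from the description above, that marker loci are stored as columns; the function names are hypothetical and no specific column names are assumed.

```python
import csv
from pathlib import Path


def read_header(path: Path) -> list[str]:
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def shared_columns(path_a: Path, path_b: Path) -> list[str]:
    """Columns present in both files, in the order they appear in path_a."""
    in_b = set(read_header(path_b))
    return [name for name in read_header(path_a) if name in in_b]
```

Calling `shared_columns(Path("gs_geno_mn.csv"), Path("gs_geno_wi.csv"))` and comparing its length with each header's length shows whether the two files list the same columns.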

2.4 Image archives (raw and processed imagery)

Image datasets used for high-throughput phenotyping of nut and kernel morphology:

images_MN_2020.tar.gz – Minnesota site, 2020.
images_MN_2021.tar.gz – Minnesota site, 2021.
images_WI_2019.tar.gz – Wisconsin site, 2019.
images_WI_2020.tar.gz – Wisconsin site, 2020.

Each archive contains:

Original RGB PNG images of individual nuts and kernels.
Binary segmentation masks for each image.
Overlay images showing segmentation masks overlaid on the original image.
A nested directory structure that encodes year, field row, and plant.

These archives are large; users may wish to extract only subsets relevant to their analyses. See Section 3 for details on folder structure and file naming.
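
One way to extract only a subset is to filter archive members by plant folder name. The sketch below uses Python's standard tarfile module; the function name `extract_plant` is hypothetical, and it assumes the Row_<row>-Plant_<plant> folder naming described in Section 3.

```python
import tarfile
from pathlib import Path


def extract_plant(archive: Path, row: int, plant: int, dest: Path) -> list[str]:
    """Extract only members under Row_<row>-Plant_<plant>/ and return their names."""
    folder = f"Row_{row}-Plant_{plant}"
    with tarfile.open(archive, "r:gz") as tar:
        # Keep a member only if one of its path components is the plant folder.
        wanted = [m for m in tar.getmembers() if folder in Path(m.name).parts]
        tar.extractall(path=dest, members=wanted)
    return [m.name for m in wanted]
```

For example, `extract_plant(Path("images_MN_2020.tar.gz"), 3, 75, Path("subset"))` unpacks one plant's images. The archive index is still scanned in full, so the call is not instantaneous on a large file.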

3. Image data: folder structure and naming scheme

This section describes the organization of the PNG images within each images_*.tar.gz archive, using a few representative examples. The complete dataset follows the same pattern.

3.1 Directory structure (example: `images_MN_2020.tar.gz`)

After extraction (for example, tar -xvzf images_MN_2020.tar.gz), the directory structure is:

images_MN_2020/
  2020/
    Row_3-Plant_75/
      in-shell/
        {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png
        {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_2}{Scale_13.317}.png
        ...
        binary-masks/
          {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png
          {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_2}{Scale_13.317}.png
          ...
        binary-masks-overlay/
          {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png
          {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_2}{Scale_13.317}.png
          ...
      kernel/
        {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png
        ...
        binary-masks/
          {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png
          ...
        binary-masks-overlay/
          {UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png
          ...
    Row_3-Plant_76/
      in-shell/
        {UID_1089-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_76}{Nut_1}{Scale_13.141}.png
        ...
        binary-masks/
        binary-masks-overlay/
      kernel/
        {UID_1089-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_76}{Nut_1}{Scale_13.141}.png
        ...
        binary-masks/
        binary-masks-overlay/
    Row_4-Plant_2/
      in-shell/
        {UID_1092-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_4}{Plant_2}{Nut_1}{Scale_13.317}.png
        ...
        binary-masks/
        binary-masks-overlay/
      kernel/
        {UID_1092-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_4}{Plant_2}{Nut_12}{Scale_13.36}.png
        ...
        binary-masks/
        binary-masks-overlay/
    ...

Key points:

Year folder: e.g. 2020/.
Plant-level folders: Row_<row>-Plant_<plant>/ (for example, Row_3-Plant_75/).
Within each plant folder:
- in-shell/ – images of whole nuts in shell.
- kernel/ – images of kernels (after shell removal).
Within each of in-shell/ and kernel/:
- Original PNG images directly in the folder.
- binary-masks/ – binary segmentation masks corresponding to each PNG.
- binary-masks-overlay/ – PNGs showing the mask overlaid on the original image.

The MN 2021 and WI 2019/2020 archives use the same structure, with year, row, plant, and UID values adjusted accordingly.
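
The layout above can be traversed with a few lines of Python. This is an illustrative sketch only: `list_originals` is a hypothetical helper, and it assumes an archive has been extracted so that `root` points at a folder such as images_MN_2020/.

```python
from pathlib import Path


def list_originals(root: Path, kind: str = "in-shell") -> dict[str, list[Path]]:
    """Map each Row_*-Plant_* folder name to its original PNGs of one kind.

    kind is "in-shell" or "kernel". Only PNGs directly inside that folder are
    returned, so files in binary-masks/ and binary-masks-overlay/ are skipped.
    """
    result = {}
    # root/<year>/Row_<row>-Plant_<plant>/
    for plant_dir in sorted(root.glob("*/Row_*-Plant_*")):
        result[plant_dir.name] = sorted((plant_dir / kind).glob("*.png"))
    return result
```

The returned dictionary gives a per-plant image count via `len(paths)`, which is a convenient check that an extraction completed.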

3.2 File naming convention for PNG images

Each PNG filename encodes metadata in a series of brace-delimited fields. For example:

{UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png

The fields are:

{UID_1088-mn-2020}
Unique image identifier for that site and year.
{Location_Minn}
Location code (here, Minnesota).
{Genotype_14-8}
Genotype identifier in the breeding population.
{Year_2020}
Year of harvest/imaging.
{Row_3}
Field row number.
{Plant_75}
Plant number within the row.
{Nut_1}
Nut index from that plant (1-based).
{Scale_13.28}
Image scale factor (for example, pixel-to-linear-dimension conversion) used in the phenotyping pipeline; the numeric value is encoded for each image.
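
The brace-delimited fields can be recovered with a regular expression. The following is a minimal Python sketch; the function name and the choice to return all values as strings are illustrative, not part of the dataset.

```python
import re

# Each field looks like {Key_value}; the key ends at the first underscore.
FIELD_PATTERN = re.compile(r"\{([^_{}]+)_([^{}]*)\}")


def parse_image_name(filename: str) -> dict[str, str]:
    """Split a brace-delimited image filename into a {field: value} dict."""
    stem = filename[:-4] if filename.lower().endswith(".png") else filename
    return {key: value for key, value in FIELD_PATTERN.findall(stem)}


example = ("{UID_1088-mn-2020}{Location_Minn}{Genotype_14-8}{Year_2020}"
           "{Row_3}{Plant_75}{Nut_1}{Scale_13.28}.png")
fields = parse_image_name(example)
# fields["Genotype"] is "14-8" and float(fields["Scale"]) is 13.28
```

Values are left as strings because some fields, such as Genotype and UID, contain hyphens; numeric fields can be converted with int() or float() as needed.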

For every image, the same basename (same sequence of {...} fields) is used for:

The original PNG file in in-shell/ or kernel/.
The corresponding binary mask in binary-masks/.
The corresponding overlay image in binary-masks-overlay/.

Thus, files can be matched programmatically across folders by their identical basenames.
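
A minimal Python sketch of this matching is given below. The helper names are hypothetical; the only assumption is the folder layout described in Section 3.1, with binary-masks/ and binary-masks-overlay/ sitting inside the folder that holds the original image.

```python
from pathlib import Path


def companion_paths(original: Path) -> dict[str, Path]:
    """Return the expected mask and overlay paths for an original PNG."""
    folder = original.parent  # .../in-shell or .../kernel
    return {
        "original": original,
        "mask": folder / "binary-masks" / original.name,
        "overlay": folder / "binary-masks-overlay" / original.name,
    }


def missing_companions(original: Path) -> list[str]:
    """Names of companion files ("mask", "overlay") that do not exist on disk."""
    paths = companion_paths(original)
    return [kind for kind in ("mask", "overlay") if not paths[kind].exists()]
```

An empty list from `missing_companions` indicates that both the mask and the overlay are present for that image.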

4. Variable and data structure notes

Because the CSV and R objects are intended to be used directly with fullsibQTL, onemap, and StageWise, their internal structure follows the conventions of those packages.

4.1 `fullsibQTL_Eric-Jeff_phenos.csv`

Rows correspond to individuals in the Eric × Jeff mapping population.
Columns provide an identifier (matching genotype codes used in genotypic files) and trait values.
Trait column names correspond to trait abbreviations used in the manuscript (for example, nut and kernel size and shape metrics derived from image analysis).

4.2 `gs_geno_mn.csv` and `gs_geno_wi.csv`

Rows represent genotypes or entries used in genomic prediction.
Columns represent marker loci plus any necessary ID columns (for example, genotype name).
Marker coding and formatting are compatible with the StageWise workflow.

4.3 `mn_stage1.csv` and `wi_stage1.csv`

Rows represent observational units (plots or individuals) for each environment.
Columns include identifiers (plot or individual ID, genotype code), design factors (for example, replication or block, if used), and trait values.
Column names are aligned with the variable naming used in the manuscript and the StageWise documentation.

For precise trait definitions, units, and abbreviations, refer to the associated manuscript (tables and methods section), which provides the full description of each phenotypic variable.

5. Software and versions

The following software was used to generate and analyze the data:

R: https://www.r-project.org/
fullsibQTL: https://github.com/augusto-garcia/fullsibQTL
Used for QTL mapping in full-sib families.
onemap: https://github.com/augusto-garcia/onemap
Used for genetic map construction and phasing of marker data.
StageWise: https://github.com/jendelman/StageWise
Used for multi-environment genomic prediction.

To reproduce analyses:

Decompress the required data files (*.tar.gz, *.rda.gz).
Install R and the packages listed above.
Load the .rda objects and import the CSV files following the workflows described in the package vignettes and the associated manuscript.

6. Usage notes

The image archives are large; consider working with subsets (for example, specific rows/plants or environments) if storage or memory is limited.
The one-to-one mapping between original PNGs, masks, and overlay images allows flexible use in image analysis workflows (for example, recomputing or validating shape descriptors).
The genetic and phenotypic input files are structured to be immediately usable in fullsibQTL and StageWise; users familiar with these packages should be able to adapt the inputs to related analyses.

Phenotyping

Morphological characteristics of in-shell hazelnuts and shelled kernels were used as the trait data for this study. These data were collected primarily using an adapted version of the digital imagery acquisition and analysis pipeline reported by Hameed et al. (2018). In brief, bushes were completely harvested by hand in August 2020 and August 2021. Harvested clusters were dried in the greenhouse and husked, and 30 in-shell nuts were then randomly sampled per bush. These nuts were arranged on a 6 × 5 grid with a QR code, and a Nikon 5600 DSLR camera tethered to a desktop computer was used to acquire a single image. The OpenCV Python library was used to isolate each in-shell nut and produce a binary mask by applying a fixed HSV threshold to each pixel. An ellipse was then fit to each binary mask, and the lengths of its major and minor axes were calculated by converting pixel length to physical distance using a scale bar embedded in each image. Circularity of the nut was calculated as the ratio of these two lengths. A bulk weight for the subsample was then obtained, and each nut was individually cracked. Each kernel was then returned to the grid, preserving the original arrangement of the nuts; a second photo was acquired, and the same traits were calculated. Finally, a bulk weight for the kernels was also obtained. This allowed for both a volumetric and a gravimetric estimation of the percent kernel for each nut sampled. Python scripts for the image acquisition, processing, phenotyping, and file management are available at: https://github.com/shbrainard/hazelnut-phenotyping.
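
The trait arithmetic described above can be illustrated with a short Python sketch. It is not the authors' pipeline (see the repository linked above for that) and does not use OpenCV. It rests on three assumptions that the text does not confirm: the scale is expressed in pixels per millimetre, circularity is taken as minor axis over major axis, and gravimetric percent kernel is the kernel bulk weight divided by the in-shell bulk weight.

```python
def pixels_to_mm(length_px: float, px_per_mm: float) -> float:
    """Convert a pixel length to millimetres (assumes scale is pixels per mm)."""
    return length_px / px_per_mm


def circularity(major_mm: float, minor_mm: float) -> float:
    """Ratio of the minor to the major ellipse axis (assumed direction; 1.0 for a circle)."""
    return minor_mm / major_mm


def percent_kernel_by_weight(kernel_bulk_g: float, in_shell_bulk_g: float) -> float:
    """Gravimetric percent kernel for a subsample, from the two bulk weights (assumed definition)."""
    return 100.0 * kernel_bulk_g / in_shell_bulk_g
```

For instance, axis lengths of 265.6 and 199.2 pixels at a scale of 13.28 would convert to 20 mm and 15 mm under these assumptions, giving a circularity of 0.75.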

Sequencing and genotyping

Roughly 1 cm² of leaf tissue was sampled from each bush in May 2021, immediately following budbreak. Tissue was sampled into 96-well Qiagen Collection Microtubes (Qiagen N.V., Venlo, The Netherlands) and lyophilized using a Labconco 18 L freeze dryer set to 0.004 mbar for 72 hours. Freeze-dried tissue was then macerated. DNA extraction, quantification, library preparation, and sequencing were performed at the University of Wisconsin-Madison Biotechnology Center. Libraries were prepared for genotyping-by-sequencing using a double digestion with the restriction enzymes NsiI and BfaI, following the methodology described by Elshire et al. (2011). This combination was pre-selected based on an analysis of k-mer distributions of various enzyme digestions, where NsiI/BfaI was observed to maximize the k-mer diversity of the library. Illumina adapters and sample-specific barcodes were then annealed. Samples were multiplexed, and paired-end 150-bp sequence data were generated using an Illumina NovaSeq 6000, with an average of 10 million reads per sample. Trimming and demultiplexing of raw Illumina reads were performed using a custom Java application (https://github.com/shbrainard/gbsTools). Reads were aligned to the C. americana genome for ‘Winkler’ (Brainard et al. 2024).

Biallelic SNPs were used to calculate genomic-estimated breeding values (GEBVs), and were called using the TASSEL GBSv2 pipeline (Bradbury et al. 2007). SNPs were then filtered for missingness (<10% across all samples), linkage disequilibrium (r² < 0.75), and allele depth (80th quantile of samples having a depth >8). Since the Wisconsin population of wild seedlings was unstructured, SNPs were also filtered to exclude sites with minor allele frequency < 0.05. All filtering was performed using bcftools (Danecek et al. 2021). Markers were then subset to only include those that were retained in both the Wisconsin and Minnesota populations, leaving 44,961 SNPs.
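
To make the missingness and minor allele frequency thresholds concrete, the following Python sketch applies them to a single SNP. It is an illustration only: the actual filtering was done with bcftools, the linkage disequilibrium and depth filters are omitted, and the genotype coding (alternate-allele counts 0, 1, 2, with None for a missing call) and function names are assumptions.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # alternate-allele count 0, 1, 2, or None if missing


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of samples with no genotype call at one SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """Minor allele frequency at one biallelic SNP, ignoring missing calls."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def keep_snp(calls: Sequence[Genotype],
             max_missing: float = 0.10,
             min_maf: float = 0.05) -> bool:
    """Keep a SNP with missingness below 10% and MAF of at least 0.05."""
    return (missing_rate(calls) < max_missing
            and minor_allele_frequency(calls) >= min_maf)
```

A monomorphic site, or one with calls missing in a fifth of the samples, is rejected by `keep_snp` under these default thresholds.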

For the construction of the genetic map, haplotype-based markers were called using Stacks 2 (Rochette et al. 2019), which can identify multi-allelic markers using the phased nature of multiple indels or SNPs that appear within a single 150-bp paired-end read. Because such markers cannot be directly filtered for depth, the parameter ‘gt-alpha’ was increased to 0.01 as a method for ensuring genotype quality. Markers were then filtered for linkage disequilibrium (r² > 0.95) using bcftools. This generated a set of 78,079 markers, with an average of ~7,000 per chromosome.

Linkage map construction and QTL analysis

Since the Minnesota populations were constructed from controlled crosses between known hazelnut varieties, it was possible to build a genetic map using the R package onemap (Margarido et al. 2007) (https://github.com/augusto-garcia/onemap). This map was previously reported by Brainard et al. (2023). Briefly, markers called using Stacks 2 were first filtered to include only those of segregation types A1, A2, and B3.7 (following the notation of Wu et al. 2002), such that only markers with either three or four alleles remained. Next, markers for which more than 5% of all samples had no called genotype were removed, and two-point recombination frequencies were calculated for all possible phase configurations between all remaining markers using maximum likelihood. Maximum likelihood estimates were able to fully resolve the phase between pairs of markers, due to the fully informative segregation types that were used. A hierarchical clustering algorithm was used to construct linkage groups, and markers were ordered and phased within these groups using a Hidden Markov Model with an error rate of 0.05. Recombination frequencies were then converted to genetic distances using the Kosambi mapping function. Finally, this map was imported into the R package fullsibQTL (Gazaffi et al. 2020) (https://github.com/augusto-garcia/fullsibQTL), which was used to perform composite interval mapping.
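
For readers unfamiliar with the Kosambi mapping function mentioned above, the standard formula converts a recombination frequency r into a map distance of 25 · ln((1 + 2r) / (1 − 2r)) centimorgans. The Python sketch below implements that textbook formula and its inverse for illustration; onemap performs this conversion internally, and the function names here are hypothetical.

```python
import math


def kosambi_cm(r: float) -> float:
    """Map distance in centimorgans for recombination frequency r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination frequency must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(d_cm: float) -> float:
    """Inverse: recombination frequency for a map distance in centimorgans."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)
```

At small r the distance in centimorgans is close to 100 · r, and it grows faster than r as r approaches 0.5, reflecting the function's allowance for crossover interference.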

Calculation of GEBVs

To calculate genomic-estimated breeding values from the biallelic SNP dataset described above, the R package StageWise (Endelman 2023) (https://github.com/jendelman/StageWise) was used to compute variance components and best linear unbiased predictors (BLUPs) of additive genetic value. This software is designed to perform a two-stage analysis, first computing best linear unbiased estimators (BLUEs) for each genotype using a specified experimental design. Since the genotypes in both populations consisted of unreplicated seedlings, a fixed-effects linear model was used to first compute each genotype’s BLUE.
