Supplementary data from: Targeted population genomics uncovers demographic history and genetic divergence in North American wild cranberry
Data files
Apr 13, 2026 version files 187.42 MB
- README.md (17.18 KB)
- Supplemental_Data_S1.vcf.gz (155.02 MB)
- Supplemental_Table_S1.csv (197.68 KB)
- Supplemental_Table_S2.csv (8.60 KB)
- Supplemental_Table_S3.csv (763 B)
- Supplemental_Table_S4.csv (848.25 KB)
- Supplemental_Table_S5.csv (31.17 MB)
- Supplemental_Table_S6.csv (41.10 KB)
- Supplemental_Table_S7.csv (120.08 KB)
Abstract
Wild populations of North American cranberry (Vaccinium macrocarpon Aiton) are reservoirs of genetic variation that may contribute to the improvement of breeding-relevant traits. However, the extent to which wild genetic variation is geographically structured and represented in elite germplasm remains unclear. We analyzed 179 wild cranberry accessions from the upper Midwest and Eastern North America to estimate nucleotide diversity (π), population structure, and loci associated with genetic differentiation and environmental variables using a genome-informed targeted genotyping panel. Additionally, 14 demographic scenarios were evaluated using site-frequency-spectrum–based inference to identify historical events that could explain current genetic diversity. We observed extremely low nucleotide diversity within the targeted panel (π = 5 × 10⁻⁶). Rare allele distributions strongly influenced π and Tajima's D values, suggesting constrained diversity in the genomic regions assayed that is not captured by heterozygosity-based estimates alone. However, we interpreted these results as conservative lower bounds on genome-wide neutral diversity because the targeted panel is enriched for genic and conserved regions. A clear separation between the Midwest and East populations was observed, with inbreeding coefficients ranging from -0.13 to 0.15. Furthermore, site frequency spectrum inference from the targeted panel supported a demographic scenario consistent with a significant population reduction ≈15-14 thousand years ago (kya), followed by divergence between the two regions ≈12 kya and asymmetric gene flow ≈1.3 kya. We detected 254 candidate loci showing regional allele-frequency differentiation. Several of these loci colocalized with candidate genes linked to stress response, development, and metabolic processes.
To evaluate the representation of geographically differentiated wild alleles in a breeding context, we analyzed Rutgers breeding materials (n = 484) and found that this panel is enriched for common alleles in Eastern wild populations. These findings indicate regionally structured allele-frequency variation in wild cranberry, with potential relevance to environmental response and breeding. This study extends prior wild cranberry population-genetic research by providing targeted-panel estimates of diversity, comparisons of demographic models, and breeding insights on geographically differentiated alleles, while highlighting the importance of conserving wild cranberry germplasm for use in modern breeding programs.
Dataset DOI: 10.5061/dryad.m0cfxpph5
Description of the data and file structure
Supplemental_Table_S1.csv
Wild and breeding cranberry materials were analyzed in this study. The wild germplasm consists of 181 accessions collected in nine U.S. states and one Canadian province, including 179 Vaccinium macrocarpon and two Vaccinium oxycoccos. The breeding panel comprises 871 V. macrocarpon accessions derived from controlled crosses in the Rutgers University Cranberry and Blueberry Breeding Program. Entries marked “n/a” indicate missing data.
Variables
- Accession ID: Passport identifier.
- Family ID: Pedigree-based family identifier for Rutgers breeding populations.
- Species: Taxonomic designation of each accession, including Vaccinium macrocarpon and Vaccinium oxycoccos.
- Designation: Classification of each accession as wild or breeding material.
- Location: The site where the accession was collected.
- Province or State: Collection origin of each accession from Delaware (DE), Massachusetts (MA), Maine (ME), Michigan (MI), Minnesota (MN), New Brunswick, Canada (NB), New Jersey (NJ), New York (NY), Pennsylvania (PA), and Wisconsin (WI).
- Latitude: North-south geographic coordinate (°).
- Longitude: East-west geographic coordinate (°).
- River basin: Hydrologic region assigned to each population.
- Elevation: Topographic height (m) of each geographic coordinate.
- PC1/PC2: Scores of the first and second principal components from all V. macrocarpon accessions analyzed in the study (wild and Rutgers breeding material).
- Bio1: Annual mean temperature (°C).
- Bio2: Mean diurnal range (°C).
- Bio3: Isothermality.
- Bio4: Temperature seasonality (°C x 100).
- Bio5: Max. temperature of warmest month (°C).
- Bio6: Min. temperature of coldest month (°C).
- Bio7: Temperature annual range (°C).
- Bio8: Mean temperature of wettest quarter (°C).
- Bio9: Mean temperature of driest quarter (°C).
- Bio10: Mean temperature of warmest quarter (°C).
- Bio11: Mean temperature of coldest quarter (°C).
- Bio12: Annual precipitation (mm).
- Bio13: Precipitation of wettest month (mm).
- Bio14: Precipitation of driest month (mm).
- Bio15: Precipitation seasonality (coefficient of variation).
- Bio16: Precipitation of wettest quarter (mm).
- Bio17: Precipitation of driest quarter (mm).
- Bio18: Precipitation of warmest quarter (mm).
- Bio19: Precipitation of coldest quarter (mm).
- Bdod_0-5cm: Bulk density topsoil (cg/cm³).
- Bdod_5-15cm: Bulk density subsoil (cg/cm³).
- Cfvo_0-5cm: Coarse fragment content topsoil (cm³/dm³).
- Cfvo_5-15cm: Coarse fragment content subsoil (cm³/dm³).
- Clay_5-15cm: Clay subsoil (g/kg).
- Nitrogen_0-5cm: Total nitrogen topsoil (cg/kg).
- Nitrogen_5-15cm: Total nitrogen subsoil (cg/kg).
- Ocd_0-5cm: Organic carbon density topsoil (hg/m³).
- Ocd_5-15cm: Organic carbon density subsoil (hg/m³).
- Phh2o_0-15cm: pH topsoil (pH x 10).
- Phh2o_5-15cm: pH subsoil (pH x 10).
- Sand_0-15cm: Sand topsoil (g/kg).
- Sand_5-15cm: Sand subsoil (g/kg).
- Silt_0-5cm: Silt topsoil (g/kg).
- Silt_5-15cm: Silt subsoil (g/kg).
- Soc_0-5cm: Soil organic carbon topsoil (dg/kg).
- Soc_5-15cm: Soil organic carbon subsoil (dg/kg).
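As a minimal sketch of how the conventions above might be handled when reading the table, the following Python snippet treats "n/a" as missing and rescales the columns described as stored at ×10 or ×100. The sample rows, accession IDs, and exact header spellings are hypothetical stand-ins, not values taken from the CSV.

```python
import csv
import io

# Hypothetical three-row excerpt. Header names follow the variable list above,
# but their exact spelling in the real CSV is an assumption.
SAMPLE = """Accession ID,Designation,Bio1,Bio4,Phh2o_5-15cm
ACC_001,wild,7.5,1012.4,48
ACC_002,wild,n/a,n/a,n/a
ACC_003,breeding,n/a,n/a,n/a
"""

# Columns described above as stored at value x 10 (pH) or value x 100 (Bio4).
SCALE = {"Phh2o_5-15cm": 10.0, "Bio4": 100.0}


def parse_value(column, raw):
    """Return None for 'n/a', a rescaled float for numeric text, else the text."""
    if raw.strip().lower() == "n/a":
        return None
    try:
        value = float(raw)
    except ValueError:
        return raw  # non-numeric fields such as IDs or locations stay as text
    return value / SCALE.get(column, 1.0)


def load_rows(handle):
    """Read a CSV handle into a list of dicts with parsed values."""
    return [
        {column: parse_value(column, raw) for column, raw in row.items()}
        for row in csv.DictReader(handle)
    ]


rows = load_rows(io.StringIO(SAMPLE))
```

Note that this sketch converts any numeric-looking text to a float, so purely numeric identifiers would need to be excluded from conversion.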
Supplemental_Table_S2.csv
Introgression statistics from Dsuite at the population level. Introgression signals detected across all group comparisons using the ABBA-BABA framework.
Variables
- P1, P2, P3: Populations used in Dsuite D-statistic tests, where P1 and P2 are sister populations, and P3 is the population tested for excess allele sharing.
- D-statistic: Patterson’s D-statistic measuring excess allele sharing between P3 and either P1 or P2.
- Z-score: Standardized test statistic used to assess the significance of the D-statistic.
- p-value: Two-sided p-value associated with the Z-score.
- f4-ratio: Estimated proportion of introgressed ancestry based on the f4-ratio statistic.
- BBAA, ABBA, BABA: Counts of site patterns used to calculate the D-statistic, where BBAA indicates shared derived alleles between P1 and P2, ABBA between P2 and P3, and BABA between P1 and P3.
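The relationship between the count columns and the D-statistic can be sketched as follows, using the standard definition D = (ABBA − BABA) / (ABBA + BABA) and a two-sided normal p-value for a given Z-score. The counts and Z-score below are hypothetical. The Z-score itself cannot be recomputed from the counts alone, because it depends on a resampling-based standard error that is not part of this table.

```python
import math


def d_statistic(abba, baba):
    """Patterson's D from ABBA and BABA site-pattern counts."""
    total = abba + baba
    if total == 0:
        raise ValueError("ABBA + BABA must be positive")
    return (abba - baba) / total


def two_sided_p(z):
    """Two-sided p-value for a Z-score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts and Z-score, for illustration only.
d = d_statistic(abba=120.0, baba=80.0)
p = two_sided_p(3.0)
```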
Supplemental_Table_S3.csv
Introgression statistics from Dsuite at the river basin level. Introgression signals detected across all group comparisons using the ABBA-BABA framework.
Variables
- P1, P2, P3: Populations used in Dsuite D-statistic tests, where P1 and P2 are sister populations, and P3 is the population tested for excess allele sharing.
- D-statistic: Patterson’s D-statistic measuring excess allele sharing between P3 and either P1 or P2.
- Z-score: Standardized test statistic used to assess the significance of the D-statistic.
- p-value: Two-sided p-value associated with the Z-score.
- f4-ratio: Estimated proportion of introgressed ancestry based on the f4-ratio statistic.
- BBAA, ABBA, BABA: Counts of site patterns used to calculate the D-statistic, where BBAA indicates shared derived alleles between P1 and P2, ABBA between P2 and P3, and BABA between P1 and P3.
Supplemental_Table_S4.csv
Genome-wide scan results for genetic differentiation (GST) and spatial ancestry (SPA) analysis.
Variables
- Marker: Single nucleotide polymorphisms (SNPs) mapped from the reference genome 'Ben Lear' v1.0.
- Chr: Chromosome location of the SNPs in the reference genome.
- Pos (bp): Physical position of the SNPs in the reference genome.
- SPA: Statistic score representing allele frequency gradients in geographic space.
- GST: Genetic differentiation values among populations used as input for genome-wide scans.
Supplemental_Table_S5.csv
Genome-wide scan results for environmental factors obtained from North American wild cranberry locations. The entries marked "n/a" represent missing data.
Variables
- Marker: Single nucleotide polymorphisms (SNPs) mapped from the reference genome 'Ben Lear' v1.0.
- Trait: Environmental variables associated with genetic markers (see Supplemental Table S1).
- Model: Genome-wide association models tested in this study, including mod1 (K), mod2 (K + latitude), mod3 (K + longitude), mod4 (K + latitude + longitude), mod5 (K + Q (1 PC)), mod6 (K + Q (1 PC) + latitude), mod7 (K + Q (1 PC) + longitude), mod8 (K + Q (1 PC) + latitude + longitude).
- Score: Strength of the association between marker and environmental trait.
- P_value: Statistical significance of the marker-environment association.
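Because this table appears to be in long format (one row per marker, trait, and model), a reader may want to collapse it before inspection. The sketch below keeps the row with the smallest P_value for each marker-trait pair and skips "n/a" entries. The marker names and values are hypothetical, and choosing the minimum across models is only an illustration of working with the layout, not the selection procedure used in the study.

```python
import csv
import io

# Hypothetical excerpt using the column names listed above.
SAMPLE = """Marker,Trait,Model,Score,P_value
Vmac_chr01_100,Bio1,mod1,2.1,0.0079
Vmac_chr01_100,Bio1,mod5,3.4,0.0004
Vmac_chr01_100,Bio12,mod1,n/a,n/a
Vmac_chr02_200,Bio1,mod4,1.2,0.0631
"""


def best_model_per_pair(handle):
    """Map (Marker, Trait) -> (Model, P_value) with the smallest P_value."""
    best = {}
    for row in csv.DictReader(handle):
        if row["P_value"].strip().lower() == "n/a":
            continue  # no result recorded for this marker, trait, and model
        key = (row["Marker"], row["Trait"])
        p_value = float(row["P_value"])
        if key not in best or p_value < best[key][1]:
            best[key] = (row["Model"], p_value)
    return best


best = best_model_per_pair(io.StringIO(SAMPLE))
```

Streaming rows through `csv.DictReader` rather than loading the whole file keeps memory use modest for a table of this size (about 31 MB).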
Supplemental_Table_S6.csv
Significant genome-wide associations detected for genetic differentiation (GST) and geographic-environmental factors. The entries marked "n/a" represent missing data.
Variables
- Trait: Variables used for genome-wide scans, including GST, spatial ancestry (SPA), and environmental factors that describe geographic, climatic, and soil conditions (Supplemental Table S1). Multiple traits listed together indicate overlapping genomic signals.
- Category: Variables were classified into temperature (bio1-bio11), precipitation (bio12-bio19), soil (bdod, cfvo, clay, nitrogen, ocd, phh2o, sand, silt, soc), elevation, GST and SPA to facilitate the interpretation of genome-wide associations.
- Model: Gene action at the specific locus, including additive, dominant for the reference allele (1-dom-ref), and dominant for the alternate allele (1-dom-alt).
- Marker: Single nucleotide polymorphisms (SNPs) mapped from the reference genome 'Ben Lear' v1.0.
- Chr: Chromosome location of the SNPs in the reference genome.
- Pos (bp): Physical position of the SNPs in the reference genome.
- Reference / Alternate: Nucleotides representing the reference and alternate alleles at the SNPs.
- Gene ID: Identifier of the predicted cranberry gene associated with the lead SNP based on SnpEff analysis.
- Region: Genomic location of the lead SNP relative to the gene model (intergenic, intronic, splicing or exonic).
- Impact: Predicted functional effect of the SNP on the gene as classified by SnpEff (moderate, low, or modifier).
- AGI homolog: Arabidopsis Gene Identifier (AGI) of the closest Arabidopsis thaliana homolog inferred by sequence similarity.
- Identity (%): Percentage of amino acid sequence identity between the cranberry gene and its Arabidopsis homolog.
- Description: Functional annotation of the gene based on Arabidopsis homolog information.
- DE: Allele frequency of the reference allele in the Delaware (DE) population.
- MA: Allele frequency of the reference allele in the Massachusetts (MA) population.
- ME: Allele frequency of the reference allele in the Maine (ME) population.
- MI: Allele frequency of the reference allele in the Michigan (MI) population.
- MN: Allele frequency of the reference allele in the Minnesota (MN) population.
- NB: Allele frequency of the reference allele in the New Brunswick, Canada (NB) population.
- NJ: Allele frequency of the reference allele in the New Jersey (NJ) population.
- NY: Allele frequency of the reference allele in the New York (NY) population.
- PA: Allele frequency of the reference allele in the Pennsylvania (PA) population.
- RU: Allele frequency of the reference allele in the Rutgers (RU) population.
- WI: Allele frequency of the reference allele in the Wisconsin (WI) population.
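To illustrate how per-population reference allele frequencies like those in the columns above relate to a differentiation statistic, the sketch below computes an unweighted Nei's GST, (HT − HS) / HT, for a single biallelic locus. The frequencies are hypothetical, and the estimator and any sample-size weighting used for the GST values in this dataset may differ.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def gst(ref_freqs):
    """Unweighted Nei's G_ST for one biallelic locus.

    ref_freqs: reference allele frequencies, one value per population.
    """
    if not ref_freqs:
        raise ValueError("at least one population is required")
    n = len(ref_freqs)
    h_s = sum(expected_heterozygosity(p) for p in ref_freqs) / n  # within populations
    p_bar = sum(ref_freqs) / n
    h_t = expected_heterozygosity(p_bar)  # pooled (total) population
    if h_t == 0.0:
        return 0.0  # locus is fixed everywhere, so there is no differentiation
    return (h_t - h_s) / h_t


# Hypothetical frequencies for six wild populations (RU is left out here
# because it is the breeding panel rather than a wild population).
example = {"MI": 0.9, "MN": 0.9, "WI": 0.9, "NJ": 0.1, "MA": 0.1, "ME": 0.1}
example_gst = gst(list(example.values()))
```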
Supplemental_Table_S7.csv
Candidate genes located within ±17 kb of 186 lead SNPs associated with genetic differentiation (GST) and geographic–environmental factors in wild cranberry. The entries marked "n/a" represent missing data.
Variables
- Marker: Single nucleotide polymorphisms (SNPs) mapped from the reference genome 'Ben Lear' v1.0.
- Chr: Chromosome location of the SNPs in the reference genome.
- Pos (bp): Physical position of the SNPs in the reference genome.
- Gene ID: Identifier of the predicted cranberry gene located within ±17 kb of the lead SNP.
- Start/End (bp): Genomic start and end coordinates of the annotated gene model in the reference genome 'Ben Lear' v1.0.
- AGI homolog: Arabidopsis Gene Identifier (AGI) of the closest Arabidopsis thaliana homolog inferred by sequence similarity.
- Identity (%): Percentage of amino acid sequence identity between the cranberry gene and its Arabidopsis homolog.
- Description: Functional annotation of the gene based on Arabidopsis homolog information.
- GO term: Gene Ontology (GO) annotation describing the predicted biological process, molecular function, and/or cellular component of the gene.
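The ±17 kb criterion can be checked against the Pos (bp) and Start/End (bp) columns with a simple interval test. This sketch assumes that "within ±17 kb" means any overlap between the gene model and the window around the lead SNP; a stricter reading that requires the whole gene to fall inside the window is also possible. The coordinates in the usage lines are hypothetical.

```python
WINDOW_BP = 17_000  # half-width of the window around each lead SNP


def gene_in_window(snp_pos, gene_start, gene_end, window=WINDOW_BP):
    """True if the gene overlaps [snp_pos - window, snp_pos + window]."""
    low, high = sorted((gene_start, gene_end))  # tolerate reversed coordinates
    return low <= snp_pos + window and high >= snp_pos - window


# Hypothetical coordinates, for illustration only.
overlapping = gene_in_window(100_000, 116_000, 120_000)
outside = gene_in_window(100_000, 117_001, 120_000)
```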
Supplemental_Data_S1.vcf.gz
Variant call format (VCF) file containing all detected single-nucleotide polymorphisms (SNPs) and insertion–deletion variants (INDELs) with minor allele frequency (MAF) ≥ 0.01 identified in this study. Accession identifiers and metadata are provided in Supplemental Table S1.
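A gzipped VCF can be streamed with the Python standard library alone. The sketch below computes a per-site minor allele frequency from the GT field, using a tiny in-memory file as a stand-in for Supplemental_Data_S1.vcf.gz. The chromosome names, sample IDs, and genotypes are hypothetical, and the function only counts alleles 0 and 1, so multiallelic records would need separate handling.

```python
import gzip
import io


def minor_allele_frequency(genotypes):
    """MAF at a biallelic site from GT strings such as '0/1'; './.' is skipped."""
    ref = alt = 0
    for genotype in genotypes:
        for allele in genotype.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            elif allele == "1":
                alt += 1
    called = ref + alt
    return min(ref, alt) / called if called else None


def iter_maf(handle):
    """Yield (chrom, pos, maf) for each record of a text-mode VCF handle."""
    for line in handle:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        gt_index = fields[8].split(":").index("GT")
        genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
        yield fields[0], int(fields[1]), minor_allele_frequency(genotypes)


# Hypothetical two-record VCF, compressed in memory for the example.
vcf_text = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    "Vmac_chr01\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/0\t0/1\t./.",
    "Vmac_chr02\t2000\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/0:10\t0|1:8\t1/1:9\t0/0:7",
]) + "\n"
buffer = io.BytesIO(gzip.compress(vcf_text.encode("utf-8")))

with gzip.open(buffer, "rt") as handle:
    results = list(iter_maf(handle))
```

For the real file, `gzip.open("Supplemental_Data_S1.vcf.gz", "rt")` would replace the in-memory buffer.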
Supplemental Fig.S1. Distribution of minor allele frequency (MAF) in wild cranberry. The density distribution shows the frequency spectrum of 30,818 SNPs within the dataset, including common (MAF > 0.05) and rare (0.01 ≤ MAF < 0.05) alleles.
Supplemental Fig.S2. Genomic distribution of single-nucleotide polymorphisms (SNPs) across the cranberry genome. Physical distribution of 74,231 SNPs with minor allele frequency (MAF) ≥ 0.01 across the 12 chromosomes of the reference genome 'Ben Lear' v1.0. Each vertical line represents the genomic position of an individual SNP detected across all wild cranberry populations analyzed in this study.
Supplemental Fig.S3. Distribution of nucleotide diversity (π) in wild cranberry populations. π density plot of wild cranberry populations collected from Delaware (DE), Massachusetts (MA), Maine (ME), Michigan (MI), Minnesota (MN), New Brunswick, Canada (NB), New Jersey (NJ), New York (NY), Pennsylvania (PA), and Wisconsin (WI).
Supplemental Fig.S4. Analysis of ancestry proportions in wild cranberry accessions. The bar plots represent admixture levels (k) in accessions collected from Delaware (DE), Massachusetts (MA), Maine (ME), Michigan (MI), Minnesota (MN), New Brunswick, Canada (NB), New Jersey (NJ), New York (NY), Pennsylvania (PA), and Wisconsin (WI).
Supplemental Fig.S5. Cross-validation (CV) error for the ten ancestry levels (K) evaluated in wild cranberry. A) CV error for K = 1 to 10 clusters. B) Percentage decrease in CV error relative to the previous K. The optimal clustering solution was identified at K = 4, where further reductions in CV error became negligible.
Supplemental Fig.S6. Consensus Neighbor-Joining phylogenetic tree of 179 Vaccinium macrocarpon samples generated from 1000 bootstrap replicates. Leaf colors represent the cranberry population. The bars surrounding the nodes indicate the group (Western or Eastern) to which each population belongs. Only nodes with bootstrap support values > 50% are shown. Two samples of Vaccinium oxycoccos (grey leaf colors) were used as the outgroup in this analysis.
Supplemental Fig.S7. Scheme of the demographic models tested to infer the divergence between wild cranberry populations in the Midwest and Eastern North America. Fourteen demographic models (Models A-N) were tested, all of which included a divergence time parameter (TDIV) as their main component. Model A included divergence alone, Model B added constant migration (mK), Model C added population growth, and Model D combined divergence with both migration and population growth. Models E through H added a population bottleneck (TBOT) to different combinations of divergence, migration, and growth. Models I and J tested the timing of migration, with recent migration (mR) in Model I and early migration (mE) in Model J. Models K and L combined divergence with directional migration and population growth, testing Eastern (K) and Midwestern (L) migration patterns. The final two models (M and N) combined a population bottleneck with divergence and either recent or early migration.
Supplemental Fig.S8. Akaike Information Criterion (AIC) comparison of demographic models for Midwest–East divergence in wild cranberry. The box plot shows the AIC distribution of 14 demographic scenarios evaluated in this study (Supplemental Fig.S7), including models with population bottlenecks, divergence, growth, and migration. The run with the highest composite log-likelihood per model was re-simulated 100 times to approximate the likelihood distribution and evaluate overlap between models using AIC distributions.
Supplemental Fig.S9. Geographic locations of river basin groups and genetic introgression signals. A) Map of river basin group locations, with each point representing a wild cranberry population. B) Heatmap showing the introgression detected using the Fbranch statistic (fb) for specific clades in the phylogenetic tree. The Y-axis shows the phylogeny at the river basin group level, rooted with Vaccinium oxycoccos (outgroup). The nodes represent the river basin groups: Upper Mississippi (UM), New England (NE), Souris-Red-Rainy (SR), Mid Atlantic (MA), and Great Lakes (GL). Darker red corresponds to higher values of fb (0–1), while gray indicates the absence of introgression. The blue dotted lines indicate the internal nodes representing the most recent common ancestors.
Supplemental Fig.S10. Geographic variation in reference allele frequencies across Midwestern and Eastern wild cranberry populations. Reference allele frequencies for loci identified by genetic differentiation (GST) and spatial ancestry (SPA) analyses on chromosomes 1 (Vmac_chr01_39700535), 3 (Vmac_chr03_8535639), 4 (Vmac_chr04_34367988), and 11 (Vmac_chr11_1489270), as well as for loci associated with organic carbon density in subsoil (Vmac_chr05_33388302), mean temperature of the coldest quarter (Vmac_chr10_15584947), and precipitation of the driest month (Vmac_chr06_33703561). Frequencies are shown for wild populations collected from Delaware (DE), Massachusetts (MA), Maine (ME), Michigan (MI), Minnesota (MN), New Brunswick, Canada (NB), New Jersey (NJ), New York (NY), Pennsylvania (PA), and Wisconsin (WI), along with elite germplasm from the Rutgers University breeding program (RU). Reference alleles were obtained from the 'Ben Lear' v1.0 reference genome.
Supplemental Fig.S11. Principal component analysis (PCA) of wild and Rutgers breeding cranberry germplasm. Population structure between wild cranberry populations and Rutgers breeding material was analyzed through a PCA based on 7,523 genome-wide single nucleotide polymorphisms (SNPs). Detailed information on accessions is provided in Supplemental Table S1.
Supplemental Fig.S12. Gene Ontology (GO) enrichment of cellular component terms in temperature-associated genomic regions. GO cellular component terms showing significant enrichment for temperature-associated loci at a false discovery rate (FDR) ≤ 0.05. Bars represent the number of genes associated with each GO term, and color indicates the adjusted p-value.
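The AIC comparison described for Supplemental Fig.S8 rests on the standard formula AIC = 2k − 2 ln L, where k is the number of free parameters and ln L is the natural-log likelihood. The sketch below applies it to two models and reports each model's difference from the best one. The log-likelihoods and parameter counts are hypothetical, not values from this study, and a likelihood reported in log10 units would first need to be multiplied by ln(10).

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion from a natural-log likelihood."""
    return 2.0 * n_params - 2.0 * log_likelihood


def delta_aic(scores):
    """Difference between each model's AIC and the lowest AIC in the set."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}


# Hypothetical (log-likelihood, parameter count) pairs, for illustration only.
models = {"A": (-1050.0, 3), "B": (-1040.0, 5)}
scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
deltas = delta_aic(scores)
```

Lower AIC indicates a better trade-off between fit and complexity, so the model with a delta of zero is the preferred one in this toy comparison.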
