Genomic structure and ex situ conservation of the North American grapevine Vitis labrusca
Data files
Nov 24, 2025 version files 122.76 MB
- filtered_chr1_19_cluster2_maf01_ld_pca.eigenval (159 B)
- filtered_chr1_19_cluster2_maf01_ld_pca.eigenvec (40.77 KB)
- filtered_chr1_19_cluster2_maf01_ld.map (858.39 KB)
- filtered_chr1_19_cluster2_maf01_ld.ped (13.80 MB)
- filtered_chr1_19_cluster2_maf01.map (1.23 MB)
- filtered_chr1_19_cluster2_maf01.ped (19.82 MB)
- filtered_chr1_19_ld_pca.eigenval (159 B)
- filtered_chr1_19_ld_pca.eigenvec (61.72 KB)
- filtered_chr1_19_ld.map (1.47 MB)
- filtered_chr1_19_ld.ped (34.98 MB)
- filtered_chr1_19.map (2.03 MB)
- filtered_chr1_19.ped (48.46 MB)
- README.md (3.39 KB)
Abstract
Vitis labrusca, a North American wild grapevine, is an important source of disease resistance and climate resilience traits for grape breeding, yet its genomic diversity is incompletely represented in ex situ germplasm collections. We genotyped 314 accessions, which included material conserved at the USDA germplasm collection and newly sampled wild individuals. Accessions were genotyped using genotyping-by-sequencing, and after imputation and filtering, we identified a total of 44,701 SNPs. Within the accessions genotyped, we identified extensive mislabelling and hybridization, with approximately one-third of accessions classified as putative hybrids. We also detected genetically distinct populations from Virginia and North Carolina that are not currently conserved. These results reveal significant geographic and genomic gaps in ex situ conservation of V. labrusca and highlight priority regions for future sampling to better safeguard this species for breeding and research.
Dataset DOI: 10.5061/dryad.s7h44j1n1
Description of the data and file structure
We examined 314 accessions of V. labrusca, including accessions conserved in the USDA germplasm collection as well as wild-collected accessions. Using genotyping-by-sequencing, we identified over 44,000 genetic markers, which we used to determine population structure, potential mislabelling or hybridization, and the extent to which ex situ conserved accessions adequately capture this species.
Files and variables
File: filtered_chr1_19_cluster2_maf01_ld_pca.eigenvec
Description: PCA eigenvectors (per-accession scores on each principal component) for accessions assigned > 0.90 to Cluster 2 (n = 183) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) using 18,849 SNPs
File: filtered_chr1_19_cluster2_maf01.map
Description: SNP map file (PLINK format) for accessions assigned > 0.90 to Cluster 2 (n = 183) after MAF 1 % filter for a total of 27,071 SNPs
File: filtered_chr1_19_cluster2_maf01_ld_pca.eigenval
Description: PCA eigenvalues (one per principal component) for accessions assigned > 0.90 to Cluster 2 (n = 183) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) using 18,849 SNPs
File: filtered_chr1_19_ld_pca.eigenvec
Description: PCA eigenvectors (per-accession scores on each principal component) for all accessions (n = 271) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) using 32,267 SNPs
File: filtered_chr1_19_ld_pca.eigenval
Description: PCA eigenvalues (one per principal component) for all accessions (n = 271) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) using 32,267 SNPs
File: filtered_chr1_19_cluster2_maf01_ld.map
Description: SNP map file (PLINK format) for accessions assigned > 0.90 to Cluster 2 (n = 183) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) for a total of 18,849 SNPs
File: filtered_chr1_19_ld.map
Description: SNP map file (PLINK format) for all accessions (n = 271) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) for a total of 32,267 SNPs
File: filtered_chr1_19_cluster2_maf01.ped
Description: SNP ped file (PLINK format) for accessions assigned > 0.90 to Cluster 2 (n = 183) after MAF 1 % filter for a total of 27,071 SNPs
File: filtered_chr1_19_cluster2_maf01_ld.ped
Description: SNP ped file (PLINK format) for accessions assigned > 0.90 to Cluster 2 (n = 183) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) for a total of 18,849 SNPs
File: filtered_chr1_19_ld.ped
Description: SNP ped file (PLINK format) for all accessions (n = 271) after MAF 1 % filter and LD-pruning (--indep-pairwise 10 3 0.5) for a total of 32,267 SNPs
File: filtered_chr1_19.map
Description: SNP map file (PLINK format) for all accessions (n = 271) after MAF 1 % filter for a total of 44,701 SNPs
File: filtered_chr1_19.ped
Description: SNP ped file (PLINK format) for all accessions (n = 271) after MAF 1 % filter for a total of 44,701 SNPs
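The PCA files listed above can be read with a few lines of Python. The following is a minimal sketch, not the authors' code. It assumes the .eigenvec files are whitespace-delimited with either a header line beginning with "#" (for example "#FID IID PC1 ...") or no header and FID/IID in the first two columns, and that the .eigenval files hold one eigenvalue per line. The function names are hypothetical, and the toy strings stand in for the real files.

```python
from io import StringIO


def read_eigenvec(handle):
    """Parse eigenvector rows into {sample_id: [PC1, PC2, ...]}."""
    iid_col, first_pc = 1, 2  # assumption for headerless files: FID, IID, then PCs
    scores = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#"):
            # Header such as "#FID IID PC1 ..." or "#IID PC1 ...".
            names = [name.lstrip("#") for name in fields]
            iid_col = names.index("IID")
            first_pc = next(i for i, name in enumerate(names) if name.startswith("PC"))
            continue
        scores[fields[iid_col]] = [float(value) for value in fields[first_pc:]]
    return scores


def read_eigenval(handle):
    """Read one eigenvalue per non-empty line."""
    return [float(line) for line in handle if line.strip()]


def share_of_reported_variance(eigenvalues):
    """Each eigenvalue as a fraction of the sum of the eigenvalues provided."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


# Toy stand-ins for a real .eigenvec / .eigenval pair (illustrative values only).
toy_eigenvec = "#FID IID PC1 PC2\nfam1 s1 0.1 -0.2\nfam2 s2 -0.3 0.4\n"
toy_eigenval = "3.0\n1.0\n"

scores = read_eigenvec(StringIO(toy_eigenvec))
shares = share_of_reported_variance(read_eigenval(StringIO(toy_eigenval)))
```

To use it on the dataset, pass an open file handle (for example for filtered_chr1_19_ld_pca.eigenvec) instead of the StringIO objects. The shares are relative to the eigenvalues stored in the file only. If the file holds just the leading components, they are not fractions of total genetic variance.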
Code/software
Files are provided in PLINK (v2.00a5.8) text format (.ped/.map).
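For readers who prefer not to install PLINK, the text format is simple enough to parse directly. The sketch below assumes the standard PLINK 1 layout: a four-column .map line (chromosome, variant ID, genetic distance, base-pair position) and a .ped line with six leading columns followed by two allele columns per SNP, with "0" marking a missing allele. The function names are hypothetical, and the layout of these particular files should be checked before relying on it.

```python
def parse_map_line(line):
    """Return (chromosome, variant_id, cm_position, bp_position) from a 4-column .map line."""
    chrom, variant_id, cm, bp = line.split()
    return chrom, variant_id, float(cm), int(bp)


def parse_ped_line(line):
    """Split a .ped line into sample metadata and a list of (allele1, allele2) pairs."""
    fields = line.split()
    keys = ("fid", "iid", "father", "mother", "sex", "phenotype")
    meta = dict(zip(keys, fields[:6]))
    alleles = fields[6:]
    if len(alleles) % 2:
        raise ValueError("expected two allele columns per SNP")
    # Pair consecutive allele columns: columns 7-8 are SNP 1, 9-10 are SNP 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return meta, genotypes


# Toy lines in the assumed layout (not taken from the dataset).
variant = parse_map_line("1 snp1 0 12345")
meta, genotypes = parse_ped_line("F1 S1 0 0 0 -9 A A A G 0 0")
```

The order of genotype pairs in each .ped row follows the order of rows in the matching .map file, so the two files must be read together.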
Access information
Data were derived from the following source:
- Migicovsky et al. (2025). Genomic structure and ex situ conservation of the North American grapevine Vitis labrusca. Plants, People, Planet.
A total of 314 unique samples were sequenced, 11 of which were sequenced twice. Sequencing data were processed and single nucleotide polymorphisms (SNPs) were identified using the TASSEL 5 GBS v2 Pipeline v5.2.94 (Bradbury et al., 2007). During this process, replicate samples with the same IDs were automatically merged. Default parameters were used for all steps except the initial GBSSeqToTagDBPlugin() step, where a minimum quality score (-mnQS) of 20 was required. Reads were aligned to version 1 of the V. labrusca GREM4 reference genome (Li & Gschwend, 2023) using BWA v0.7.18 (Li & Durbin, 2009). A minimum minor allele frequency (MAF) of 0.01 was set for SNP discovery in TASSEL, and a total of 154,638 SNPs were identified. These SNPs were filtered using VCFtools v0.1.16 (Danecek et al., 2011) to remove indels, retain only biallelic sites, and apply a MAF threshold of 0.01, resulting in 145,155 SNPs across the 314 accessions.
Imputation was performed using LinkImputeR (Money et al., 2017), with input filters of MAF 0.01, a maximum missingness by position of 70%, and a Hardy-Weinberg equilibrium filter of p < 0.0001. Parameters tested included depths of 2 to 6 (in steps of 1) and maximum missingness per SNP and per sample of 0.2 to 0.6 (in steps of 0.1). The selected imputation case applied the filters PositionMiss(0.6) and SampleMiss(0.6), resulting in an accuracy of 0.935 and a correlation value of 0.8091. After imputation, 271 accessions and 46,916 SNPs remained. VCFtools v0.1.16 (Danecek et al., 2011) was used to remove SNPs located on unassembled contigs in the V. labrusca genome, resulting in a final filtered dataset of 44,701 SNPs.
PLINK v2.00a5.8 (Chang et al., 2015; Purcell & Chang, 2025) was used to prune the imputed SNP table for linkage disequilibrium (LD). LD-pruning was performed by considering a window of 10 SNPs, removing one SNP from each pair with r² > 0.5, then shifting the window by 3 SNPs and repeating the procedure (--indep-pairwise 10 3 0.5). The resulting dataset contained 271 accessions and 32,267 SNPs.
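The windowed pruning described above can be illustrated with a simplified sketch. This is not PLINK's implementation. It assumes genotypes are coded as 0/1/2 allele dosages with no missing values, measures LD as the squared Pearson correlation between dosage vectors, and always drops the later SNP of a correlated pair, whereas PLINK applies its own rule for which SNP to remove. The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP has undefined correlation; treat as unlinked
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(dosages, window=10, step=3, threshold=0.5):
    """Return indices of SNPs kept after greedy windowed pairwise pruning.

    dosages: list of per-SNP lists, in genomic order, one value per sample.
    """
    keep = [True] * len(dosages)
    for start in range(0, len(dosages), step):
        stop = min(start + window, len(dosages))
        in_window = [i for i in range(start, stop) if keep[i]]
        for pos, i in enumerate(in_window):
            if not keep[i]:
                continue  # removed earlier within this same window
            for j in in_window[pos + 1:]:
                if keep[j] and r_squared(dosages[i], dosages[j]) > threshold:
                    keep[j] = False  # simplification: always drop the later SNP
    return [i for i, kept in enumerate(keep) if kept]


# Toy data: SNP 1 duplicates SNP 0; SNPs 2 and 3 are weakly correlated with the rest.
toy_dosages = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
    [0, 2, 1, 2, 0, 1],
]
kept = ld_prune(toy_dosages)
```

In the toy data only the duplicated SNP is removed, since every other pair has r² well below 0.5. The sketch also ignores chromosome boundaries, which a real pruning run must respect.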
Accessions with an ancestry proportion >0.90 for Cluster 2 (n = 183) were retained as putative “pure” V. labrusca. These accessions were extracted using PLINK v2.00a5.8 (Chang et al., 2015; Purcell & Chang, 2025) and a MAF filter of 0.01 was then applied, leaving 27,071 SNPs. LD-pruning was repeated as described above, and the final dataset contained 18,849 SNPs and 183 accessions.
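The two filters in this step, an ancestry threshold followed by a MAF filter, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' workflow, which used PLINK. The ancestry table is a hypothetical dictionary mapping each accession to a list of cluster proportions, with Cluster 2 assumed to sit at index 1, and genotypes are assumed to be biallelic allele pairs with "0" for missing alleles.

```python
from collections import Counter


def select_cluster_members(ancestry, cluster_index, threshold=0.90):
    """Return accession IDs whose proportion for the given cluster exceeds the threshold."""
    return [
        sample
        for sample, proportions in ancestry.items()
        if proportions[cluster_index] > threshold
    ]


def minor_allele_frequency(genotypes, missing="0"):
    """MAF of one biallelic SNP from (allele1, allele2) pairs, ignoring missing alleles."""
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # all missing or monomorphic
    return min(counts.values()) / total


# Hypothetical ancestry proportions: [Cluster 1, Cluster 2] per accession.
toy_ancestry = {"acc_a": [0.05, 0.95], "acc_b": [0.40, 0.60], "acc_c": [0.10, 0.90]}
retained = select_cluster_members(toy_ancestry, cluster_index=1)

# One toy SNP across four samples: alleles A x4 and G x2, one sample missing.
toy_snp = [("A", "A"), ("A", "G"), ("0", "0"), ("G", "A")]
maf = minor_allele_frequency(toy_snp)
passes_filter = maf >= 0.01
```

MAF must be recomputed after subsetting because alleles that were rare but present across all 271 accessions can become monomorphic or fall below 0.01 within the 183 retained accessions, which is consistent with the SNP count dropping from 44,701 to 27,071.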
References
Bradbury, P.J., Zhang, Z., Kroon, D.E., Casstevens, T.M., Ramdoss, Y. & Buckler, E.S. (2007). TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23(19), pp. 2633–2635.
Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M. & Lee, J.J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4(1), s13742-015-0047-8. https://doi.org/10.1186/s13742-015-0047-8
Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. & 1000 Genomes Project Analysis Group (2011). The variant call format and VCFtools. Bioinformatics, 27(15), pp. 2156–2158. https://doi.org/10.1093/bioinformatics/btr330
Li, B. & Gschwend, A.R. (2023). Vitis labrusca genome assembly reveals diversification between wild and cultivated grapevine genomes. Frontiers in Plant Science, 14. https://doi.org/10.3389/fpls.2023.1234130
Li, H. & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), pp. 1754–1760. https://doi.org/10.1093/bioinformatics/btp324
Money, D., Migicovsky, Z., Gardner, K. & Myles, S. (2017). LinkImputeR: user-guided genotype calling and imputation for non-model organisms. BMC Genomics, 18(1), p. 523. https://doi.org/10.1186/s12864-017-3873-5
Purcell, S. & Chang, C. (2025). PLINK v2.00a5.8. www.cog-genomics.org/plink/2.0/
