Data from: Targeted genotyping-by-sequencing of potato and data analysis with R/polyBreedR
Data files
Version of May 24, 2024 (files: 60.79 MB)
Abstract
“Mid-density” targeted genotyping-by-sequencing (GBS) combines trait-specific markers with thousands of genomic markers at an attractive price for linkage mapping and genomic selection. A 2.5K targeted GBS assay for potato was developed using the DArTag™ technology and later expanded to 4K targets. Genomic markers were selected from the potato Infinium™ SNP array to maximize genome coverage and polymorphism rates. The DArTag and SNP array platforms produced equivalent dendrograms in a test set of 298 tetraploid samples, and 83% of the common markers showed good quantitative agreement, with RMSE (root-mean-squared-error) less than 0.5. DArTag is suited for genomic selection candidates in the clonal evaluation trial, coupled with imputation to a higher-density platform for the training population. Using the software polyBreedR, an R package for the manipulation and analysis of polyploid marker data, the RMSE for imputation by linkage analysis was 0.15 in a small half-diallel population (N=85), which was significantly lower than the RMSE of 0.42 with the Random Forest method. Regarding high-value traits, the DArTag markers for resistance to potato virus Y, golden cyst nematode, and potato wart appeared to track their targets successfully, as did multi-allelic markers for maturity and tuber shape. In summary, the potato DArTag assay is a transformative and publicly available technology for potato breeding and genetics.
README: Data from "Targeted genotyping-by-sequencing of potato and data analysis with R/polyBreedR"
https://doi.org/10.5061/dryad.8pk0p2nw4
File S1. Potato DArTag V1 data for 703 samples in Variant Call Format (VCF). The V1 panel had 2503 markers.
File S2. Metadata for the samples in File S1 (CSV). First column (id) matches the sample ID in File S1. The second column (year) is the submission year: 2020, 2021, or 2022.
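As an illustration of how Files S1 and S2 fit together, the following Python sketch reads the sample IDs from a VCF header and groups them by submission year using the `id` and `year` columns described above. The toy data, the sample names (`s1` to `s4`), and the function names are hypothetical; only the VCF convention that sample columns follow the nine fixed columns is assumed.

```python
import csv
import io

# Hypothetical toy data standing in for File S1 (VCF) and File S2 (CSV).
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "s1", "s2", "s3"]
record = ["chr01", "100", "m1", "A", "G", ".", ".", ".", "GT",
          "0/0/0/1", "0/0/1/1", "./././."]
toy_vcf = "##fileformat=VCFv4.3\n" + "\t".join(header) + "\n" + "\t".join(record) + "\n"
toy_meta = "id,year\ns1,2020\ns2,2021\ns3,2022\ns4,2022\n"


def vcf_samples(handle):
    """Return the sample IDs from the #CHROM header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            # Sample columns follow the nine fixed VCF columns.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def samples_by_year(vcf_handle, meta_handle):
    """Group the VCF samples by the 'year' column of the metadata."""
    ids = set(vcf_samples(vcf_handle))
    by_year = {}
    for row in csv.DictReader(meta_handle):
        if row["id"] in ids:  # ignore metadata rows absent from the VCF
            by_year.setdefault(int(row["year"]), []).append(row["id"])
    return by_year


by_year = samples_by_year(io.StringIO(toy_vcf), io.StringIO(toy_meta))
print(by_year)
```

With the real files, the `io.StringIO` objects would be replaced by open file handles.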
File S3. Potato DArTag V2 data for 375 samples in Variant Call Format (VCF). The V2 panel had 3915 markers.
File S4. DArTag Missing Allele Discovery Counts for the samples in File S3 (CSV). The first three columns are "AlleleID", "CloneID", and "AlleleSequence". CloneID is the marker name, AlleleID is the haplotype name, and AlleleSequence is the 81 bp haplotype sequence. The first row contains the plate id, and the second row contains the sample id.
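Because File S4 has two header rows, a generic CSV reader needs some care. The sketch below shows one way to parse such a layout in Python. It assumes that the plate ids in row 1 sit above the sample columns, and that row 2 holds the three column names followed by the sample ids. The toy data use shortened sequences and hypothetical marker, sample, and plate names.

```python
import csv
import io

# Hypothetical toy data; real AlleleSequence entries are 81 bp long.
toy_madc = (
    ",,,plate1,plate1,plate2\n"
    "AlleleID,CloneID,AlleleSequence,s1,s2,s3\n"
    "m1|Ref,m1,ACGT,10,0,5\n"
    "m1|Alt,m1,ACGA,2,12,5\n"
)


def read_madc(handle):
    """Parse a two-header-row count table into plate and count dictionaries."""
    reader = csv.reader(handle)
    plate_row = next(reader)   # row 1: plate id for each sample column
    header = next(reader)      # row 2: column names, then sample ids
    samples = header[3:]
    plates = dict(zip(samples, plate_row[3:]))
    counts = {}
    for row in reader:
        if not row:
            continue           # skip blank lines
        allele_id, clone_id, sequence = row[:3]
        counts[allele_id] = {
            "marker": clone_id,
            "sequence": sequence,
            "counts": dict(zip(samples, (int(v) for v in row[3:]))),
        }
    return plates, counts


plates, counts = read_madc(io.StringIO(toy_madc))
print(plates["s3"], counts["m1|Alt"]["counts"]["s2"])
```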
File S5. Potato V4 SNP array data for 298 samples and 15,187 markers, in Variant Call Format (VCF).
File S6. Pedigree file for a five-parent, half-diallel population within the V1 DArTag dataset (CSV). The file format follows that required for the software PolyOrigin (https://github.com/chaozhi/PolyOrigin.jl). Column 1 is the sample id. Column 2 is the population number (0 = parent). Columns 3 and 4 are the two parents. Column 5 is the ploidy.
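The five-column layout can be summarized with a short Python sketch that separates the parents (population 0) from the offspring and groups the offspring by parent pair. The column headers, the sample names, and the use of `NA` for the parents of founders are assumptions made for this toy example and should be checked against the actual file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical toy pedigree: id, population, parent 1, parent 2, ploidy.
toy_ped = (
    "id,population,parent1,parent2,ploidy\n"
    "P1,0,NA,NA,4\n"
    "P2,0,NA,NA,4\n"
    "P3,0,NA,NA,4\n"
    "F1,1,P1,P2,4\n"
    "F2,1,P1,P2,4\n"
    "F3,2,P1,P3,4\n"
)


def summarize_pedigree(handle):
    """Return the list of parents and a map from parent pair to offspring."""
    reader = csv.reader(handle)
    next(reader)                      # skip the header row
    parents = []
    families = defaultdict(list)
    for row in reader:
        if not row:
            continue                  # skip blank lines
        sample_id, population, p1, p2, _ploidy = row
        if int(population) == 0:      # population 0 marks a parent
            parents.append(sample_id)
        else:
            families[(p1, p2)].append(sample_id)
    return parents, dict(families)


parents, families = summarize_pedigree(io.StringIO(toy_ped))
print(parents, families)
```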
File S7. Phased parental genotypes for the five-parent, half-diallel population in File S6 (CSV). Columns 1-4 contain the map (marker, chromosome, position in cM, position in bp). Columns 5-9 are the phased genotypes in Variant Call Format.
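In VCF, a phased genotype separates the alleles with `|`, for example `0|1|1|0` for a tetraploid. A minimal Python sketch for converting such a string to an ALT allele dosage follows. It assumes bi-allelic markers coded 0 (REF) and 1 (ALT) and `.` for missing alleles; the function name is hypothetical.

```python
def alt_dosage(genotype):
    """Count ALT alleles in a phased VCF genotype string such as '0|1|1|0'."""
    alleles = genotype.split("|")
    if "." in alleles:
        return None                      # treat any missing allele as missing
    return sum(allele == "1" for allele in alleles)


print(alt_dosage("0|1|1|0"))
```

The haplotype order within the string carries the phase information, which a dosage discards, so this conversion is only appropriate when phase is not needed.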
File S8. Pedigree file for a six-parent, partial diallel population in the V2 DArTag dataset (CSV). The file format follows that required for the software diaQTL (https://github.com/jendelman/diaQTL). Column 1 is the sample id. Columns 2 and 3 are the parents.
File S9. Parental genotype probabilities for the diallel population in File S8 (CSV). The file format follows that required for the software diaQTL (https://github.com/jendelman/diaQTL). Columns 1-4 are the map (marker, chrom, cM, bp), followed by one column per sample, with the sample id as the column header.
File S10. Trait marker phenotypes for the partial diallel population in File S8 (CSV). Column 1 is the sample id. Columns 2 and 3 indicate presence (Y) or absence (N) of the ALT allele for the markers.
File S11. Sequence alignment and percent identity matrix for OFP20 (DOCX), generated using MUSCLE v3.8.
File S12. Marker concordance between V1 DArTag and the SNP array (CSV). Columns 1-3 are the map (marker, chromosome, position), followed by columns "RMSE.4x" for root-mean-squared-error of tetraploid genotypes; "CE.4x" for classification error of tetraploid genotypes; and "CE.2x" for classification error of pseudo-diploid genotypes.
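The three concordance metrics can be illustrated with a small Python sketch for a single marker. The definitions here are assumptions, not taken from this README: RMSE is computed on allele dosages (0 to 4), classification error is the fraction of samples whose genotype calls differ, and the pseudo-diploid coding collapses dosages 1 to 3 into a single heterozygous class. The dosage vectors are made-up toy values.

```python
import math


def _paired(x, y):
    """Keep only samples with a non-missing value on both platforms."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]


def rmse(x, y):
    """Root-mean-squared-error between two dosage vectors."""
    pairs = _paired(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))


def classification_error(x, y):
    """Fraction of samples whose genotype classes differ."""
    pairs = _paired(x, y)
    return sum(a != b for a, b in pairs) / len(pairs)


def pseudo_diploid(dosage, ploidy=4):
    """Assumed coding: 0 -> 0, any heterozygote -> 1, fully ALT -> 2."""
    if dosage is None:
        return None
    if dosage == 0:
        return 0
    return 2 if dosage == ploidy else 1


# Toy tetraploid dosages for one marker on two platforms (None = missing).
dartag = [0, 1, 2, 4, 3, None]
array = [0, 2, 2, 4, 1, 1]

rmse_4x = rmse(dartag, array)
ce_4x = classification_error(dartag, array)
ce_2x = classification_error(
    [pseudo_diploid(d) for d in dartag],
    [pseudo_diploid(d) for d in array],
)
print(rmse_4x, ce_4x, ce_2x)
```

In this toy example the two platforms disagree on the dosage of two heterozygous samples, so the tetraploid metrics are non-zero while the pseudo-diploid classification error is zero.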
File S13. Marker concordance between V2 DArTag and the SNP array (CSV). Columns 1-3 are the map (marker, chromosome, position), followed by column "CE.2x" for classification error of pseudo-diploid genotypes.