Data from: Comparing methods for SNP calling from Genotyping-By-Sequencing (GBS) data for a large-genome conifer without a published genome sequence
Cite this dataset
Shu, Mengjun; Moran, Emily (2021). Data from: Comparing methods for SNP calling from Genotyping-By-Sequencing (GBS) data for a large-genome conifer without a published genome sequence [Dataset]. Dryad. https://doi.org/10.5061/dryad.6fv8fb4
Reduced-representation, restriction-enzyme-based sequencing methods have been shown to be robust and cost-effective for identifying Single Nucleotide Polymorphisms (SNPs). While aligning the short-read fragments to a genome sequence of the same species yields better SNP calling than de novo approaches, only a few tree species, and few conifers in particular, have an annotated genome sequence. Many conifer genomes are huge (>19 Gb) and include a large proportion of repeat sequences, making assembly difficult. While the sequence of a related species could be used as a reference, choosing the proper pipeline for SNP calling is still challenging. Here we compare the performance of four bioinformatics pipelines: two that require a reference genome (TASSEL-GBS V2 and reference-based Stacks) and two de novo pipelines (UNEAK and de novo Stacks). We used Illumina GBS data from 94 ponderosa pines. Using the loblolly pine genome as the reference greatly increased the number of SNPs called (2.1-2.7 million SNPs, vs. 62-196 thousand for the de novo pipelines). UNEAK was the fastest pipeline and identified more SNPs than Stacks de novo. Reference-based Stacks produced the highest number of SNPs with the lowest proportion of paralogs, while TASSEL-GBS V2 exhibited the highest proportion of paralogs. The Stacks reference-based approach produced the best results overall, while UNEAK was the better de novo method. However, all four pipelines had distinct benefits and limitations. Differences in observed and expected heterozygosity between the SNP sets generated by the pipelines could lead to different conclusions when those SNP sets are used for population genetic analyses.
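To make the heterozygosity comparison concrete, the following is a minimal sketch of how observed and expected heterozygosity might be computed for a single biallelic SNP. It is not taken from the dataset or from any of the four pipelines. The function name `heterozygosity` and the genotype encoding (0, 1, or 2 copies of the alternate allele, `None` for a missing call) are assumptions made for illustration, and expected heterozygosity is taken as the simple Hardy-Weinberg value 2p(1 - p) with no small-sample correction.

```python
def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one biallelic SNP.

    genotypes: iterable of alternate-allele counts per individual
               (0, 1, or 2), with None marking a missing call.
               This encoding is an assumption for illustration.
    """
    # Drop individuals with no genotype call at this SNP.
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("no called genotypes at this SNP")

    # Observed heterozygosity: fraction of called individuals that are heterozygous.
    h_obs = sum(1 for g in called if g == 1) / n

    # Alternate-allele frequency among the 2n sampled chromosomes.
    p = sum(called) / (2 * n)

    # Expected heterozygosity under Hardy-Weinberg proportions.
    h_exp = 2 * p * (1 - p)
    return h_obs, h_exp


if __name__ == "__main__":
    # Hypothetical genotype vectors, not values from the dataset.
    print(heterozygosity([0, 1, 1, 2, None]))
    print(heterozygosity([1, 1, 1, 1]))
```

A locus at which every individual appears heterozygous, as in the second hypothetical call, has observed heterozygosity well above the expected value. That pattern is commonly associated with paralogous sequences collapsed into one locus, which is one way the paralog proportions reported above could feed into the heterozygosity differences between SNP sets.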