GBS SNP datasets from "Genotyping-by-sequencing resolves relationships in Polygonaceae tribe Eriogoneae", TAXON
Data files
May 17, 2021 version files 280.46 MB
Pearman_et_al_Dryad.tar.gz
Abstract
The data come from genotyping-by-sequencing (GBS) of samples from plants in the tribe Eriogoneae (Polygonaceae). The data were used in the evolutionary and population genomic analyses reported in Pearman et al., "Genotyping-by-sequencing resolves relationships in Polygonaceae tribe Eriogoneae", published in TAXON. The analyses reported data from 51 species in the genera Chorizanthe, Eriogonum and Sidotheca. The data are organized by the analyses in which they were used. In most cases, README files and R scripts are included that detail the filtering that was conducted to produce the SNP data sets used in each case.
Methods
Whole genomic DNA was isolated with the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany) by SGIKer genomic services at the University of the Basque Country, Leioa, Spain. Samples were first homogenized using 1.4 mm ceramic beads in a Precellys 24 (Bertin Instruments, Montigny-le-Bretonneux, France) and then extracted following kit instructions. Libraries for GBS (Elshire & al., 2011) were prepared at the Centre Nacional d'Anàlisi Genòmica (CNAG, Barcelona, Spain). A single restriction enzyme, ApeKI, was used to cut genomic DNA during library construction. ApeKI recognizes a degenerate 5-base-pair sequence (GCWGC, where W is A or T). Barcodes were designed to allow for two sequencing errors without confusion of samples. Paired-end sequencing of 678 samples representing the genera Eriogonum, Chorizanthe and Sidotheca was conducted on Illumina HiSeq machines with a read length of 101 base pairs.
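As an illustration of the degenerate ApeKI recognition sequence described above, the following minimal Python sketch locates GCWGC matches in a DNA string. It is not part of the deposited scripts; the function name `find_apeki_sites` and the example sequence are hypothetical and chosen only for illustration.

```python
import re

# W is the IUPAC ambiguity code for A or T, so GCWGC expands to GCAGC or GCTGC.
# A zero-width lookahead is used so that overlapping matches are also reported.
APEKI_SITE = re.compile(r"(?=GC[AT]GC)")


def find_apeki_sites(sequence):
    """Return 0-based start positions of every GCWGC match (hypothetical helper)."""
    return [match.start() for match in APEKI_SITE.finditer(sequence.upper())]


# Made-up example sequence: two matching sites and one non-matching GCGGC.
example = "ttGCAGCaaGCTGCccGCGGC"
print(find_apeki_sites(example))
```

Because the complement of W is again W, GCWGC reads the same on both strands, so scanning one strand is sufficient to find all recognition sites.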
Sequences from each run were parsed based on presence of the enzyme remnant cut site and in-line barcodes with GBS-SNP-CROP v.4.0 (Melo & al., 2016), and trimming based on quality and adapters was performed with GBS-SNP-CROP and Trimmomatic v.0.36 (Bolger & al., 2014). We accepted reads with a minimum Phred quality score of 20. These parsed and quality-filtered reads were demultiplexed according to their in-line barcode, and a pair of FASTQ files was produced for each sample with GBS-SNP-CROP. A mock reference was built with a de novo assembly method based on sequence similarity using PEAR v.0.9.6 and VSEARCH v.1.1.3 (Zhang & al., 2014; Rognes & al., 2016). Reads were aligned against this reference using BWA-MEM v.0.7.12, and mapped reads were filtered with SAMtools v.1.2 (Li & Durbin, 2009; Li & al., 2009). Only properly paired, primary aligned reads were kept to produce an mpileup file for each sample. Variant calling was done using the GBS-SNP-CROP pipeline, including a series of filters. We discarded SNPs with more than two alleles and all loci with read depths less than 6 or greater than 100 (to reduce the potential for confounding of non-orthologous loci). The resulting data on 699,331 SNPs and 678 samples were exported to PLINK v.1.9 (Chang & al., 2015) files for further filtering. SNPs with greater than 50% missingness across samples were removed, as were individuals with greater than 50% missing genotype data at the remaining SNPs. Additional details are provided in readme files and RStudio scripts.
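The filtering thresholds described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GBS-SNP-CROP or PLINK implementation: the function names, the toy genotype table, and the use of `None` for a missing call are all hypothetical, and the text does not state whether the depth limits apply per sample or to a summary across samples.

```python
def passes_site_filters(n_alleles, depth, min_depth=6, max_depth=100):
    """Keep a site only if it has at most two alleles and 6 <= depth <= 100.

    Assumption: 'depth' is whatever depth value the pipeline evaluates.
    """
    return n_alleles <= 2 and min_depth <= depth <= max_depth


def filter_by_missingness(genotypes, max_missing=0.5):
    """Two-step missingness filter on a dict of sample -> list of calls.

    Step 1 drops SNPs missing in more than max_missing of samples.
    Step 2 drops samples missing more than max_missing of the remaining SNPs.
    """
    samples = list(genotypes)
    n_snps = len(genotypes[samples[0]]) if samples else 0

    kept_snps = [
        j for j in range(n_snps)
        if sum(genotypes[s][j] is None for s in samples) / len(samples) <= max_missing
    ]
    if not kept_snps:
        return [], []

    kept_samples = [
        s for s in samples
        if sum(genotypes[s][j] is None for j in kept_snps) / len(kept_snps) <= max_missing
    ]
    return kept_snps, kept_samples


# Toy data: four samples, four SNPs; None marks a missing genotype call.
toy = {
    "s1": [0, 1, None, 2],
    "s2": [0, None, None, 1],
    "s3": [None, None, None, None],
    "s4": [1, 1, None, 0],
}
kept_snps, kept_samples = filter_by_missingness(toy)
print(kept_snps, kept_samples)
```

Note that the order matters: sample missingness is computed only over the SNPs that survive the first step, which mirrors the sequence described in the text.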
Usage notes
The data are a mix of ready-to-use files and files that need further filtering before analysis. The online files associated with the publication will be useful for working with the data sets.