Podarcis bocagei vs P. carbonelli hybrid zone SNP datasets from ddRADseq
Data files
Jun 02, 2023 version files 46.78 MB
- bocagei_carbonelli_80-20_dataset_2300loci.vcf
- bocagei_carbonelli_complete_dataset_6905loci_replicates.vcf
- bocagei_carbonelli_complete_dataset_6905loci.vcf
- bocagei_carbonelli_diagnostic_dataset_1241loci.vcf
- popmap.txt
- README.md
Abstract
We used double-digest restriction-site associated DNA (ddRAD) sequencing to discover SNPs in samples across a transect including a hybrid zone between Podarcis bocagei and Podarcis carbonelli. We used P. bocagei and P. carbonelli samples from the locations at the extremes of the transect as references. We obtained a SNP dataset including all SNPs that remained after removing loci with depth of coverage <8, with missing data >20%, containing more than five SNPs, or with more than 70% heterozygosity (complete dataset; 6905 SNPs, 329 individuals). Additionally, we derived two other datasets from the complete dataset, prior to applying a missing data filter. One dataset contained loci with allele frequencies higher than 0.8 in the reference population containing only parental individuals of one species and lower than 0.2 in the reference population of the other species ("80/20" dataset; 2300 SNPs, 329 individuals). The other dataset comprised diagnostic SNPs between reference populations (diagnostic dataset; 1241 SNPs, 236 individuals), excluding private alleles from references, i.e. excluding alleles that are not present in the populations of contact. Individuals with missing data >35% were removed from all datasets (the number of individuals reported for each dataset is the number after applying this filter, but note that the 80/20 and the diagnostic datasets were obtained before applying this filter to the complete dataset). Across datasets, average depth of coverage by individual was 28 (median = 26.8, min = 12.5, max = 85.8) and by locus was 29 (median = 28.8, min = 15.6, max = 48.6). The analysis of replicate samples (four samples were replicated twice, i.e. they were amplified and sequenced in independent libraries and SNP calling was performed independently) showed high multilocus genotype replicability (99.87%).
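The 80/20 and diagnostic criteria described above can be illustrated with a minimal Python sketch. This is not the authors' filtering script. The function names, the coding of genotypes as alternate-allele counts, and the reading of "diagnostic" as alleles fixed for alternative states in the two reference populations are assumptions made for illustration. The exclusion of private alleles is not covered.

```python
def allele_frequency(genotypes):
    """Alternate-allele frequency among called diploid genotypes.

    genotypes: alt-allele counts per individual (0, 1 or 2); None marks missing data.
    Returns None when no genotype is called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def passes_80_20(freq_a, freq_b, high=0.8, low=0.2):
    """True if the allele is >high in one reference population and <low in the other."""
    if freq_a is None or freq_b is None:
        return False
    return (freq_a > high and freq_b < low) or (freq_b > high and freq_a < low)


def is_diagnostic(freq_a, freq_b):
    """True if the allele is fixed in one reference and absent in the other (assumed definition)."""
    return {freq_a, freq_b} == {0.0, 1.0}


# Hypothetical genotypes at one SNP in each reference population.
bocagei_ref = [2, 2, 2, 1, 2]
carbonelli_ref = [0, 0, 0, 0, 1]
fa = allele_frequency(bocagei_ref)
fb = allele_frequency(carbonelli_ref)
print(passes_80_20(fa, fb), is_diagnostic(fa, fb))
```

Under these assumptions, every diagnostic SNP also satisfies the 80/20 criterion, while the reverse does not hold.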
Methods
Samples were collected between spring and autumn of 2013 in a contact zone between Podarcis bocagei and P. carbonelli. We collected samples across a transect with 8 locations, including the hybrid zone. The sampling scheme aimed to capture all the individuals encountered, avoiding bias towards species, sex or age. We used 20 P. bocagei and 23 P. carbonelli samples from the locations at the extremes of the transect as references. We obtained SNP datasets from ddRAD sequencing of 356 samples by preparing one library following the modifications described by Brelsford et al. (2016) to the protocols from Parchman et al. (2013), Peterson et al. (2012) and Purcell et al. (2014), which was sequenced on an Illumina® HiSeq 2000. Individual raw reads were demultiplexed using the process_radtags module of Stacks version 2.2 (Catchen et al., 2013). SNP calling was performed using the Stacks pipeline, following the recommendations of Rochette and Catchen (2017), by consecutively running the ustacks (build loci), cstacks (create a catalogue of loci), sstacks (match individual samples against the catalogue), tsv2bam (transpose data) and gstacks (align each read to a locus and call SNPs) units. SNP filtering was done with the populations unit from Stacks, VCFtools 0.1.15 (Danecek et al., 2011) and a custom Python script (available at https://github.com/catpinho/filter_RADseq_data).
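The order of the Stacks units listed above can be sketched as a small Python helper that only builds the command lines, without running them. The function name, directory names, and the `.fq.gz` read file extension are hypothetical. Only the basic input/output options are shown, and the assembly parameters and filtering thresholds actually used are not stated here, so none are included.

```python
def stacks_commands(samples, popmap="popmap.txt", reads_dir="samples", out_dir="stacks"):
    """Build (not run) the de novo Stacks commands in the order described above.

    samples: sample names; reads are assumed to live at <reads_dir>/<sample>.fq.gz.
    """
    cmds = []
    # ustacks runs once per sample, each with a unique integer ID (-i).
    for i, sample in enumerate(samples, start=1):
        cmds.append(["ustacks", "-f", f"{reads_dir}/{sample}.fq.gz",
                     "-i", str(i), "-o", out_dir])
    # The remaining units take the Stacks directory (-P) and the population map (-M).
    for unit in ("cstacks", "sstacks", "tsv2bam", "gstacks", "populations"):
        cmds.append([unit, "-P", out_dir, "-M", popmap])
    return cmds


# Hypothetical sample names, for illustration only.
for cmd in stacks_commands(["sample_01", "sample_02"]):
    print(" ".join(cmd))
```

Each command list could then be passed to `subprocess.run(cmd, check=True)` on a system where Stacks is installed.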