Interpretation of high-throughput sequence data requires an understanding of how decisions made during bioinformatic data processing can influence results. One frequently cited source of bias is PCR clones (or PCR duplicates). PCR clones are common in restriction site-associated DNA sequencing (RAD-seq) datasets, which are increasingly being used for molecular ecology. To determine the influence that PCR clones and the bioinformatic handling of clones have on genotyping, we evaluated four RAD-seq datasets. Datasets were compared before and after clones were removed to estimate the number of clones present in RAD-seq data, quantify how often the presence of clones in a dataset causes genotype calls to change compared to when clones are removed, investigate the mechanisms that lead to genotype call changes, and test whether clones bias heterozygosity estimates. Our RAD-seq datasets contained 30-60% PCR clones, but 95% of RAD-tags had five or fewer clones. Relatively few genotypes changed once clones were removed (5-10%), and the vast majority of these changes (98%) were associated with genotypes switching from a called to a no-call state or vice versa. PCR clones had a larger influence on genotype calls in individuals with low read depth but appeared to influence genotype calls at all loci similarly. Removal of PCR clones reduced the number of called genotypes by 2% but had almost no influence on estimates of heterozygosity. As such, while steps should be taken to limit PCR clones during library preparation, PCR clones are likely not a substantial source of bias for most RAD-seq studies.
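The comparison of genotype calls before and after clone removal can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`normalize_gt`, `classify_change`, `summarize`) and the category labels are hypothetical, and it assumes genotypes are given as VCF-style GT strings such as `0/1` or `./.`.

```python
from collections import Counter


def normalize_gt(gt):
    """Return a sorted tuple of allele strings, or None for a no-call.

    Phased ("|") and unphased ("/") separators are treated alike, so
    "0/1" and "1|0" are considered the same genotype.
    """
    alleles = gt.replace("|", "/").split("/")
    if any(allele in (".", "") for allele in alleles):
        return None
    return tuple(sorted(alleles))


def classify_change(gt_unfiltered, gt_filtered):
    """Classify how one genotype call differs between the two datasets."""
    before = normalize_gt(gt_unfiltered)
    after = normalize_gt(gt_filtered)
    if before == after:
        return "unchanged"
    if before is None:
        return "no-call to called"
    if after is None:
        return "called to no-call"
    return "genotype switched"


def summarize(pairs):
    """Count change categories over (unfiltered, clone-filtered) GT pairs."""
    return Counter(classify_change(u, f) for u, f in pairs)


# Illustrative input only; these genotypes are made up, not study data.
example = [("0/1", "0/1"), ("0/1", "./."), ("./.", "0/0"), ("0/0", "0/1")]
print(summarize(example))
```

Keeping the classification in a small pure function makes it easy to separate changes between called and no-call states from true genotype switches, which is the distinction the abstract draws.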
Brook trout clone filtered
Clone-filtered VCF file of brook trout genotype data. VCF files were generated using stacks 2.46 with minimal filters (STACKS flags = -r 0.3, --min_maf 0.05). Data were generated using the SbfI enzyme following the methods outlined in Ali et al. (2016), prepared in the Genomic Variation Lab at the University of California--Davis, and sequenced on an Illumina NextSeq 500 (PE 75 bp reads, 96 samples/lane) at the Cornell Institute of Biotechnology.
bt_CF.vcf
Brook trout unfiltered
Non-clone-filtered (unfiltered) VCF file of brook trout genotype data. VCF files were generated using stacks 2.46 with minimal filters (STACKS flags = -r 0.3, --min_maf 0.05). Data were generated using the SbfI enzyme following the methods outlined in Ali et al. (2016), prepared in the Genomic Variation Lab at the University of California--Davis, and sequenced on an Illumina NextSeq 500 (PE 75 bp reads, 96 samples/lane) at the Cornell Institute of Biotechnology.
bt_noCF.vcf
Cisco clone filtered
Clone-filtered VCF file of cisco genotype data. VCF files were generated using stacks 2.46 with minimal filters (STACKS flags = -r 0.3, --min_maf 0.05). Data were generated using the SbfI enzyme following the methods outlined in Ali et al. (2016), prepared in the Larson Laboratory at the University of Wisconsin-Stevens Point, and sequenced on a HiSeq 4000 (PE 150 bp reads, 96 samples/lane) at the Michigan State Genomics Core Facility.
cisco_CF.vcf
Cisco unfiltered
Non-clone-filtered (unfiltered) VCF file of cisco genotype data. VCF files were generated using stacks 2.46 with minimal filters (STACKS flags = -r 0.3, --min_maf 0.05). Data were generated using the SbfI enzyme following the methods outlined in Ali et al. (2016), prepared in the Larson Laboratory at the University of Wisconsin-Stevens Point, and sequenced on a HiSeq 4000 (PE 150 bp reads, 96 samples/lane) at the Michigan State Genomics Core Facility.
cisco_noCF.vcf
Walleye clone filtered
Clone-filtered VCF file of walleye genotype data. VCF files were generated using stacks 2.46 with minimal filters (STACKS flags = -r 0.3, --min_maf 0.05). Data were generated using the SbfI enzyme following the methods outlined in Ali et al. (2016), prepared in the Larson Laboratory at the University of Wisconsin-Stevens Point, and sequenced on a HiSeq 4000 (PE 150 bp reads, 192 samples/lane) at the Michigan State Genomics Core Facility.
wal_CF.vcf
Walleye unfiltered
Non-clone-filtered (unfiltered) VCF file of walleye genotype data. VCF files were generated using stacks 2.46 with minimal filters (STACKS flags = -r 0.3, --min_maf 0.05). Data were generated using the SbfI enzyme following the methods outlined in Ali et al. (2016), prepared in the Larson Laboratory at the University of Wisconsin-Stevens Point, and sequenced on a HiSeq 4000 (PE 150 bp reads, 192 samples/lane) at the Michigan State Genomics Core Facility.
wal_noCF.vcf
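Each clone-filtered file above has a matching unfiltered file (for example, bt_CF.vcf and bt_noCF.vcf), so simple per-file summaries can be compared directly. The sketch below is one hedged way to compute the genotype call rate and the proportion of called genotypes that are heterozygous from a VCF, using only the Python standard library. The function names `genotype_stats` and `stats_for_file` are hypothetical, and the code assumes a standard tab-delimited VCF whose FORMAT column includes a GT field; it is not the analysis code used in the study.

```python
def genotype_stats(vcf_lines):
    """Summarize genotype calls from an iterable of VCF text lines.

    Returns counts of all genotype slots, called genotypes, and
    heterozygous calls, plus the call rate and the proportion of
    called genotypes that are heterozygous.
    """
    total = called = het = 0
    for line in vcf_lines:
        # Skip meta-information, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        for sample in fields[9:]:
            total += 1
            parts = sample.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            alleles = gt.replace("|", "/").split("/")
            if "." in alleles or "" in alleles:
                continue  # no-call genotype
            called += 1
            if len(set(alleles)) > 1:
                het += 1
    return {
        "total": total,
        "called": called,
        "het": het,
        "call_rate": called / total if total else 0.0,
        "obs_het": het / called if called else 0.0,
    }


def stats_for_file(path):
    """Convenience wrapper: open a VCF file by path and summarize it."""
    with open(path) as handle:
        return genotype_stats(handle)
```

With the files downloaded to the working directory, calling `stats_for_file("bt_CF.vcf")` and `stats_for_file("bt_noCF.vcf")` and comparing the `call_rate` and `obs_het` values gives a quick check of how clone filtering affects each dataset. Note that the two files in a pair may not contain identical sets of loci, so a locus-by-locus comparison would first need to match records on chromosome and position.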