Data from: Using transcriptome sequencing and pooled exome capture to study local adaptation in the giga-genome of Pinus cembra
Rellstab, Christian et al. (2018), Data from: Using transcriptome sequencing and pooled exome capture to study local adaptation in the giga-genome of Pinus cembra, Dryad, Dataset, https://doi.org/10.5061/dryad.4bb5849
Despite decreasing sequencing costs, whole-genome sequencing for population-based genome scans for selection is still prohibitively expensive for organisms with large genomes. Moreover, the repetitive nature of large genomes often represents a challenge in bioinformatic and downstream analyses. Here, we use in-depth transcriptome sequencing to design probes for exome capture in Swiss stone pine (Pinus cembra), a conifer with an estimated genome size of 29.3 Gbp and no reference genome available. We successfully applied around 55,000 self-designed probes, targeting 25,000 contigs, to DNA pools of seven populations from the Swiss Alps and identified >140,000 SNPs in around 13,000 contigs. The probes performed equally well in pools of the closely related species Pinus sibirica; in both species, more than 70% of the targeted contigs were sequenced at a depth of ≥ 40x (the number of haplotypes in the pool). However, a thorough analysis of individually sequenced P. cembra samples indicated that a majority of the contigs (63%) represented multi-copy genes. We therefore removed paralogous contigs based on heterozygote excess and deviation from allele balance. Once putatively paralogous contigs were excluded, allele frequencies from population pools were accurate estimates of individually determined allele frequencies. We show that inferences of neutral and adaptive genetic variation may be biased when such multi-copy genes are not accounted for. Without individual genotype data, it would have been nearly impossible to recognize and deal with the problem of multi-copy contigs. We advocate putting more emphasis on identifying paralogous loci, which will be facilitated by the establishment of additional high-quality reference genomes.
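The paralog filter mentioned in the abstract relies on two signals in individually genotyped samples: heterozygote excess and deviation from allele balance. The following is a minimal Python sketch of that general idea, not the authors' actual pipeline. The `SnpCounts` record, the function names, the two thresholds (0.2 and 0.15), and the rule that a single flagged SNP marks the whole contig are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class SnpCounts:
    """Per-SNP summary from individually genotyped samples (hypothetical layout)."""
    n_hom_ref: int      # individuals homozygous for the reference allele
    n_het: int          # heterozygous individuals
    n_hom_alt: int      # individuals homozygous for the alternative allele
    het_ref_reads: int  # reference-allele reads summed over heterozygotes
    het_alt_reads: int  # alternative-allele reads summed over heterozygotes


def het_excess(snp: SnpCounts) -> Optional[float]:
    """Observed minus Hardy-Weinberg expected heterozygosity; None without data."""
    n = snp.n_hom_ref + snp.n_het + snp.n_hom_alt
    if n == 0:
        return None
    p = (2 * snp.n_hom_ref + snp.n_het) / (2 * n)  # reference allele frequency
    expected = 2 * p * (1 - p)
    observed = snp.n_het / n
    return observed - expected


def allele_balance(snp: SnpCounts) -> Optional[float]:
    """Fraction of reference reads in heterozygotes; about 0.5 for a single-copy locus."""
    total = snp.het_ref_reads + snp.het_alt_reads
    if total == 0:
        return None
    return snp.het_ref_reads / total


def looks_paralogous(snp: SnpCounts,
                     max_het_excess: float = 0.2,       # assumed threshold
                     max_balance_dev: float = 0.15) -> bool:  # assumed threshold
    """Flag a SNP showing strong heterozygote excess or skewed allele balance."""
    excess = het_excess(snp)
    balance = allele_balance(snp)
    too_many_hets = excess is not None and excess > max_het_excess
    skewed = balance is not None and abs(balance - 0.5) > max_balance_dev
    return too_many_hets or skewed


def paralogous_contigs(snps_by_contig: Dict[str, List[SnpCounts]]) -> Set[str]:
    """Return contigs with at least one flagged SNP (an assumed, strict rule)."""
    return {contig for contig, snps in snps_by_contig.items()
            if any(looks_paralogous(s) for s in snps)}


# Toy data, invented for illustration only.
example = {
    "contig_A": [SnpCounts(25, 50, 25, 500, 500)],  # Hardy-Weinberg, balanced reads
    "contig_B": [SnpCounts(0, 100, 0, 500, 500)],   # every individual heterozygous
    "contig_C": [SnpCounts(25, 50, 25, 750, 250)],  # skewed allele balance
}
flagged = paralogous_contigs(example)
```

With these toy counts, `contig_B` is flagged for heterozygote excess and `contig_C` for skewed allele balance, while `contig_A` passes. In a real analysis the thresholds would need to be chosen with regard to sample size and sequencing depth, and a formal test would likely replace the fixed cut-offs.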