Data from: Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping by sequencing data from natural populations
McKinney, Garrett J.; Waples, Ryan K.; Seeb, Lisa W.; Seeb, James E. (2016), Data from: Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping by sequencing data from natural populations, Dryad, Dataset, https://doi.org/10.5061/dryad.cm08m
Whole genome duplications have occurred in the recent ancestors of many plants, fish, and amphibians, resulting in an abundance of paralogous loci and the potential for both disomic and tetrasomic inheritance in the same genome. Paralogs can be difficult to genotype reliably and are often excluded from genotyping-by-sequencing (GBS) analyses; however, removal requires paralogs to be identified, which is difficult without a reference genome. We present a method for identifying paralogs in natural populations by combining two properties of duplicated loci: (1) the expected frequency of heterozygotes exceeds that for singleton loci, and (2) within heterozygotes, observed read ratios for each allele in GBS data will deviate from the 1:1 ratio expected for singleton (diploid) loci. These deviations are often not apparent within individuals, particularly when sequence coverage is low, but we postulated that summing allele reads for each locus over all heterozygous individuals in a population would provide sufficient power to detect deviations at those loci. We identified paralogous loci in three species: Chinook salmon (Oncorhynchus tshawytscha), which retains regions with ongoing residual tetrasomy on eight chromosome arms following a recent whole genome duplication; mountain barberry (Berberis alpina), which has a large proportion of paralogs that arose through an unknown mechanism; and dusky parrotfish (Scarus niger), which has largely re-diploidized following an ancient whole genome duplication. Importantly, this approach requires only the genotype and allele-specific read counts for each individual, information that is readily obtained from most GBS analysis pipelines.
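The two per-locus quantities described above can be illustrated with a minimal Python sketch. It is not taken from the dataset or the authors' scripts: the function name `summarize_locus`, the genotype codes "AA", "AB", and "BB", and the use of a normal-approximation z-score against a 1:1 binomial expectation are all assumptions made for illustration. The abstract states only that allele reads are summed over heterozygotes and compared with the 1:1 expectation.

```python
import math
from typing import Iterable, NamedTuple, Tuple


class LocusSummary(NamedTuple):
    het_proportion: float  # heterozygotes / genotyped individuals
    read_ratio: float      # summed allele-A reads / summed reads, heterozygotes only
    z_score: float         # deviation of summed allele-A reads from a 1:1 expectation


def summarize_locus(calls: Iterable[Tuple[str, int, int]]) -> LocusSummary:
    """Summarize one locus from (genotype, reads_a, reads_b) per individual.

    Hypothetical input format: genotype is "AA", "AB", or "BB";
    any other value is treated as a missing call and skipped.
    """
    n_genotyped = 0
    n_het = 0
    sum_a = 0
    sum_b = 0
    for genotype, reads_a, reads_b in calls:
        if genotype not in ("AA", "AB", "BB"):
            continue  # missing genotype
        n_genotyped += 1
        if genotype == "AB":
            n_het += 1
            # Pool reads across all heterozygotes to gain power at low coverage.
            sum_a += reads_a
            sum_b += reads_b

    nan = float("nan")
    het_proportion = n_het / n_genotyped if n_genotyped else nan
    total = sum_a + sum_b
    if total == 0:
        return LocusSummary(het_proportion, nan, nan)

    read_ratio = sum_a / total
    # Assumed test statistic: under a 1:1 ratio, allele-A reads are
    # Binomial(total, 0.5), with mean total/2 and variance total/4.
    z_score = (sum_a - total / 2) / math.sqrt(total * 0.25)
    return LocusSummary(het_proportion, read_ratio, z_score)


# Toy data (invented): a balanced locus and a skewed, all-heterozygote locus.
balanced = summarize_locus([("AB", 10, 10), ("AB", 12, 8), ("AA", 20, 0), ("BB", 0, 15)])
skewed = summarize_locus([("AB", 30, 10), ("AB", 28, 12)])
print(balanced)
print(skewed)
```

In this toy input, the second locus combines a heterozygote proportion of 1.0 with a pooled read ratio well away from 0.5, the joint pattern the abstract associates with paralogs. Any thresholds for flagging loci would need to be chosen from the data and are not implied here.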