Data from: Next-generation sequencing for molecular ecology: a caveat regarding pooled samples


Anderson, Eric C.; Skaug, Hans J.; Barshis, Daniel J. (2013), Data from: Next-generation sequencing for molecular ecology: a caveat regarding pooled samples, Dryad, Dataset


We develop a model based on the Dirichlet-compound multinomial distribution (CMD) and the Ewens sampling formula to predict the fraction of SNP loci that will appear fixed for alternate alleles between two pooled samples drawn from the same underlying population. We apply this model to next-generation sequencing (NGS) data from Baltic Sea herring recently published by (Corander et al., , Molecular Ecology, 2931–2940), and show that there are many more fixed loci than expected in the absence of genetic structure. However, we show through coalescent simulations that the degree of population structure required to explain the fraction of alternatively fixed SNPs is extraordinarily high, and that the surplus of fixed loci is more likely a consequence of limited representation of individual gene copies in the pooled samples than of population structure. Our analysis signals that the use of NGS on pooled samples to identify divergent SNPs warrants caution. With pooled samples, it is hard to diagnose when an NGS experiment has gone awry. Especially when NGS data on pooled samples are of low read depth and come from a limited number of individuals, it may be worthwhile to temper claims of unexpected population differentiation, pending verification with more reliable methods or stricter adherence to recommended sampling designs for pooled sequencing (e.g. Futschik & Schlötterer , Genetics, 186, 207; Gautier et al., , Molecular Ecology, 3766–3779). Analysis of the data and diagnosis of problems is easier and more reliable (and can be less costly) with individually barcoded samples. Consequently, for some scenarios, individual barcoding may be preferable to pooling of samples.
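The central point, that limited representation of gene copies in a pool can by itself produce loci that look fixed for alternate alleles, can be illustrated with a small Monte Carlo sketch. This is not the authors' CMD/Ewens model or their coalescent simulations. It is a simplified stand-in that assumes a uniform population allele frequency at each locus, a fixed number of gene copies represented in each pool, and reads drawn with replacement from those copies. The function name `alt_fixed_fraction` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def alt_fixed_fraction(n_loci, gene_copies, read_depth, seed=0):
    """Fraction of observed SNPs that appear fixed for alternate alleles
    in two pools drawn from the SAME population (no genetic structure).

    Assumptions (not from the source): biallelic loci, allele frequency
    uniform on (0, 1), `gene_copies` copies represented per pool, and
    `read_depth` reads per pool sampled with replacement from those copies.
    """
    rng = random.Random(seed)
    n_snps = 0        # loci where both alleles are seen across the two pools
    n_alt_fixed = 0   # loci where each pool shows only one, different, allele

    for _ in range(n_loci):
        p = rng.random()  # population frequency of allele "1" at this locus
        allele1_reads = []
        for _pool in range(2):
            # Gene copies that actually made it into the pool.
            copies = [rng.random() < p for _ in range(gene_copies)]
            # Sequencing reads resample those copies with replacement.
            reads = sum(rng.choice(copies) for _ in range(read_depth))
            allele1_reads.append(reads)

        total = allele1_reads[0] + allele1_reads[1]
        if 0 < total < 2 * read_depth:  # both alleles observed: a SNP
            n_snps += 1
            a, b = allele1_reads
            if (a == 0 and b == read_depth) or (a == read_depth and b == 0):
                n_alt_fixed += 1

    return n_alt_fixed / n_snps if n_snps else 0.0


if __name__ == "__main__":
    for k in (1, 2, 10, 50):
        frac = alt_fixed_fraction(n_loci=5000, gene_copies=k, read_depth=10)
        print(f"gene copies per pool = {k:3d}: alt-fixed fraction = {frac:.4f}")
```

Under these assumptions, the fraction of alternately fixed SNPs falls as more gene copies are represented in each pool, even though both pools come from one population. In the extreme case of a single represented copy per pool, every observed SNP appears alternately fixed.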

