Data from: Suitability of different mapping algorithms for genome-wide polymorphism scans with Pool-Seq data.

Kofler, Robert; Langmüller, Anna Maria; Nouhaud, Pierre; Otte, Kathrin Anna; Schlötterer, Christian

Published Sep 27, 2016 on Dryad. https://doi.org/10.5061/dryad.2g3s4

Data files

Sep 27, 2016 version files (675.30 MB total):

fst.zip (339.66 MB)
ref-snps.zip (167.73 MB)
ref-snpsrandposindel.zip (167.73 MB)
scripts.zip (188.51 KB)

Abstract

The cost-effectiveness of sequencing pools of individuals (Pool-Seq) underlies the popularity and widespread use of this method for many research questions, ranging from unravelling the genetic basis of complex traits to the clonal evolution of cancer cells. Because the accuracy of Pool-Seq could be affected by many potential sources of error, several studies have determined, for example, the influence of the sequencing technology, the library preparation protocol, and mapping parameters. Nevertheless, the impact of the mapping tools has not yet been evaluated. Using simulated and real Pool-Seq data, we demonstrate a substantial impact of the mapping tools, leading to characteristic false positives in genome-wide scans. The problem of false positives was particularly pronounced when data with different read lengths and insert sizes were compared. Out of 14 evaluated algorithms, novoalign, bwa mem, and clc4 are the most suitable for mapping Pool-Seq data. Nevertheless, no single algorithm is sufficient for avoiding all false positives. We show that the intersection of the results of two mapping algorithms provides a simple yet effective strategy to eliminate false positives. We propose that the implementation of a consistent Pool-Seq bioinformatics pipeline building on the recommendations of this study can substantially increase the reliability of Pool-Seq results, in particular when libraries generated with different protocols are being compared.
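
The intersection strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from scripts.zip. It assumes that each mapper's genome-wide scan has already been reduced to a list of candidate sites given as (chromosome, position) pairs; the function name, the site representation, and the example coordinates are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical site representation: (chromosome name, position).
Site = Tuple[str, int]


def intersect_candidates(mapper_a: Iterable[Site],
                         mapper_b: Iterable[Site]) -> List[Site]:
    """Return the sites reported as candidates by both mappers.

    A site that appears in only one of the two inputs is treated as a
    possible mapper-specific false positive and is dropped.
    """
    return sorted(set(mapper_a) & set(mapper_b))


if __name__ == "__main__":
    # Made-up coordinates, not results from the dataset.
    hits_first_mapper = [("2L", 1500), ("2L", 98000), ("3R", 420)]
    hits_second_mapper = [("2L", 1500), ("3R", 420), ("X", 77)]
    print(intersect_candidates(hits_first_mapper, hits_second_mapper))
```

Using a set intersection keeps the sketch independent of how the candidate lists were produced, so the same step could follow any outlier criterion applied separately to each mapper's output.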

Usage notes

Simulated data: comparing allele frequencies with FST

Simulated data: with SNPs (no indels)

Simulated data: with SNPs and indels

Scripts used for the analysis
