Genotype-environment association (GEA) methods have become part of the standard landscape genomics toolkit, yet we know little about how best to filter genotyping-by-sequencing data to provide robust inferences of environmental adaptation. In many cases, default filtering thresholds for minor allele frequency and missing data are applied regardless of sample size, with unknown impacts on the results. These effects could be amplified in downstream predictions, including management strategies. Here, we investigate the effects of filtering on GEA results and the potential implications for assessing adaptation to the environment. We use empirical and simulated datasets derived from two widespread tree species to assess the effects of filtering on GEA outputs. Critically, we find that the level of filtering of missing data and minor allele frequency affects the identification of true positives. Even slight adjustments to these thresholds can change the rate of true positive detection. Using conservative thresholds for missing data and minor allele frequency substantially reduces the size of the dataset, lessening the power to detect adaptive variants (i.e. simulated true positives) under both strong and weak selection. Regardless, strength of selection was a good predictor of GEA detection, but even some SNPs under strong selection went undetected across all methods and filtering parameters. The rate of false positives varied markedly depending on the species and GEA method but scaled with the number of SNPs in the dataset, indicating that false positives are sensitive to the interaction between demography, sampling design, and filtering parameters. We further show that filtering can significantly impact predictions of the adaptive capacity of species in downstream analyses. We make several recommendations regarding filtering for GEA methods.
Ultimately, there is no filtering panacea, but some choices are better than others, depending largely on the study system, availability of genomic resources, and desired objectives of the study.

Original datasets were collected from published work and are provided here.

The demographic history of these empirical datasets was simulated using Baypass.simulation in R. Missing data were then added to the simulated genotypes, and the results were filtered to create 150 separate but linked datasets. The code used to create these datasets is provided here.

R was used for the pipeline. All R code is provided for the creation of simulated datasets and filtering of those datasets.

We also provide the .012 data input files (.txt) with their environment files (.env), along with the outputs of baypass (.csv) and lfmm (.calpval).

Output file names look like this: emsim_156_6_0.5_0.1.txt.lfmm_env_2.calpval. This naming convention is the same throughout.

emsim = name of the dataset (E. microcarpa simulation)

156 = number of individuals (i.e., sample size)

6 = number of individuals per population

0.5 = the missing data threshold, one of 0.5, 0.6, 0.7, 0.8, or 0.9 (note: for coding purposes this is the proportion of data kept, so a maximum of 10% missing data is coded as 0.9)

0.1 = minor allele frequency threshold (one of 0.1, 0.05, or 0.01)
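
When working with many output files, the naming convention above can be decoded programmatically. The following is a minimal Python sketch, assuming every file name follows the pattern of the example and contains ".txt" after the five underscore-separated fields. The function name `parse_output_name` and the dictionary keys are hypothetical and are not part of the provided code.

```python
def parse_output_name(filename):
    """Split an output file name into its filtering parameters."""
    # Everything before ".txt" holds the five underscore-separated fields;
    # everything after it identifies the method and output type.
    stem, _, suffix = filename.partition(".txt")
    # rsplit keeps any underscores in the dataset name intact.
    dataset, n_ind, per_pop, kept, maf = stem.rsplit("_", 4)
    data_kept = float(kept)
    return {
        "dataset": dataset,
        "n_individuals": int(n_ind),
        "per_population": int(per_pop),
        "data_kept": data_kept,
        # The threshold is coded as the proportion of data kept,
        # so the allowed missingness is its complement.
        "max_missing": round(1 - data_kept, 10),
        "maf": float(maf),
        "suffix": suffix,
    }


info = parse_output_name("emsim_156_6_0.5_0.1.txt.lfmm_env_2.calpval")
```

Here `info["data_kept"]` is 0.5 and `info["max_missing"]` is also 0.5, while a file coded 0.9 would correspond to at most 10% missing data.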

Associated SNPs

V#####MT - SNPs associated with BIO5

V#####MP - SNPs associated with BIO14
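
The SNP labels above can be used to pick out the simulated true positives from a results table. The following is a small Python sketch, assuming each "#" stands for a digit and that the MT or MP suffix comes at the very end of the label. The function name `associated_variable` is hypothetical.

```python
import re

# Assumed label shape: "V", one or more digits, then "MT" or "MP".
_LABEL = re.compile(r"^V\d+(MT|MP)$")

# Suffix-to-variable mapping taken from the list above.
_VARIABLE = {"MT": "BIO5", "MP": "BIO14"}


def associated_variable(snp_label):
    """Return the associated climate variable, or None if the SNP is unlabelled."""
    match = _LABEL.match(snp_label)
    return _VARIABLE[match.group(1)] if match else None
```

A label without either suffix returns `None`, which under this assumption marks a SNP that was not simulated as environment-associated.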

Regarding the F-word: the effects of data Filtering on inferred genotype-environment associations
