The impact of estimator choice: Disagreement in clustering solutions across K estimators for Bayesian analysis of population genetic structure across a wide range of empirical datasets
Data files
Oct 04, 2021 version files 340.18 MB
- README.txt
- stankiewicz_et_al_2021.tar.gz
Abstract
The software program STRUCTURE is one of the most cited tools for determining population structure. To infer the optimal number of clusters from STRUCTURE output, the ΔK method is often applied. However, a recent study relying on simulated microsatellite data suggested that this method has a downward bias in its estimation of K and is sensitive to uneven sampling. If this finding holds for empirical datasets, conclusions about the scale of gene flow may have to be revised for a large number of studies. To determine the impact of method choice, we applied recently described estimators of K to re-estimate genetic structure in 41 empirical microsatellite datasets; 15 from a broad range of taxa and 26 focused on a diverse phylogenetic group, coral. We compared alternative estimates of K (Puechmaille statistics) with traditional (ΔK and posterior probability) estimates and found widespread disagreement of estimators across datasets. Thus, one estimator alone is insufficient for determining the optimal number of clusters regardless of study organism or evenness of sampling scheme. Subsequent analysis of molecular variance (AMOVA) between clustering solutions did not necessarily clarify which solution was best. To better infer population structure, we suggest a combination of visual inspection of STRUCTURE plots and calculation of the alternative estimators at various thresholds in addition to ΔK. Differences between estimators could reveal patterns with important biological implications, such as the potential for more population structure than previously estimated, as was the case for many studies reanalyzed here.
Methods
The data comprise microsatellite genotypes from 41 empirical datasets. Each dataset was re-run through the software program STRUCTURE (via the R package ParallelStructure) using the same settings as in the original publication. The STRUCTURE output was then processed in R to estimate the optimal K according to the ΔK method, the posterior probability method, and the Puechmaille statistics. The output was also processed with CLUMPAK.
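As an illustration of the ΔK step only, the following is a minimal Python sketch, not the R code used for this dataset. It assumes the replicate log-likelihoods ln P(X|K) have already been extracted from the STRUCTURE output into a dictionary. The function name delta_k and the example values are hypothetical. It also makes a simplifying assumption: the second-order difference is taken on the per-K mean log-likelihoods and then divided by the standard deviation across replicates at that K.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} for every K that has both neighbours K-1 and K+1.

    lnp_by_k maps each tested K to a list of ln P(X|K) values,
    one value per replicate STRUCTURE run (hypothetical input layout).
    """
    # Mean log-likelihood across replicates for each K.
    means = {k: mean(runs) for k, runs in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        # Delta K is undefined at the smallest and largest K tested.
        if k - 1 not in means or k + 1 not in means:
            continue
        runs = lnp_by_k[k]
        if len(runs) < 2:
            continue  # a standard deviation needs at least two replicates
        sd = stdev(runs)
        if sd == 0:
            continue  # avoid division by zero when all replicates agree
        # Absolute second-order rate of change of the mean log-likelihood.
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result


# Made-up log-likelihoods for illustration only (not from the archived data).
example = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-890, -895, -885],
    4: [-885, -880, -890],
}
scores = delta_k(example)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this made-up example the largest ΔK falls at K = 2. Note that ΔK cannot be evaluated at the smallest or largest K tested, which is one reason the abstract recommends comparing it with the other estimators and with visual inspection of the STRUCTURE plots.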