Data from: Subsampling reveals that unbalanced sampling affects STRUCTURE results in a multi-species dataset
Data files
Jul 10, 2018 version files 1.55 GB
data.zip (112.30 KB)
R scripts used for parsing results.zip (9.48 KB)
R scripts used for subsampling and structure.zip (51.65 KB)
Results with alternative ancestry model.zip (23.87 MB)
Structure_Aal.zip (146.21 MB)
Structure_Cse.zip (137.70 MB)
Structure_Doc.zip (112.30 MB)
Structure_Gmo.zip (122.38 MB)
Structure_Gre.zip (88.20 MB)
Structure_Hhe.zip (119.51 MB)
Structure_Hun.zip (148.35 MB)
Structure_Jtr.zip (133.08 MB)
Structure_Lal.zip (160.94 MB)
Structure_Lpr.zip (129.84 MB)
Structure_Sco.zip (97.23 MB)
Structure_Sst.zip (131.31 MB)
Abstract
Studying the genetic population structure of species can reveal important insights into key evolutionary, historical, demographic, and anthropogenic processes. One of the most important statistical tools for inferring genetic clusters is the program STRUCTURE. Recently, several papers have pointed out that STRUCTURE may be biased when the sampling design is unbalanced, resulting in spurious joining of underrepresented populations and spurious separation of overrepresented populations. Suggestions to overcome this bias include subsampling and changing the ancestry model, but the performance of these two methods has not yet been tested on actual data. Here, I use a dataset of twelve high-alpine plant species to test whether unbalanced sampling affects the STRUCTURE inference of population differentiation between the European Alps and the Carpathians. For four of the twelve species, subsampling the Alpine populations, so that sample sizes matched between the Alps and the Carpathians, resulted in a drastically different clustering from that obtained with the full dataset. In contrast, STRUCTURE results with the alternative ancestry model were indistinguishable from the results with the default model. Based on these results, subsampling seems a more viable approach to overcoming the bias than the alternative ancestry model. However, subsampling is only possible when there is an a priori expectation of what constitutes the main clusters. Although these results do not mean that the use of STRUCTURE should be discarded, they do indicate that users of the software should be cautious when interpreting results from unbalanced sampling.
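The subsampling idea described in the abstract can be illustrated with a minimal sketch. This is not the author's procedure (the actual analysis is in the R scripts listed under Data files); it only assumes that each sample carries a region label and that the overrepresented region is randomly reduced to the size of the smallest one. The function name `balance_regions`, the `(sample_id, region)` tuple layout, and the choice to subsample individual samples rather than whole populations are all hypothetical.

```python
import random
from collections import defaultdict


def balance_regions(samples, seed=None):
    """Randomly subsample so every region keeps the same number of samples.

    `samples` is a list of (sample_id, region) tuples. This layout is an
    assumption made for illustration, not the format of the deposited data.
    """
    rng = random.Random(seed)

    # Group sample identifiers by their region label.
    by_region = defaultdict(list)
    for sample_id, region in samples:
        by_region[region].append(sample_id)

    if not by_region:
        return []

    # The smallest region sets the target size for all regions.
    target = min(len(ids) for ids in by_region.values())

    balanced = []
    for region, ids in by_region.items():
        # Draw without replacement from larger regions; keep smaller ones whole.
        kept = rng.sample(ids, target) if len(ids) > target else list(ids)
        balanced.extend((sample_id, region) for sample_id in kept)
    return balanced


# Hypothetical unbalanced design: 30 Alpine samples, 8 Carpathian samples.
samples = [(f"alps_{i}", "Alps") for i in range(30)]
samples += [(f"carp_{i}", "Carpathians") for i in range(8)]
balanced = balance_regions(samples, seed=1)
```

Because a single random draw may not be representative, a sketch like this would normally be repeated with several seeds and the clustering compared across replicates. As the abstract notes, the approach presupposes an a priori expectation of the main clusters, here encoded as the region labels.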