Studying the genetic population structure of species can reveal important insights into several key evolutionary, historical, demographic, and anthropogenic processes. One of the most important statistical tools for inferring genetic clusters is the program STRUCTURE. Recently, several papers have pointed out that STRUCTURE may show a bias when the sampling design is unbalanced, resulting in spurious joining of underrepresented populations and spurious separation of overrepresented populations. Suggestions to overcome this bias include subsampling and changing the ancestry model, but the performance of these two methods has not yet been tested on actual data. Here, I use a dataset of twelve high-alpine plant species to test whether unbalanced sampling affects the STRUCTURE inference of population differentiation between the European Alps and the Carpathians. For four of the twelve species, subsampling of the Alpine populations (to match the sample size between the Alps and the Carpathians) resulted in a drastically different clustering than the full dataset. On the other hand, STRUCTURE results with the alternative ancestry model were indistinguishable from the results with the default model. Based on these results, the subsampling strategy seems a more viable approach to overcome the bias than the alternative ancestry model. However, subsampling is only possible when there is an a priori expectation of what constitutes the main clusters. Though these results do not mean that the use of STRUCTURE should be discarded, they do indicate that users of the software should be cautious about the interpretation of the results when sampling is unbalanced.
data
This folder contains the AFLP marker data for every species. The data are formatted as plain text files, with each AFLP marker represented by a single column. A zero (0) represents absence of a marker band and a one (1) represents presence of a marker band.
Preceding the AFLP data are four columns with metadata:
1) individual: The name of the individual consisting of a three-digit code indicating the cell on the IntraBioDiv grid where the individual was sampled, followed by a hyphen and an index number.
2) usable: A Boolean (t/f) indicating whether the data are of sufficient quality to make the individual usable.
3) longitude: The x-coordinate of the sampling location in decimal degrees.
4) latitude: The y-coordinate of the sampling location in decimal degrees.
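The layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the files are whitespace-delimited with a single header row and that marker columns contain only 0 and 1, none of which is stated in this description. The function name parse_aflp, the marker names, and the example rows are made up for illustration.

```python
def parse_aflp(text):
    """Parse AFLP text: four metadata columns, then one 0/1 column per marker.

    Assumes whitespace-delimited fields and one header row (an assumption,
    not something the data description specifies).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    records = []
    for ln in lines[1:]:
        fields = ln.split()
        records.append({
            "individual": fields[0],
            # The part before the hyphen is the IntraBioDiv grid cell code.
            "grid_cell": fields[0].split("-")[0],
            "usable": fields[1].lower() == "t",
            "longitude": float(fields[2]),
            "latitude": float(fields[3]),
            "markers": [int(v) for v in fields[4:]],
        })
    return header[4:], records


# Made-up example rows, not taken from the dataset.
example = """individual usable longitude latitude m1 m2 m3
101-1 t 10.5 46.8 0 1 1
101-2 f 10.5 46.8 1 1 0
205-1 t 24.6 45.6 0 0 1
"""

marker_names, records = parse_aflp(example)
usable = [r for r in records if r["usable"]]
```

Filtering on the usable column before any further analysis keeps low-quality individuals out of the marker matrix.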
R scripts used for parsing results
This folder contains several scripts that read the results of the Structure analyses, including those of all subsampling replicates, and produce plots. Note that the working directory for R should be set to the folder "Results with default parameters".
R scripts used for subsampling and structure
This folder contains for each species the script that reads the data file, writes it in Structure format, and creates 500 Structure files in which the populations of the Alps have been subsampled to match the number of populations from the Carpathians. For each file it then runs a Structure analysis with ten replicate runs per data file (the mainparams and extraparams files required by Structure are also included). The script also reads and parses all the output produced by Structure. Note that this script requires the Structure executable in order to run correctly, and that it takes a very long time to finish.
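The subsampling step described above can be illustrated in Python. This is a sketch of the idea only, not the R scripts in this folder: it assumes each population already carries a region label ("Alps" or "Carpathians"), and the population codes, the function name subsample_alps, and the seed are hypothetical.

```python
import random


def subsample_alps(pop_region, rng):
    """Keep all Carpathian populations and an equally sized random
    subset of the Alpine populations."""
    alps = sorted(p for p, r in pop_region.items() if r == "Alps")
    carpathians = sorted(p for p, r in pop_region.items() if r == "Carpathians")
    if len(alps) < len(carpathians):
        raise ValueError("fewer Alpine than Carpathian populations")
    # Sample without replacement so no Alpine population is drawn twice.
    kept = rng.sample(alps, len(carpathians))
    return sorted(kept) + carpathians


# Hypothetical population codes and region labels.
pop_region = {
    "101": "Alps", "102": "Alps", "103": "Alps", "104": "Alps", "105": "Alps",
    "201": "Carpathians", "202": "Carpathians",
}

rng = random.Random(42)  # fixed seed so the replicates are reproducible
replicates = [subsample_alps(pop_region, rng) for _ in range(500)]
```

Each of the 500 replicates would then be written out as its own Structure input file and analysed separately.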
Results with alternative ancestry model
This folder contains for every species the results of a Structure analysis on the full dataset (without subsampling) using the alternative ancestry model suggested by Wang (2017). The results are distributed over three folders, depending on the value used for the initial alpha setting of Structure: Alpha=0.3, Alpha=0.5 and Alpha=1.0.
For Alpha=0.3 and Alpha=1.0, only the results for K=2 clusters are shown; for Alpha=0.5, results are shown for values of K ranging from 1 to 11. In all cases, ten replicate runs were performed per value of K.
Results Arabis alpina
This folder contains for this species the full and subsampled input files for Structure plus the results of the Structure analyses on the full and 500 subsampled datasets (in total 15,618 files per species). The full dataset and the first 100 subsampled datasets have been run for values of K from 1 to 11. The other 400 subsampled datasets have been run only for K=2. In all cases, ten replicate runs were performed per value of K.
-The Structure input files are named "infileStructure_", followed by the three-letter species abbreviation and an index number from 1 to 501 (e.g. "infileStructure_Hun_465.txt"). The file with index number 1 contains the full dataset; the files with indexes 2 to 501 contain subsampled datasets where the number of populations in the Alps matches that in the Carpathians.
-The Structure output files are named "structOut_", followed by the three-letter species abbreviation, the number of clusters (K), the replicate Structure run for this value of K, and the index number of the input file from 1 to 501 (e.g. "structOut_Hun_k2_rep4_orep465_f").
-In addition, this folder contains some plots and files with parsed output, plus the mainparams and extraparams files used by Structure.
Structure_Aal.zip
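The naming scheme described above lends itself to parsing with regular expressions. The following Python sketch is based only on the two example file names given in this description; the patterns, the function names, and the assumption that every output file ends in "_f" are illustrative and have not been checked against the archives. It also tallies the number of Structure output files implied by the run design.

```python
import re

# Patterns inferred from the two example names in the description.
OUT_RE = re.compile(r"^structOut_([A-Za-z]{3})_k(\d+)_rep(\d+)_orep(\d+)_f$")
IN_RE = re.compile(r"^infileStructure_([A-Za-z]{3})_(\d+)\.txt$")


def parse_output_name(name):
    """Split a Structure output file name into its components."""
    m = OUT_RE.match(name)
    if m is None:
        return None
    species, k, rep, idx = m.groups()
    return {
        "species": species,
        "K": int(k),
        "replicate": int(rep),
        "input_index": int(idx),
        "full_dataset": int(idx) == 1,  # index 1 is the full dataset
    }


def parse_input_name(name):
    """Return (species, index) for a Structure input file name."""
    m = IN_RE.match(name)
    return None if m is None else (m.group(1), int(m.group(2)))


# Output files implied by the design: the full dataset plus 100 subsampled
# datasets at K = 1..11, and 400 subsampled datasets at K = 2 only,
# each with ten replicate runs.
expected_outputs = (1 + 100) * 11 * 10 + 400 * 1 * 10

parsed = parse_output_name("structOut_Hun_k2_rep4_orep465_f")
```

Under these assumptions there are 15,110 output files and 501 input files per species, 15,611 in total. That is slightly below the stated 15,618, which would be consistent with the additional plots, parsed output, and parameter files.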
Results Carex sempervirens
See above for details
Structure_Cse.zip
Results Dryas octopetala
See above for details
Structure_Doc.zip
Results Geum montanum
See above for details
Structure_Gmo.zip
Results Geum reptans
See above for details
Structure_Gre.zip
Results Hedysarum hedysaroides
See above for details
Structure_Hhe.zip
Results Hypochaeris uniflora
See above for details
Structure_Hun.zip
Results Juncus trifidus
See above for details
Structure_Jtr.zip
Results Luzula alpinopilosa
See above for details
Structure_Lal.zip
Results Loiseleuria procumbens
See above for details
Structure_Lpr.zip
Results Sesleria caerulea
See above for details
Structure_Sco.zip
Results Saxifraga stellaris
See above for details
Structure_Sst.zip