Studying the genetic population structure of species can reveal important insights into several key evolutionary, historical, demographic, and anthropogenic processes. One of the most important statistical tools for inferring genetic clusters is the program STRUCTURE. Recently, several papers have pointed out that STRUCTURE may show a bias when the sampling design is unbalanced, resulting in spurious joining of underrepresented populations and spurious separation of overrepresented populations. Suggestions to overcome this bias include subsampling and changing the ancestry model, but the performance of these two methods has not yet been tested on actual data. Here, I use a dataset of twelve high-alpine plant species to test whether unbalanced sampling affects the STRUCTURE inference of population differentiation between the European Alps and the Carpathians. For four of the twelve species, subsampling of the Alpine populations, to match the sample size between the Alps and the Carpathians, resulted in a drastically different clustering than the full dataset. On the other hand, STRUCTURE results with the alternative ancestry model were indistinguishable from the results with the default model. Based on these results, the subsampling strategy seems a more viable approach to overcome the bias than the alternative ancestry model. However, subsampling is only possible when there is an a priori expectation of what constitutes the main clusters. Though these results do not mean that the use of STRUCTURE should be discarded, they do indicate that users of the software should be cautious about the interpretation of the results when sampling is unbalanced.

data

This folder contains the AFLP marker data for every species. The data are formatted as plain text files in which each AFLP marker is represented by a single column. A zero (0) represents absence of a marker band and a one (1) represents presence of a marker band. Preceding the AFLP data are four columns with metadata:
1) individual: the name of the individual, consisting of a three-digit code indicating the cell on the IntraBioDiv grid where the individual was sampled, followed by a hyphen and an index number
2) usable: a Boolean (t/f) indicating whether the data are of sufficient quality to make the individual usable
3) longitude: the x-coordinate of the sampling location in decimal degrees
4) latitude: the y-coordinate of the sampling location in decimal degrees
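The column layout described above can be illustrated with a minimal parsing sketch. The scripts in this dataset are written in R; the Python below is only an illustration and is not part of the dataset. It assumes whitespace-separated columns and no header row, neither of which is stated in the description, and the names `Individual` and `parse_aflp_line` as well as the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Individual:
    name: str           # full name: grid cell code, hyphen, index number
    cell: str           # three-digit IntraBioDiv grid cell code
    usable: bool        # parsed from the t/f quality flag
    longitude: float    # decimal degrees
    latitude: float     # decimal degrees
    markers: List[int]  # one entry per AFLP marker: 0 = band absent, 1 = band present


def parse_aflp_line(line: str) -> Individual:
    """Parse one row: four metadata columns followed by the AFLP marker columns.

    Assumption: columns are separated by whitespace.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected four metadata columns plus at least one marker")
    name, usable, lon, lat = fields[:4]
    markers = [int(value) for value in fields[4:]]
    if any(m not in (0, 1) for m in markers):
        raise ValueError(f"marker values must be 0 or 1 in row {name!r}")
    return Individual(
        name=name,
        cell=name.split("-")[0],
        usable=usable.lower().startswith("t"),
        longitude=float(lon),
        latitude=float(lat),
        markers=markers,
    )


# Hypothetical rows, invented for illustration only.
sample_rows = [
    "123-1 t 10.25 46.50 0 1 1 0",
    "123-2 f 10.25 46.50 1 1 0 0",
]
individuals = [parse_aflp_line(row) for row in sample_rows]
usable_individuals = [ind for ind in individuals if ind.usable]
```

Filtering on the `usable` flag before any analysis mirrors the purpose of that column as described above.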

R scripts used for parsing results

This folder contains several scripts that read the results of the Structure analyses, including those of all replicate subsampled datasets, and produce plots. Note that the working directory for R should be set to the folder "Results with default parameters".

R scripts used for subsampling and structure

This folder contains for each species the script that reads the data file, writes it in Structure format, and creates 500 Structure files in which the populations of the Alps have been subsampled to match the number of populations from the Carpathians. For each file, the script then runs a Structure analysis with ten replicate runs per data file (the mainparams and extraparams files required by Structure are also included). The script also reads and parses all the output produced by Structure. Note that the script requires the Structure executable in order to run correctly, and that it takes a very long time to finish.
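The subsampling step described above can be sketched as follows. This is an illustrative Python sketch, not the R code in this folder. It assumes that populations are identified by a code (for example the grid cell), that the assignment of populations to the Alps or the Carpathians is already known, and that Alpine populations are drawn without replacement. The function name `subsample_alps` and the population codes are hypothetical.

```python
import random
from typing import List, Sequence


def subsample_alps(
    alps_pops: Sequence[str],
    carpathian_pops: Sequence[str],
    n_replicates: int = 500,
    seed: int = 0,
) -> List[List[str]]:
    """Return one list of retained populations per subsampled dataset.

    Each replicate keeps every Carpathian population and a random draw of
    Alpine populations of the same size (assumption: without replacement).
    """
    n_target = len(carpathian_pops)
    if n_target > len(alps_pops):
        raise ValueError("fewer Alpine than Carpathian populations; nothing to subsample")
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    replicates = []
    for _ in range(n_replicates):
        alps_subset = sorted(rng.sample(list(alps_pops), n_target))
        replicates.append(alps_subset + list(carpathian_pops))
    return replicates


# Hypothetical population codes, invented for illustration only.
alps = ["101", "102", "103", "104", "105", "106"]
carpathians = ["201", "202"]
datasets = subsample_alps(alps, carpathians, n_replicates=500, seed=1)
```

Each entry of `datasets` would then be used to select the individuals written to one subsampled Structure input file.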

Results with alternative ancestry model

This folder contains for every species the results of a Structure analysis on the full dataset (without subsampling) using the alternative ancestry model suggested by Wang (2017). The results are distributed over three folders, depending on the value used for the initial alpha setting of Structure: Alpha=0.3, Alpha=0.5 and Alpha=1.0. For Alpha=0.3 and Alpha=1.0, only the results for K=2 clusters are shown; for Alpha=0.5, results are shown for values of K ranging from 1 to 11. In all cases, ten replicate runs were performed per value of K.
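The combinations of initial alpha, K and replicate described above can be enumerated with a short sketch. It is written in Python for illustration only. The names `RUN_GRID` and `list_runs` are hypothetical, and numbering the replicates from 1 to 10 is an assumption.

```python
# Initial alpha setting -> values of K for which results are provided.
RUN_GRID = {
    0.3: [2],
    0.5: list(range(1, 12)),  # K = 1 .. 11
    1.0: [2],
}
REPLICATES = 10  # ten replicate runs per value of K


def list_runs():
    """Return every (alpha, K, replicate) combination per species."""
    return [
        (alpha, k, rep)
        for alpha, ks in RUN_GRID.items()
        for k in ks
        for rep in range(1, REPLICATES + 1)
    ]


runs = list_runs()
runs_per_alpha = {alpha: len(ks) * REPLICATES for alpha, ks in RUN_GRID.items()}
```

Under this reading, each species has 10 runs for Alpha=0.3, 110 for Alpha=0.5 and 10 for Alpha=1.0.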

Results Arabis alpina

This folder contains for this species the full and subsampled input files for Structure, plus the results of the Structure analyses on the full dataset and the 500 subsampled datasets (in total 15,618 files per species). The full dataset and the first 100 subsampled datasets have been run for values of K from 1 to 11. The other 400 subsampled datasets have been run only for K=2. In all cases, ten replicate runs were performed per value of K.
- The Structure input files are named "infileStructure_", followed by the three-letter species abbreviation and an index number from 1 to 501 (e.g. "infileStructure_Hun_465.txt"). The file with index number 1 contains the full dataset; the files with index numbers 2 to 501 contain subsampled datasets in which the number of populations in the Alps matches that in the Carpathians.
- The Structure output files are named "structOut_", followed by the three-letter species abbreviation, the number of clusters, the replicate Structure run for this value of K, and the index number of the input file from 1 to 501 (e.g. "structOut_Hun_k2_rep4_orep465_f").
- In addition, this folder contains some plots and files with parsed output, plus the mainparams and extraparams files used by Structure.
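The naming scheme above can be handled with a small sketch that parses an output file name and enumerates the output names implied by the description. It is an illustration in Python, not part of the dataset. The pattern is inferred from the single example name given above, numbering the replicates from 1 to 10 is an assumption, and the function names are hypothetical.

```python
import re
from itertools import product

# Pattern inferred from the example "structOut_Hun_k2_rep4_orep465_f".
OUTPUT_NAME = re.compile(
    r"^structOut_(?P<species>[A-Za-z]{3})_k(?P<k>\d+)_rep(?P<rep>\d+)_orep(?P<index>\d+)_f$"
)


def parse_output_name(name: str) -> dict:
    """Split an output file name into species, K, replicate and input index."""
    match = OUTPUT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised Structure output name: {name!r}")
    return {
        "species": match["species"],
        "k": int(match["k"]),
        "rep": int(match["rep"]),
        "index": int(match["index"]),
    }


def expected_output_names(species: str) -> list:
    """List the output names implied by the run design described above.

    Index 1 (full dataset) and indexes 2 to 101 (first 100 subsampled
    datasets) are run for K = 1..11; indexes 102 to 501 only for K = 2.
    """
    names = []
    for index in range(1, 502):
        ks = range(1, 12) if index <= 101 else [2]
        for k, rep in product(ks, range(1, 11)):
            names.append(f"structOut_{species}_k{k}_rep{rep}_orep{index}_f")
    return names


parsed = parse_output_name("structOut_Hun_k2_rep4_orep465_f")
all_names = expected_output_names("Hun")
```

Under these assumptions the design implies 15,110 output files per species. Together with the 501 input files this gives 15,611 files, so the rest of the stated total of 15,618 would presumably be the plots, parsed output and parameter files.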

Structure_Aal.zip

Results Carex sempervirens

See above for details

Structure_Cse.zip

Results Dryas octopetala

See above for details

Structure_Doc.zip

Results Geum montanum

See above for details

Structure_Gmo.zip

Results Geum reptans

See above for details

Structure_Gre.zip

Results Hedysarum hedysaroides

See above for details

Structure_Hhe.zip

Results Hypochaeris uniflora

See above for details

Structure_Hun.zip

Results Juncus trifidus

See above for details

Structure_Jtr.zip

Results Luzula alpinopilosa

See above for details

Structure_Lal.zip

Results Loiseleuria procumbens

See above for details

Structure_Lpr.zip

Results Sesleria caerulea

See above for details

Structure_Sco.zip

Results Saxifraga stellaris

See above for details

Structure_Sst.zip

Data from: Subsampling reveals that unbalanced sampling affects STRUCTURE results in a multi-species dataset
