Data from: Identifying Key Biodiversity Areas based on distinct genetic diversity

Gronefeld, Sarah Christin; López, Heriberto; Schmidt, Robin; Hochkirch, Axel

Published Nov 07, 2025; Updated Dec 04, 2025 on Dryad. https://doi.org/10.5061/dryad.573n5tbhk

Data files

Nov 07, 2025 version files (133.01 GB)

231201_NB501850_A_L1_4_AZJH_1_R1.fastq (37.37 GB)
231201_NB501850_A_L1_4_AZJH_1_R2.fastq (37.37 GB)
231201_NB501850_A_L1_4_AZJH_2_R1.fastq (29.13 GB)
231201_NB501850_A_L1_4_AZJH_2_R2.fastq (29.13 GB)
README.md (2.63 KB)
Supp_Ariagona_dataset.str (2.30 MB)

Dec 04, 2025 version files (133.01 GB)

231201_NB501850_A_L1_4_AZJH_1_R1.fastq (37.37 GB)
231201_NB501850_A_L1_4_AZJH_1_R2.fastq (37.37 GB)
231201_NB501850_A_L1_4_AZJH_2_R1.fastq (29.13 GB)
231201_NB501850_A_L1_4_AZJH_2_R2.fastq (29.13 GB)
README.md (2.64 KB)
Supp_Ariagona_dataset.str (2.30 MB)

Abstract

Key Biodiversity Areas (KBAs) are sites that contribute significantly to the global persistence of biodiversity. Distinct genetic diversity has been introduced as one of the metrics to estimate whether a site holds a threshold proportion of a species’ global genetic diversity during the KBA identification process. However, genetic data has so far not been used due to the lack of thoroughly tested methods and guidance. We tested the applicability of Analyses of Molecular Variance (AMOVA), allelic overlap, the diversity index Simpson's λ, Δ+, D_est, and effective population size (N_e) for the identification of KBAs. We conclude that Δ+, a measure originally developed to quantify the taxonomic distinctness of biotic communities, performs best in the context of KBA identification: it reflects the unique nature of a species’ genetic diversity, is based on simple allele frequencies, and can be easily applied and calculated. AMOVA, N_e, allelic overlap, and our modified version of λ were difficult to apply, to interpret, or both. D_est is easily applied for measuring genetic distinctiveness but not genetic diversity. For this reason, it may not be suitable for prioritizing areas for the long-term protection of the species.

Here, we deposit additional information on which methods were calculated on which datasets and how, including references, as well as information on how these methods performed. We also include raw reads, a processed .str file, and further information on the additional dataset we created to add a case study to our publication. The code we used for our analyses can be found on GitHub and Zenodo. For more information about the code and the detailed procedure, see the associated publication, the README on GitHub, and the Supplemental_Information.pdf file we publish here.

Identifying KBAs based on distinct genetic diversity

Our article explains which method is best suited to identifying KBAs based on the distinct genetic diversity metric in the KBA standard. To achieve this, we applied six different methods to 30 different datasets: allelic overlap, AMOVA, Δ+, D_est, λ, and N_e. We also included two case studies. One was carried out on one of the 30 datasets and the other on our own dataset of the species Ariagona margaritae. We included collection information and barcodes for our own dataset. Code was deposited on GitHub (https://github.com/TheC0der856/genetic_distinct_diversity) and Zenodo.

Description of the data and file structure

Supplemental_Information.pdf contains all relevant information: all SuppTabs and SuppFigs, including descriptions of the data. It also places them in the context of our paper and gives our supplementary files an order. We recommend opening Supplemental_Information.pdf first.

Files and variables

Descriptions of tables and figures, additional information supporting the article (Zenodo): Supplemental_Information.pdf

All figures included in the Supplemental_Information.pdf are provided in pdf and jpg format (Zenodo):

SuppFig1.pdf
SuppFig2.pdf
SuppFig3.pdf

All tables included in the Supplemental_Information.pdf were added in txt, csv, or ods format (Zenodo):

SuppTab1.txt
SuppTab2.ods
SuppTab3.csv
SuppTab4.csv

These are the raw reads of the Ariagona margaritae dataset (Dryad):

231201_NB501850_A_L1_4_AZJH_1_R1.fastq
231201_NB501850_A_L1_4_AZJH_1_R2.fastq
231201_NB501850_A_L1_4_AZJH_2_R1.fastq
231201_NB501850_A_L1_4_AZJH_2_R2.fastq

This is the processed Ariagona margaritae dataset (Dryad):

Supp_Ariagona_dataset.str

Code/software

Tables can be opened in Excel or LibreOffice Calc. csv files can additionally be opened in WordPad, a text editor, or Notepad++.

pdf files can be opened in Adobe Acrobat, LibreOffice Draw, or in your browser.

A str file can be opened with Notepad++, and it can also be read into R with the adegenet package.

The fastq files are very large, and we recommend working with them only on a cluster with programs such as Stacks. We conducted our analyses with Stacks using bash commands. Apart from this, all our code was written in R.
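For a quick sanity check that does not load a file into memory, the reads can be counted by streaming. The following is a minimal Python sketch and not part of our pipeline; the function name `count_fastq_reads` is hypothetical, and it assumes the common unwrapped FASTQ layout of exactly four lines per read.

```python
def count_fastq_reads(lines):
    """Count reads in an iterable of FASTQ lines (assumes 4 lines per read).

    Only complete records are counted; the header and separator lines are
    checked so that a wrapped or corrupted file fails loudly.
    """
    reads = 0
    for index, line in enumerate(lines):
        position = index % 4
        if position == 0 and not line.startswith("@"):
            raise ValueError(f"line {index + 1}: expected '@' header")
        if position == 2 and not line.startswith("+"):
            raise ValueError(f"line {index + 1}: expected '+' separator")
        if position == 3:
            reads += 1  # quality line closes one complete record
    return reads


# Usage with a real file (streams line by line):
# with open("231201_NB501850_A_L1_4_AZJH_1_R1.fastq") as handle:
#     print(count_fastq_reads(handle))
```

Because the function accepts any iterable of lines, it works on an open file handle as well as on a small in-memory list for testing.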

Access information

The data are openly available in Dryad under the Creative Commons Zero (CC0) public domain dedication. No access restrictions apply. Our code is available on GitHub under the MIT Licence.

Methods

As both SNP and microsatellite datasets are commonly used to analyze intraspecific genetic variance, we tested the performance of our chosen analytical approaches on 30 published diploid datasets, of which 15 were SNP datasets (with an average of 184 SNP loci) and 15 were microsatellite datasets (with an average of 31 microsatellite loci). Each dataset was analyzed with six methods: AMOVA, allelic overlap, Δ⁺, D_est, λ corrected for sample size, and N_e. To apply all six methods, an R project was created that makes use of many packages that facilitate displaying results and working with genetic data and tables.

For better comparability between the six methods, all datasets were prepared in the same way. Sites with fewer than 30 individuals were removed from the analysis. Individuals with > 20% missing data were removed from the dataset. For loci with missing genotypes, the missing allele counts were replaced with the mean of the observed alleles at that locus across all individuals in the dataset.
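The three preparation steps can be sketched as follows. This is an illustrative Python sketch, not the R code used for the study; the function name `prepare`, the input layout (a list of `(site, genotype)` tuples with `None` for a missing allele count), and the order in which the filters are applied are assumptions.

```python
from collections import Counter
from statistics import mean


def prepare(individuals, min_site_size=30, max_missing=0.20):
    """Filter sites and individuals, then mean-impute missing allele counts.

    individuals: list of (site, genotype), where genotype is a list with one
    allele count per locus and None marks a missing value (assumed layout).
    """
    # Step 1: remove sites with fewer than `min_site_size` individuals.
    site_sizes = Counter(site for site, _ in individuals)
    kept = [(s, g) for s, g in individuals if site_sizes[s] >= min_site_size]

    # Step 2: remove individuals with more than `max_missing` missing data.
    kept = [(s, g) for s, g in kept
            if sum(x is None for x in g) / len(g) <= max_missing]
    if not kept:
        return []

    # Step 3: replace missing values with the locus mean across all
    # remaining individuals in the dataset.
    n_loci = len(kept[0][1])
    locus_means = []
    for j in range(n_loci):
        observed = [g[j] for _, g in kept if g[j] is not None]
        locus_means.append(mean(observed) if observed else None)

    return [(s, [locus_means[j] if x is None else x for j, x in enumerate(g)])
            for s, g in kept]
```

A locus that is missing in every remaining individual stays `None` in this sketch, since no mean can be computed for it.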

To explore similarities between the different approaches, correlations between the results of all six methods (allelic overlap, AMOVA, Δ⁺, D_est, N_e, and λ_cor) were calculated in R, along with correlations with allelic richness. Two outliers were removed from the AMOVA results. Kendall's rank correlation was used. The correlations are based on different sample sizes, since N_e could not be calculated for some areas.
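As an illustration of why the sample sizes differ, the sketch below computes Kendall's τ_b on only those areas where both metrics are available. It is a plain Python sketch rather than the R code used in the study; the function name `kendall_tau_b` and the use of `None` for a missing N_e are assumptions, and pairwise deletion is our reading of the text above, not a confirmed detail of the analysis.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b over the pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            dx = pairs[i][0] - pairs[j][0]
            dy = pairs[i][1] - pairs[j][1]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from all counts
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denominator


# Hypothetical values for five areas; N_e is missing for the third area.
delta_plus = [0.41, 0.37, 0.52, 0.30, 0.45]
n_e = [120.0, 95.0, None, 60.0, 150.0]
tau = kendall_tau_b(delta_plus, n_e)
```

The quadratic pair loop is adequate for the small number of areas per dataset; for large inputs a library implementation would be preferable.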

We tested all results against two KBA criteria, A1b (> 1% of the global distinct genetic diversity occurs at this site) and B1 (> 10% of the global distinct genetic diversity occurs at this site). For each of the six methods, the proportion of distinct genetic diversity at a location was calculated as its value divided by the sum over all locations, as proposed by the KBA standard for AMOVA results. Areas lacking N_e were assigned the median of the remaining areas so that the KBA criteria could be applied without inflating the proportions of the remaining sites.
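The threshold test can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the R code from the repository: the function name `kba_flags` and the dictionary layout are hypothetical, `None` stands for an area where the metric could not be calculated, and the metric values are assumed to be non-negative with a positive sum.

```python
from statistics import median


def kba_flags(values, a1b=0.01, b1=0.10):
    """Return each site's share of the summed metric and its threshold flags.

    values: dict mapping site name to a metric value, or None if the metric
    could not be calculated for that site.
    """
    observed = [v for v in values.values() if v is not None]
    fill = median(observed)  # missing sites get the median of the rest
    filled = {site: fill if v is None else v for site, v in values.items()}
    total = sum(filled.values())
    flags = {}
    for site, value in filled.items():
        share = value / total
        flags[site] = {"proportion": share, "A1b": share > a1b, "B1": share > b1}
    return flags


# Hypothetical N_e values for four sites; s3 could not be estimated.
example = kba_flags({"s1": 50, "s2": 30, "s3": None, "s4": 5})
```

Filling a missing site with the median keeps it in the denominator, so the shares of the other sites are not raised simply because one value is absent.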

To illustrate the results and coverage of different genetic clusters, Structure analyses were conducted in addition to the calculation of the five metrics for two case studies: the Chinook salmon (Oncorhynchus tshawytscha; Gomez-Uchida et al. 2019) as well as a hitherto unpublished dataset of the Tenerife Short-winged Bush-cricket (Ariagona margaritae Kraus, 1892). The site selection was based upon KBA criterion B1. The results were processed in R using several packages. The maps were created using ArcGIS Pro.

To create the Tenerife Short-winged Bush-cricket dataset, specimens were collected between 2010 and 2023 on Tenerife and El Hierro. DNA was extracted using the Qiagen DNeasy® Blood & Tissue kit. ddRADseq libraries were prepared for paired-end sequencing on a High-Output Flow cell of an Illumina NextSeq platform (2 x 75 bp). Stacks 2.6.6 was used to demultiplex, filter, and trim raw reads to 65 bp, to create an assembly and a catalogue of loci, and finally to identify SNPs (n= 64, -p 150, -r 1). Default settings were maintained. Individuals with more than 20% missing data were removed from the analysis. The resulting dataset comprised 108 individuals and 5198 loci. No area was excluded from the analysis, as each had ≥20 sequenced individuals. The allelic overlap method was omitted for this dataset due to extensive calculation times. For N_e, the smallest possible natural number was added to all values so that every N_e became positive. Apart from that, this dataset was analyzed in the same way as the previously used datasets.
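The N_e transformation can be illustrated with a short Python sketch. It reflects one reading of the sentence above, namely that the same natural number is added to every estimate and that it is the smallest one making the lowest estimate strictly positive; the function name `shift_to_positive` and the example values are hypothetical.

```python
from math import floor


def shift_to_positive(ne_values):
    """Add the smallest non-negative integer that makes all values > 0.

    Returns the shift and the shifted values. If every value is already
    positive, the shift is 0 and the values are returned unchanged.
    """
    lowest = min(ne_values)
    # floor(-lowest) + 1 is the smallest integer k with lowest + k > 0.
    shift = max(0, floor(-lowest) + 1)
    return shift, [value + shift for value in ne_values]


# Hypothetical estimates including a negative value.
shift, shifted = shift_to_positive([-3.2, 10.0, 55.5])
```

Adding a constant preserves the ranking of the areas, but it does change their proportions of the total, which should be kept in mind when interpreting threshold results for N_e.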
