Mapping the missing: assessing amphibian sampling completeness and overlap with global protected areas
Data files
Mar 21, 2025 version (1.25 GB in total):
• R_dryad.zip (1.25 GB)
• README.md (14.39 KB)
Abstract
The aim of the study was to assess amphibian sampling completeness and the overlap of sampling completeness categories with natural protected areas (NPAs) and key biodiversity areas (KBAs) at global scale. We evaluated amphibian sampling completeness across six of the earth's eight biogeographic realms to identify well‐sampled, under‐sampled, and data‐gap areas in the context of global amphibian distribution. Additionally, we examined the spatial overlap of each sampling category with NPAs and KBAs. The Nearctic and Australasian realms had the highest number of records and well‐sampled areas. Significant data gaps were identified, particularly in the Afrotropical, Indo‐Malayan, Neotropical, and Palearctic realms. We found low levels of spatial match (< 35%) between classified areas and NPAs/KBAs. Amphibian distribution data are largely incomplete, with the most extensive gaps in the most species‐rich realms: Neotropic, Indo‐Malayan, and Afrotropical. The low overlap between under‐sampled and data‐gap areas with NPAs and KBAs suggests that these regions, critical for amphibian diversity, are insufficiently represented within established conservation priorities. Given the urgent threats to biodiversity from global change, rapid responses are essential to enhance our understanding of species distributions and community structures in amphibians. This study provides spatial insights to help identify key data‐gap areas for amphibian research and conservation prioritization.
This dataset is part of the Data Availability Statement of the article Mapping the Missing: Assessing Amphibian Sampling Completeness and Overlap With Global Protected Areas (DOI: 10.1002/ECE3.71137), published in Ecology and Evolution. The article provides detailed citations for each dataset used, along with a full description of the processing steps applied.
To respect the usage rights of publicly available datasets, we do not share the original files but instead provide simplified versions that retain only the essential structure required to replicate our analyses. We strongly encourage users to access the original datasets for a more comprehensive source of information.
The main datasets needed to run the available scripts are:
Global amphibian distribution data: Retrieved from the Global Biodiversity Information Facility (GBIF) and available at DOI: https://doi.org/10.15468/dl.gv57xr. We provide a filtered and simplified version that preserves only the necessary information for analysis replication. This modified version is distributed under the CC BY-NC 4.0 license, aligning with the original dataset’s licensing and our statement’s non-commercial focus.
Key Biodiversity Areas (KBAs): We do not have permission to distribute the original KBA dataset or modified versions of it. If you need the KBAs, please request the file from https://www.keybiodiversityareas.org/kba-data/request and place it in the file path indicated in the script.
Natural Protected Areas (NPAs): We do not have permission to distribute the original NPA dataset or modified versions of it. If you need the NPAs, please download the file from https://www.protectedplanet.net/en/thematic-areas/wdpa?tab=WDPA and place it in the file path indicated in the script.
Usage Guidelines:
This material contains files of various types. Ideally, all files and folders should be managed directly in R via the RStudio graphical interface by opening the project “R_manuscrito.Rproj”, which is included in the compressed file. You can download R from r-project.org and RStudio from posit.co. While most files can also be opened with other software, their compatibility depends on the file format. Below are recommendations for suitable programs to open different file types:
• .rda – R data files that should be loaded in R using the command load("file.rda"). We recommend accessing these files only within the provided scripts (scripts folder).
• .gpkg / .shp – Vector-based spatial data files that can be opened in QGIS or in R using the sf or terra packages, as demonstrated in the scripts.
• .txt / .csv – Plain text files that can be viewed with standard text editors (e.g., Gedit on Debian-based Linux systems) or office suites like LibreOffice, OpenOffice, or Microsoft Office.
• .png / .tiff – Image files that can be opened with default image viewers (e.g., GNOME Image Viewer on Debian-based Linux systems), office software, or image editors such as GIMP (gimp.org).
• .R – R script files containing executable code for the R console. It is recommended to open them in RStudio.
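Because load() restores objects under the names they were saved with, it can overwrite objects that already exist in the workspace. One cautious way to inspect an .rda file is to load it into a separate environment. The sketch below only illustrates the idea: it writes a small temporary .rda file (the object name toy_taxonomy and its contents are invented for the example) and loads it back. With the real files, the temporary path would be replaced by the path to the file in the data folder.

```r
# Create a small stand-in .rda file; the object name and contents are hypothetical.
toy_taxonomy <- data.frame(
  species = c("Species a", "Species b"),
  family  = c("Family x", "Family y")
)
rda_path <- tempfile(fileext = ".rda")
save(toy_taxonomy, file = rda_path)

# Load into a fresh environment so nothing in the global workspace is overwritten.
inspect_env <- new.env()
loaded_names <- load(rda_path, envir = inspect_env)

# load() returns the names of the restored objects.
print(loaded_names)
str(get(loaded_names[1], envir = inspect_env))
```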
For a structured workflow, we recommend installing R and RStudio, opening the project “R_manuscrito.Rproj”, and running the scripts sequentially (01, 02, 03, etc.). Each script includes detailed explanations of the analytical workflow.
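The sequential order can also be followed programmatically. The sketch below is an illustration under stated assumptions, not part of the distributed scripts: it builds a temporary folder with three tiny stand-in scripts (their names are invented), selects files that start with a two-digit prefix, sorts them, and sources them in that order. With the real project, scripts_dir would point to the scripts folder.

```r
# Stand-in for the scripts folder: three tiny scripts with numeric prefixes.
scripts_dir <- tempfile("scripts_demo")
dir.create(scripts_dir)
run_log <- character(0)
for (name in c("01_second.R", "00_first.R", "02_third.R")) {
  writeLines(sprintf('run_log <- c(run_log, "%s")', name),
             file.path(scripts_dir, name))
}

# Select files that start with a two-digit prefix and sort them by name.
script_files <- sort(list.files(scripts_dir,
                                pattern = "^[0-9]{2}_.*\\.R$",
                                full.names = TRUE))

# Source each script in order; an error in any script stops the loop.
for (f in script_files) {
  message("Running ", basename(f))
  source(f, local = TRUE)
}
print(run_log)
```

In practice it may be preferable to run the scripts one at a time, since script 00 downloads data from GBIF and the overlap analysis needs the KBA and NPA files that users must obtain themselves.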
Below, we describe the contents of the dataset:
• data: Contains the input data used as the basis for the analyses. Where these data are made available, they are modified and simplified versions intended solely for running the provided scripts. In each case, the user should refer to the article mentioned at the beginning of this README to access the complete datasets through the citations.
• output: Refers to the processed data generated through the application of the different scripts. These contain the results presented in the article mentioned at the beginning.
• scripts: Contains the scripts in R language used to process the data (data) and obtain the results (output).
Detailed description of the dataset:
data:
• Amphinom: This file contains the taxonomic database for amphibian species, sourced from the AmphiNom package: Liedtke (2019). AmphiNom: an amphibian systematics tool. Systematics and Biodiversity, 17(1), 1-6. The file is in .rda format.
• Borders: A shapefile and a .gpkg outlining the continents, primarily used for mapping purposes.
• Donerstein_2017: This .gpkg file includes the biogeographic domains defined by Olson et al. (2001) and Dinerstein et al. (2017). The original dataset can be accessed at https://doi.org/10.1093/biosci/bix014.
• extent: The biogeographic extent used in this study, described in detail in the methodology section of the article referenced at the beginning of this README. The format is .gpkg.
• Holt_2013: This .gpkg contains the biogeographic domains from Holt et al. (2013), available at https://doi.org/10.1126/science.1228282.
• IUCN: A modified and simplified .gpkg version of the spatial outline created by overlapping individual amphibian species polygons with IUCN spatial data. This file is used exclusively to define the study extent.
• KBAs and NPAs: These folders are empty except for a README file containing instructions for downloading the Key Biodiversity Areas and Natural Protected Areas datasets. The README also specifies the directory where these polygon files should be stored, as we do not have permission to distribute the original data.
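Before running the analyses that depend on KBAs or NPAs, it can help to confirm that the downloaded files are actually in place. The following is a minimal sketch, assuming the two folders contain only a README until the user adds data. The helper has_user_data(), the temporary folder layout, and the file name protected_areas.gpkg are hypothetical and used only for the demonstration.

```r
# Returns TRUE when a folder holds at least one file other than a README.
has_user_data <- function(folder) {
  files <- list.files(folder, recursive = TRUE)
  any(!grepl("^readme", basename(files), ignore.case = TRUE))
}

# Demo with temporary stand-ins for the KBAs and NPAs folders.
demo_root <- tempfile("data_demo")
kba_dir <- file.path(demo_root, "KBAs")
npa_dir <- file.path(demo_root, "NPAs")
dir.create(kba_dir, recursive = TRUE)
dir.create(npa_dir, recursive = TRUE)
writeLines("download instructions", file.path(kba_dir, "README.txt"))
writeLines("download instructions", file.path(npa_dir, "README.txt"))

# Simulate a user-supplied file in the NPAs folder (hypothetical file name).
invisible(file.create(file.path(npa_dir, "protected_areas.gpkg")))

status <- c(KBAs = has_user_data(kba_dir), NPAs = has_user_data(npa_dir))
print(status)
```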
output:
• Biogeographic_prop: Results from the comparison of the biogeographic frameworks proposed by Holt et al. (2013) and Dinerstein et al. (2017). These findings are discussed in detail in the supplementary material of the article cited at the beginning of this README. This folder contains two .png files illustrating an Analysis of Similarities (ANOSIM) and a Cluster Analysis, comparing different biogeographic classifications. To fully grasp the analytical process and the results, we recommend reviewing the script "06_choose_biogeographic.R".
• completeness: Contains the results of the sampling completeness analysis, which is examined and discussed in depth in the article cited at the beginning of this README. This folder contains two identical images, available in .png and .tiff formats, depicting variation in sampling completeness across different biogeographic realms. Additionally, it includes three .csv tables with the following variables:
◦ estimators2.csv / se2.csv:
▪ Area: Hexagon identifier
▪ Records: Number of records within each hexagon
▪ Observed.richness: Observed species richness
▪ Richness: Estimated species richness
▪ Slope: Slope of the species accumulation curve
▪ Completeness: Estimated completeness
▪ Ratio: Ratio of records to species richness
▪ SE: Standard error of the richness estimate
▪ R²: Coefficient of determination of the accumulation curve
◦ species_per_site2.csv:
▪ Species name
▪ Longitude (X-coordinate)
▪ Latitude (Y-coordinate)
▪ Number of records per species
To fully understand the logic behind these files and their generation process, refer to the script “03_Completeness_estimation.R”.
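To make the relationship between these variables more concrete, the sketch below builds a small synthetic table with the column names listed above and derives Completeness and Ratio from the other columns. The definitions used (completeness as observed richness over estimated richness, expressed as a percentage, and ratio as records per observed species) are assumptions that should be checked against script 03. The classification thresholds are placeholders for illustration only. The criteria actually applied are described in the article and in the scripts.

```r
# Synthetic rows using the column names described above; all values are invented.
est <- data.frame(
  Area = c("hex_1", "hex_2", "hex_3"),
  Records = c(400, 60, 5),
  Observed.richness = c(20, 15, 4),
  Richness = c(21, 25, 16),
  Slope = c(0.01, 0.15, 0.60)
)

# Assumed definitions (verify against script 03 before relying on them):
# completeness = observed / estimated richness, as a percentage;
# ratio = number of records per observed species.
est$Completeness <- 100 * est$Observed.richness / est$Richness
est$Ratio <- est$Records / est$Observed.richness

# Placeholder thresholds, passed as arguments so they are easy to change.
classify_hexagon <- function(completeness, slope, ratio,
                             min_completeness = 90,
                             max_slope = 0.02,
                             min_ratio = 15) {
  ifelse(completeness >= min_completeness &
           slope <= max_slope &
           ratio >= min_ratio,
         "well_sampled", "under_sampled")
}

est$category <- classify_hexagon(est$Completeness, est$Slope, est$Ratio)
print(est[, c("Area", "Completeness", "Ratio", "category")])
```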
• effect_size: Stores the results of the analysis assessing the impact of hexagon size on completeness estimates. These results are thoroughly discussed in the supplementary material of the article cited at the beginning of this README. This folder contains three subfolders and an image. Each subfolder includes tables whose variables follow the same interpretation as those in the "completeness" folder described earlier. The .png image illustrates the effect of hexagon size on species richness and sampling completeness. To understand the data generation process, refer to the script "05_effect_size.R".
• hexagons: A dataset of hexagons generated based on our defined extent, serving as the foundation for subsequent analyses. All files in this folder are in .gpkg format. Their origin and generation process are thoroughly explained in the script "03_Completeness_estimation.R".
• maps: Includes the maps generated and presented as part of the results in the article cited at the beginning of this README. This folder contains image files in .tiff and .png formats, displaying the manuscript results. To understand their origin and the process behind their creation, please refer to the scripts "03_Completeness_estimation.R" and "07_maps.R".
• occurrence_records: A modified version of occurrence data retrieved from GBIF, along with various sub-versions used throughout the analysis. This folder contains all the species tables used in the analysis, in both table format (.csv) and spatial format (.gpkg). Below is a description of the variables in each table:
◦ data_KnowBR_format.csv – Species: Species name, Longitude: x-coordinates, Latitude: y-coordinates, Counts: number of independent records (see script "03_Completeness_estimation.R").
◦ Gbif_records_00.csv – Simplified table of GBIF records. Species: species name, Genus: genus, SpecificEpithet: specific epithet, Family: family, Order: order, Class: class, GbifID: GBIF record ID, DecimalLatitude: y-coordinates, DecimalLongitude: x-coordinates, Country: country of the record, Year: year of the record, VerbatimEventDate: record date, IndividualCount: individual count, IUCNRedListCategory: IUCN species category (see "00_Download_GBIF_records.R").
◦ Gbif_records_01.csv – Simplified and cleaned GBIF records table, with the same interpretation as the previous table (see "01_Clean_records.R").
◦ Gbif_records_02_full.csv / Gbif_records_02_reduced.csv – GBIF records with updated and curated taxonomy, with the same interpretation as the previous table (see "02_Update_the_taxonomy_of_records.R").
◦ sinonimos.csv / sinonimos_updated.csv – Contain taxonomic updates of the species used in the analysis. Query: species name in GBIF, Stripped: encoded name, Status: taxonomic status of the name, Warnings: warnings, ASW_names: species name in Amphibian Species of the World (see "02_Update_the_taxonomy_of_records.R").
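The step from a record-level GBIF table to the four-column layout of data_KnowBR_format.csv can be sketched as follows. This is only an illustration with invented records: it counts rows per unique combination of species and coordinates, whereas the authors' definition of an independent record is the one implemented in script "03_Completeness_estimation.R".

```r
# Invented record-level table using the GBIF column names described above.
gbif <- data.frame(
  Species = c("Rana temporaria", "Rana temporaria", "Bufo bufo", "Rana temporaria"),
  DecimalLongitude = c(10.5, 10.5, 2.1, 11.0),
  DecimalLatitude = c(50.2, 50.2, 41.4, 50.2)
)

# Each row counts as one record; sum the rows per species and coordinate pair.
gbif$Counts <- 1L
knowbr <- aggregate(Counts ~ Species + DecimalLongitude + DecimalLatitude,
                    data = gbif, FUN = sum)

# Rename the columns to the Species / Longitude / Latitude / Counts layout.
names(knowbr) <- c("Species", "Longitude", "Latitude", "Counts")
knowbr <- knowbr[order(knowbr$Species, knowbr$Longitude), ]
rownames(knowbr) <- NULL
print(knowbr)
```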
• tables: The tables summarizing key results, as presented and discussed in the article cited at the beginning of this README. This folder contains three .csv tables, corresponding to the tables presented in the manuscript. Below is a description of the variables in each dataset:
◦ tab2.csv – This table provides information on sampling completeness per biogeographic domain. Variables include realm (biogeographic domain), area (total domain area), well_sampled (well-sampled area), wsp (percentage of the domain classified as well-sampled), under_sampled (under-sampled area), usp (percentage of the domain classified as under-sampled), no_inf (area with information gaps), and nip (percentage of the domain classified as having information gaps). All areas are expressed in million hectares (Mha).
◦ tab3.csv – This table focuses on the overlap between sampling completeness categories and natural protected areas (NPAs). It includes realm (biogeographic domain), npa.area (total area of NPAs within the domain), well_sampled_area (total well-sampled area), well_sampled_ov (overlapping area between NPAs and well-sampled areas), wsp (percentage of well-sampled areas overlapping with NPAs), under_sampled_area (total under-sampled area), under_sampled_ov (overlapping area between NPAs and under-sampled areas), usp (percentage of under-sampled areas overlapping with NPAs), dd (total data-deficient area), ddov (overlapping area between NPAs and data-deficient areas), and ddp (percentage of data-deficient areas overlapping with NPAs). All areas are in million hectares (Mha).
◦ tab4.csv – This table examines the overlap between sampling completeness categories and Key Biodiversity Areas (KBAs). Variables include realm (biogeographic domain), kba.area (total area of KBAs), well_sampled_ov.kba (overlapping area between well-sampled areas and KBAs), wsp.kba (percentage of well-sampled areas overlapping with KBAs), under_sampled_ov.kba (overlapping area between under-sampled areas and KBAs), usp.kba (percentage of under-sampled areas overlapping with KBAs), ddov.kba (overlapping area between data-deficient areas and KBAs), and ddp.kba (percentage of data-deficient areas overlapping with KBAs). All areas are in million hectares (Mha).
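The percentage columns in these tables can be cross-checked against the area columns. The sketch below uses a synthetic table with some of the tab3.csv column names and invented values. It assumes that the percentage is the overlapping area divided by the total area of the corresponding category, which matches the variable descriptions above but should be confirmed in the scripts. The helper pct_overlap() is hypothetical.

```r
# Percentage of a category's area that overlaps with conservation areas.
pct_overlap <- function(overlap_area, total_area, digits = 1) {
  round(100 * overlap_area / total_area, digits)
}

# Synthetic table with tab3.csv column names; areas in Mha, values invented.
tab3 <- data.frame(
  realm = c("Realm A", "Realm B"),
  well_sampled_area = c(200, 50),
  well_sampled_ov = c(40, 5)
)

# Recompute the percentage of well-sampled area that falls inside NPAs.
tab3$wsp_check <- pct_overlap(tab3$well_sampled_ov, tab3$well_sampled_area)
print(tab3)
```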
scripts:
• 00 - Download GBIF records: Downloads global amphibian occurrence data from GBIF.
• 01 - Clean records: Cleans and preprocesses species occurrence records retrieved from GBIF.
• 02 - Update taxonomy: Updates species taxonomy following the classification of the American Museum of Natural History (Frost, 2023).
• 03 - Completeness estimation: Computes amphibian sampling completeness at the global scale using hexagonal grids.
• 04 - Human population and sampling completeness correlation: Assesses the relationship between human population density and sampling completeness by analyzing the mean population per hexagon and its associated completeness value.
• 05 - Effect size analysis: Evaluates the impact of grid cell size on completeness estimates.
• 06 - Biogeographic framework selection: Compares the biogeographic classifications proposed by Holt et al. (2013) and Dinerstein et al. (2017).
• 07 - Map generation: Produces the maps presented in the article.
• 08 - Overlap with conservation areas: Calculates the spatial overlap between the sampling completeness categories and NPAs (Natural Protected Areas) and KBAs (Key Biodiversity Areas).
• 09 - Cartogram maps: Generates area-distorted cartogram representations.
For any inquiries about the use of these data and code, please contact the authors of the article.