Data from: The invisible species: Big data unveil coverage gaps in the Atlantic forest hotspot
Data files
Oct 02, 2025 version files 451.12 MB
- Data_and_scripts.zip (451.12 MB)
- README.md (3.29 KB)
Abstract
Rapid technological advancements and the biodiversity crisis have motivated efforts to document species before their extinction. However, taxonomic coverage gaps, where certain species are underrepresented in biodiversity databases, can distort our understanding of ecosystems. This dataset includes the data we used to quantify how many plant species within a biodiversity hotspot are "invisible," meaning they are excluded from studies due to insufficient occurrence data. Additionally, we identified factors influencing the invisibility of species.
We downloaded and filtered occurrence data from 15,010 plant species from online biodiversity databases. We utilized multiple thresholds, each representing a minimum required number of records, to classify species as “invisible” if their record count fell below these thresholds. We fitted logistic models to estimate how factors such as life form, presence of a vernacular name, geographical distribution, endemism, and year of taxonomic publication influence the odds of species exclusion.
The proportion of invisible species ranged from 14% when employing simple tools requiring just three records to as high as 64% with more demanding tools requiring at least 60 records. Species with specific characteristics are more prone to invisibility, including non-tree species, species without vernacular names, species with restricted distributions within the Atlantic Forest, endemic species, and species with names published more recently. A significant portion of these invisible species is distributed along the coastline. In contrast, the continental portion of the biome exhibits fewer taxonomic coverage gaps among known species, most likely due to lower rates of new species descriptions. Coverage gaps are shaped by the interaction of biological traits, societal preferences, limited technical support, and human activities.
https://doi.org/10.5061/dryad.866t1g1x7
Data_and_scripts.zip: This archive contains all the data and reproducible R scripts needed to generate the results and figures associated with the paper.
Description of the data and file structure
Scripts: This folder contains reproducible R scripts used to perform the analyses and generate the figures presented in the paper.
- GLM.R – Fits a logistic regression model to estimate the drivers of species invisibility.
- Plot species features.R – Plots the total number of plant species by life form, presence of vernacular names, geographical distribution, endemism, and year of publication of the species name.
- Plot maps.R – Produces maps of vegetation types within the Atlantic Forest and shows the percentage of invisible species in each region.
- Forest plot of GLM.R – Generates a forest plot displaying the results of the logistic regression model.
- Supplementary material.R – Compiles and organizes supplementary material, including the most common issues encountered during occurrence record filtering, the number of records across the Neotropics, and a summary table of logistic model statistics.
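The kind of analysis GLM.R is described as performing can be illustrated with a minimal sketch: classify species as invisible under several record-count thresholds, then fit a logistic regression with base R's glm(). This is not the authors' script. The data are simulated, the column names (n_records, life_form, has_vernacular) are hypothetical, and the intermediate threshold of 20 is an assumption; only the thresholds of 3 and 60 appear in the abstract.

```r
# Minimal sketch on simulated data; this is NOT the code in GLM.R.
set.seed(1)

n <- 500
species <- data.frame(
  life_form      = factor(sample(c("tree", "non_tree"), n, replace = TRUE)),
  has_vernacular = sample(c(TRUE, FALSE), n, replace = TRUE)
)

# Simulated record counts: in this toy example, trees and species with a
# vernacular name tend to have more records.
mu <- exp(2 + 1.0 * (species$life_form == "tree") + 0.8 * species$has_vernacular)
species$n_records <- rnbinom(n, size = 0.8, mu = mu)

# A species is "invisible" at a threshold when it has fewer records than
# the minimum that threshold requires.
thresholds <- c(3, 20, 60)  # 20 is an assumed intermediate value
invisible_share <- sapply(thresholds, function(k) mean(species$n_records < k))
names(invisible_share) <- paste0("min_", thresholds)
print(round(invisible_share, 2))

# Logistic regression for a single threshold (here 20 records).
species$invisible <- species$n_records < 20
fit <- glm(invisible ~ life_form + has_vernacular,
           data = species, family = binomial())

# Exponentiated coefficients are odds ratios for being invisible.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

Because a higher threshold can only reclassify more species as invisible, the shares in invisible_share never decrease as the threshold grows. The objects stored in Data/Models.RDS hold the models actually fitted for the paper.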
SpatialFiles: This folder contains spatial files (GeoPackage and TIFF formats) required for running the analyses and reproducing the figures presented in the paper.
- AF_dissolved.gpkg – Polygon of the Atlantic Forest boundaries.
- AtlanticForestVegetationEnglishbyState.gpkg – Polygons of vegetation types (IBGE, 2006) within the Atlantic Forest, subdivided by Federal States.
- AtlanticForestVegetationSimplified.gpkg – Polygons of vegetation types (IBGE, 2006) within the Atlantic Forest (simplified version).
- brasil2021_cover_percent.tiff – Raster representing the percentage of natural vegetation cover across the Atlantic Forest (2021).
- EcoRegions.gpkg – Polygons of ecoregions (Olson, 2001) within the Atlantic Forest.
- Neotropics.gpkg – Polygon of the Neotropical Region.
- Raster_base.tiff – Base raster file used as reference for spatial analyses.
- South_America.gpkg – Polygon of South America.
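As a hedged example of reading these layers in R, the sketch below loads one GeoPackage and one raster with the terra package. It assumes the archive was extracted so that a SpatialFiles folder sits in the working directory and that terra is installed; the path variable names are illustrative.

```r
# Sketch only: assumes the archive was extracted into the working directory
# and that the terra package is installed.
library(terra)

spatial_dir <- "SpatialFiles"  # assumed location of the extracted folder

af_path    <- file.path(spatial_dir, "AF_dissolved.gpkg")
cover_path <- file.path(spatial_dir, "brasil2021_cover_percent.tiff")

if (file.exists(af_path) && file.exists(cover_path)) {
  af    <- vect(af_path)     # polygon layer read as a SpatVector
  cover <- rast(cover_path)  # raster read as a SpatRaster

  # Printing shows geometry type, extent, and coordinate reference system.
  print(af)
  print(cover)

  # Quick visual check: raster with the boundary drawn on top. If the two
  # layers use different coordinate reference systems, reproject one first.
  plot(cover)
  lines(af)
} else {
  message("Spatial files not found under '", spatial_dir, "'; adjust the path.")
}
```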
Data: This folder contains the main results:
- Models.RDS: R objects encompassing data and results of the GLM models. It can be loaded in R using the function readRDS().
- SpeciesData.gz: A comprehensive list of all species studied, including information regarding their characteristics (e.g., life form, endemism, distribution, etc.) and the number of available records. The file is a gzip file and can be loaded in R using the function data.table::fread("Data/SpeciesData.gz").
- Final_Occurrence_data.gz: The filtered occurrence records. The file is a gzip file and can be loaded in R using the function data.table::fread("Data/Final_Occurrence_data.gz").
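The loading calls mentioned above can be combined into a short session, sketched here under the assumption that the archive was extracted so that the Data folder sits in the working directory. The column names and the structure of the stored objects are not documented here, so the sketch only inspects them. Reading .gz files directly with fread() also requires the R.utils package.

```r
# Sketch only: assumes the archive was extracted into the working directory
# and that data.table (plus R.utils, for .gz input) is installed.
library(data.table)

models_path     <- file.path("Data", "Models.RDS")
species_path    <- file.path("Data", "SpeciesData.gz")
occurrence_path <- file.path("Data", "Final_Occurrence_data.gz")

if (file.exists(models_path)) {
  models <- readRDS(models_path)
  str(models, max.level = 1)  # inspect the top-level structure only
}

if (file.exists(species_path)) {
  species <- fread(species_path)
  print(dim(species))
  print(names(species))
}

if (file.exists(occurrence_path)) {
  # Preview the first rows before loading the full occurrence table.
  occ_preview <- fread(occurrence_path, nrows = 1000)
  print(names(occ_preview))
}
```

Previewing with nrows is a convenience choice for a first look; dropping that argument reads the full table.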
Code/Software
- Spatial files (TIFF and GPKG) can be opened with any standard geospatial software (e.g., QGIS) or imported into R using the terra package (rast() and vect() functions).
- Compressed files (.gz) can be imported into R using the fread() function from the data.table package.
- Trindade, Weverton C. F.; Marques, Márcia C. M. (2024). The Invisible Species: Big Data Unveil Coverage Gaps in the Atlantic Forest Hotspot. Diversity and Distributions. https://doi.org/10.1111/ddi.13931
