Global patterns of taxonomic uncertainty and its impacts on biodiversity research

Guedes, Jhonny J. M.1 ; Moura, Mario R.2 ; Jardim, Lucas3 ; Diniz-Filho, José Alexandre F.1

Published Feb 20, 2025 on Dryad. https://doi.org/10.5061/dryad.5x69p8d9z

Data files

Feb 20, 2025 version files 85.26 MB

Abstract

Over two million species have been named so far, but many will be invalidated due to redundant descriptions. Undetected invalid species (i.e., synonyms) can impair inferences we make in biodiversity research and hamper the implementation of effective conservation strategies. However, the processes leading to the accumulation of invalid names remain largely unknown. Using multi-model inferences, we investigated the patterns and potential drivers of species- and assemblage-level variation in synonym counts across terrestrial vertebrates globally. We also explored how taxonomic uncertainty (i.e., instability in species identities) can affect latitudinal variation of diversification rates. The average number of synonyms was higher for species described earlier, better represented in scientific collections, with larger geographic ranges, occurring in temperate regions, and in areas of high biodiversity attention. In assemblage-level models, a higher average number of synonyms was associated with temperate regions harbouring more early-described species. Areas of high endemism richness showed fewer synonyms across amphibians and reptiles but had an inverse effect for birds and mammals. Other predictor-response relationships varied across taxonomic groups, biogeographical realm, and spatial grain. Assuming that more synonyms indicate more stable species that have been thoroughly studied and reviewed, high synonym numbers in temperate species and assemblages support claims of a potential latitudinal taxonomy gradient, where geographic variation in taxonomic practice could hinder the proper recognition of tropical species. We show that the accumulation of invalid names is not random and discuss how invalid hidden names can affect biodiversity inferences. 
A potential approach to address this problem would be developing a taxonomic uncertainty metric that could be incorporated into models (i.e., as weights to account for varying degrees of uncertainty during the fitting process). Our study provides an initial approximation and highlights the often-neglected issue of uncertainty and instability in species identities from a macroecological perspective.


There are 10 files associated with this paper:

(1) our raw data, containing some predictors used in cross-species models (AdditionalVars.csv); (2) marine taxa as per the IUCN (marine_spp_IUCN.csv); (3) synonym information for currently valid species stored in a long format, including their description dates (Tetrapoda_BetaTaxonomy.csv); (4) phylogenetic trees for amphibians, birds, mammals, and reptiles stored in a zip file (phylo_trees.zip); (5) outputs from phylogenetic autocorrelation analyses (phylocorr_outs.zip); (6-7) spatial data (including predictor and response variables) needed for the assemblage-level analyses and mapping spatial variation in synonym counts (SpatialDataset.Rdata & gridShps.Rdata); (8) the R code to replicate the analyses performed in this work (Rcode_BetaTaxonomy.R); (9) additional R scripts needed for the exercise on diversification rates (diversification_scripts.zip); (10) the supporting information (figures and tables) associated with our study (Supporting_information.pdf).

Description of the data and file structure

(1) AdditionalVars.csv: contains 5 columns, each one explained below:

- Scientific.Name: Scientific name of each species included in the most fully-sampled phylogenies for tetrapods.

- Synonyms: Known unique synonyms for each species according to the Catalogue of Life checklist.

- N_synonyms: Number of unique synonyms for each species.

- N_authors_per_taxa: Number of authors in the taxonomic family in the year of the species description (see main text for detailed explanation).

- N_specimens: Number of preserved specimens per species according to the GBIF database.

(2) marine_spp_IUCN.csv: contains 2 columns (see below) and lists the species classified as marine by the International Union for Conservation of Nature (IUCN). These data are used to remove marine tetrapods, keeping only terrestrial vertebrates in subsequent analyses.

- scientificName: scientific name of each tetrapod species classified as marine by the IUCN.

- authority: authors of species description.

(3) Tetrapoda_BetaTaxonomy.csv: this file is derived from the Catalogue of Life checklist and contains 6 columns, each one explained below:

- parent_species: Scientific name of each species included in the most fully-sampled phylogenies for tetrapods.

- scientificName: The scientific name of a taxon.

- status: Informs whether a given scientific name represents an accepted (valid) species or a synonym (invalid).

- rank: Informs whether a given scientific name represents a species, subspecies, or infraspecific name.

- class: Informs the taxonomic class that each species belongs to.

- year: Informs the year of description of each scientific name, when available.
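As a minimal illustration of how the long format described above can be collapsed into per-species synonym counts, the Python sketch below tallies unique synonym names per parent_species. The authors' own workflow is written in R; this is only a sketch. The status labels ("accepted", "synonym") and the placeholder taxon names and years are assumptions made for the example and may not match the values stored in the file.

```python
import csv
import io

# Hypothetical excerpt mimicking the six columns described above.
# Taxon names, years, and status labels are placeholders, not real records.
SAMPLE = """parent_species,scientificName,status,rank,class,year
Genus alpha,Genus alpha,accepted,species,Amphibia,1900
Genus alpha,Genus alphoides,synonym,species,Amphibia,1905
Genus alpha,Othergenus alpha,synonym,species,Amphibia,1910
Genus beta,Genus beta,accepted,species,Amphibia,1950
"""


def count_synonyms(rows, synonym_label="synonym"):
    """Return {parent_species: number of unique synonym names}."""
    names = {}
    for row in rows:
        # Register every parent species, so species without synonyms get 0.
        synonyms = names.setdefault(row["parent_species"], set())
        if row["status"] == synonym_label:
            synonyms.add(row["scientificName"])
    return {species: len(syns) for species, syns in names.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
synonym_counts = count_synonyms(rows)
print(synonym_counts)
```

To run this against the real file, replace io.StringIO(SAMPLE) with an open file handle and check the actual status labels first.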

(4) phylo_trees.zip: contains 4 files, which represent 100 fully-sampled phylogenies for amphibians, birds, mammals and reptiles. All species in each phylogeny match the species in Scientific.Name. These trees are needed for checking phylogenetic autocorrelation in model residuals. Data for amphibians, birds, and mammals are available in Nexus format, while data for reptiles are stored in Rdata format. Nexus files are primarily designed for use with specialised software, but they are plain text files and can be opened and edited using standard text editors. Here, we open and explore both Nexus and Rdata files in R using the R code (Rcode_BetaTaxonomy.R) provided along with the other files.

(5) phylocorr_outs.zip: contains 28 Rdata files, which represent the results of the phylogenetic autocorrelation analyses across each class-realm combination. As these analyses may take too long to run on personal computers, we have made these outputs available so that readers can load them to understand the data structure and recreate the related plots. Rdata files can be read with the software R.

(6) SpatialDataset.Rdata: contains datasets with predictor and response variables for assemblage-level analyses at 3 spatial grains (110, 220, and 440 km). Each dataset has 14 columns, each one explained below:

- TreeTaxon: Informs the taxonomic class of each species, which matches the species included in the TetrapodTraits dataset, and therefore, the more recent fully-sampled phylogenies for tetrapods (see Methods).

- Cell_Id110: Grid cell's identifiers.

- Long: Longitude of each grid cell (centroid).

- Lat: Latitude of each grid cell (centroid).

- nSpp: Species richness, representing the number of species whose range intersected each grid cell.

- Syn_counts: Total number of synonyms per grid cell.

- AvgSynCounts: Average number of synonyms per grid cell (total synonym counts divided by species richness).

- EndRichness: Endemism richness as the sum of the inverse of species range sizes per grid cell.

- DiscYear: Median description year based on all species per grid cell.

- N_specimens: The number of preserved specimens per grid cell based on data from GBIF.

- elev: Median elevation value per grid cell.

- ETA50K: Median time to travel to cities with more than 50,000 inhabitants per grid cell.

- GDP: Median per capita gross domestic product.

- N_biodInst: The number of biodiversity institutions (i.e., herbaria, museums, and universities) per grid cell.
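The relationship between nSpp, Syn_counts, AvgSynCounts, and DiscYear described above can be sketched in a few lines of Python. This is an illustrative sketch only (the authors' analyses are in R); the species records, the field names n_syn, year, and cells, and the cell identifiers are hypothetical.

```python
from statistics import median

# Hypothetical species records: synonym count, description year, and the
# identifiers of the grid cells that each species' range intersects.
species = {
    "sp_A": {"n_syn": 2, "year": 1800, "cells": {1, 2}},
    "sp_B": {"n_syn": 0, "year": 1900, "cells": {2}},
}


def summarise_cells(species):
    """Aggregate species-level values into per-cell assemblage summaries."""
    per_cell = {}
    for record in species.values():
        for cell in record["cells"]:
            per_cell.setdefault(cell, []).append(record)

    summary = {}
    for cell, records in per_cell.items():
        n_spp = len(records)                        # species richness
        syn_total = sum(r["n_syn"] for r in records)
        summary[cell] = {
            "nSpp": n_spp,
            "Syn_counts": syn_total,
            "AvgSynCounts": syn_total / n_spp,      # total synonyms / richness
            "DiscYear": median(r["year"] for r in records),
        }
    return summary


cells = summarise_cells(species)
print(cells)
```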

(7) gridShps.Rdata: contains the grid cell shapefiles at 110, 220, and 440 km spatial resolutions. Each shapefile contains the grid cells' unique identifiers (Cell_Id), coordinate data (Long and Lat), and a geometry column needed for mapping.

(8) Rcode_BetaTaxonomy.R: this file contains the R code used to perform all analyses in our work and to create the figures in the main text and supplementary material. The script is commented by the authors and structured step by step, and it lists the required packages with their respective versions.

(9) diversification_scripts.zip: contains 10 additional R scripts needed for running the diversification rates exercise across each taxonomic class. You need to create a new folder named 'diversification_scripts' and place the downloaded scripts in it. This folder should be created within the 'diversification' folder, which will be automatically created in step 15 of the Rcode_BetaTaxonomy.R script.

(10) Supporting_information.pdf: contains 16 supplementary figures and five supplementary tables associated with our study.

Code/Software

All analyses were performed using the software R version 4.3.1.

The R-code has 18 steps:

1. Explore the beta taxonomy of terrestrial vertebrates.

2. Load and understand cross-species data, then explore the response and predictor variables.

3. Prepare cross-species variables that will be used, check for multicollinearity and standardise.

4. Model synonym counts per species using generalised linear models (GLMs).

5. Check whether GLM residuals show phylogenetic autocorrelation.

6. Create a plot with model coefficients and confidence intervals.

7. Prepare assemblage-level variables that will be used, check for multicollinearity and standardise.

8. Plot the average number of synonyms per grid cell.

9. Model the average number of synonyms per assemblage using generalised linear models (GLMs).

10. Check whether GLM residuals from assemblage models show spatial autocorrelation.

11. Compute 'residuals autocovariates' to account for spatial autocorrelation.

12. Model assemblage-level data using spatial GLMs (SAR-type models).

13. Check spatial autocorrelation of residuals from spatial SAR-type models.

14. Create a plot with model coefficients from SAR-type models.

15. Impact of synonyms on diversification rates across regions and taxa.

16. Plot results from the impacts of synonyms on diversification rates.

17. Impacts of synonyms on diversification rates using a Yule process.

18. Sensitivity analysis -- adding currently valid subspecies as synonyms.

Usage notes

Software R is required to open the Rcode_BetaTaxonomy.R file as well as for reading phylogenetic trees available in the phylo_trees.zip file (note that you will need a ZIP extractor to extract the phylogenies first) and the Rdata files within phylocorr_outs.zip (you also need to extract the files first to your working directory/folder).

Microsoft Excel can be used to view any .csv file.

Synonym Data

We extracted the number of synonyms per species from the Catalogue of Life (CoL) Checklist (https://www.checklistbank.org/dataset/286246/download; Bánki et al., 2024) on 20 Feb 2024. The taxonomic backbone of the CoL for amphibians, birds, and mammals was based on the Integrated Taxonomic Information System (ITIS; https://itis.gov/), while data for reptiles were from the Reptile Database (http://www.reptile-database.org; Uetz and Hošek, 2021). The CoL dataset included columns for rank and status, with the former informing if a scientific name represented, for example, a species or subspecies, and the latter indicating whether names were accepted, ambiguous, or unique synonyms. Using these columns, we filtered our dataset to include only accepted (i.e., valid) species and unique synonyms (i.e., invalid names). Consequently, currently valid subspecies were not analysed, as they are classified neither as full species nor as synonyms. However, we conducted a sensitivity analysis for cross-species models, treating valid subspecies as synonyms to assess the robustness of our main conclusions amid significant taxonomic changes.

Taxonomic Data

We followed the taxonomy employed in fully-sampled phylogenies available for amphibians (Jetz and Pyron, 2018), turtles and crocodiles (Colston et al., 2020), squamates (Tonini et al., 2016), birds (Jetz et al., 2012), and mammals (Upham, Esselstyn and Jetz, 2019) to incorporate species’ phylogenetic relationships into analyses and to take advantage of a recently available trait database for the world’s tetrapods (Moura et al., 2024). The taxonomies we followed mostly include species described before 2016; we note that taxa described after that date may not yet have been targeted by revisionary work that could lead to the accumulation of synonyms.

We merged the information from the CoL with data from TetrapodTraits using fuzzy matching to check for and fix mismatches in species names between datasets due to spelling errors. For this, we used the fuzzyjoin package (Robinson, 2020) in R version 4.3.1 (R Core Team, 2023). Checks were initially performed between valid species names in both datasets, then between valid species in TetrapodTraits and unique synonyms (i.e., names that can be traced back to a single valid species) in the CoL database. After merging operations, we found synonym information for 31,175 (93.7%) species in TetrapodTraits. Our dataset initially included 33,281 species: 7,238 amphibians, 9,993 birds, 5,911 mammals, and 10,139 reptiles. However, we removed 228 marine species (as per the International Union for Conservation of Nature, IUCN) and focused solely on terrestrial vertebrates (n = 33,049).
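The two-stage name check described above (exact match first, then an approximate match to catch spelling variants) can be illustrated with Python's standard difflib module. This is a sketch of the general idea, not the authors' fuzzyjoin workflow; the similarity cutoff of 0.9 and all taxon names are assumptions chosen for the example.

```python
from difflib import get_close_matches


def match_names(query_names, reference_names, cutoff=0.9):
    """Map each query name to a reference name, or None if nothing is close.

    Exact matches are accepted first; otherwise the single most similar
    reference name is used, provided its similarity ratio reaches `cutoff`.
    """
    matches = {}
    for name in query_names:
        if name in reference_names:
            matches[name] = name
            continue
        close = get_close_matches(name, reference_names, n=1, cutoff=cutoff)
        matches[name] = close[0] if close else None
    return matches


# Placeholder names: one exact match, one misspelling, one unmatched name.
reference = ["Genus alpha", "Genus beta", "Othergenus gamma"]
queries = ["Genus alpha", "Genus betta", "Unrelatedus delta"]

matched = match_names(queries, reference)
print(matched)
```

Names returned as None would need manual inspection, and a lower cutoff increases the risk of matching distinct but similarly spelled species.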

Species-level Covariates

Attributes related to species biology included body size and habitat use. For body size, we used body mass data for birds, mammals, and reptiles, which had on average a data coverage exceeding 95% (n = 24,758 out of 25,811 species), and body length for amphibians, which covered 97.9% (n = 7,083) of species. Habitat use data were obtained for 31,497 species (95.3%), being converted into a continuous metric of verticality (Oliveira and Scheffers, 2019), with species scored as: 0 = strictly fossorial; 0.25 = fossorial and aquatic/terrestrial; 0.5 = aquatic, or terrestrial, or aquatic/terrestrial, or fossorial/terrestrial/arboreal, or fossorial/terrestrial/aquatic/arboreal; 0.75 = terrestrial and arboreal, or aquatic and arboreal, or terrestrial and aquatic/arboreal, or terrestrial and aerial, or terrestrial and arboreal/aerial; and 1 = strictly arboreal or aerial. Sources on body size and microhabitat are available in Moura et al. (2024).

Attributes related to taxonomic practice included year of description, taxonomic activity, and knowledge of species’ evolutionary relationships. Species’ description years were based on the original publication date and were available for all but one species (the squamate Indotyphlops pushpakumara, a nomen nudum included in Tonini’s phylogeny; Tonini et al., 2016). To inform taxonomic activity per species, we selected all taxa described within the same family and year, and then tallied the total number of unique taxonomists. Because the per-family number of taxonomists is expected to increase with family richness, we divided the number of taxonomists by the number of species described per year in a given family (Moura and Jetz, 2021). We acknowledge that many authors contributing to species descriptions are not necessarily taxonomists, with the recent rise in authorship reflecting changes in taxonomic practice rather than an increase in taxonomic capacity (Boero, 2001; Rodrigues et al., 2010; Bebber et al., 2014). Despite this potential bias, the number of authors remains a useful proxy for taxonomic activity in beta taxonomic studies, highlighting the growing collaboration between taxonomists and researchers from various fields (Grieneisen et al., 2014; Poulin and Presswell, 2016). Lastly, we used a binary variable to identify tetrapod species with evolutionary relationships imputed (0 = not imputed, 1 = imputed) in fully-sampled phylogenies (Jetz et al., 2012; Tonini et al., 2016; Jetz and Pyron, 2018; Upham, Esselstyn and Jetz, 2019; Colston et al., 2020).
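The taxonomic activity measure described above (unique authors per family and year, divided by the number of species described in that family and year) can be sketched as follows. The record layout, field names, and author labels are hypothetical, and the sketch is in Python rather than the R used by the authors.

```python
from collections import defaultdict


def taxonomic_activity(descriptions):
    """Return {species: unique authors / species described, per family-year}."""
    authors = defaultdict(set)      # (family, year) -> set of author names
    n_described = defaultdict(int)  # (family, year) -> number of species

    for d in descriptions:
        key = (d["family"], d["year"])
        authors[key].update(d["authors"])
        n_described[key] += 1

    return {
        d["species"]: len(authors[(d["family"], d["year"])])
        / n_described[(d["family"], d["year"])]
        for d in descriptions
    }


# Hypothetical description records.
descriptions = [
    {"species": "sp_1", "family": "Fam", "year": 1900, "authors": ["A", "B"]},
    {"species": "sp_2", "family": "Fam", "year": 1900, "authors": ["B"]},
    {"species": "sp_3", "family": "Fam", "year": 1901, "authors": ["A", "B", "C"]},
]

activity = taxonomic_activity(descriptions)
print(activity)
```

In this toy example, sp_1 and sp_2 share two unique authors across two descriptions, while sp_3 has three authors for a single description.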

Some predictors analysed here were extracted from Moura et al. (2024) and represent within-range attributes derived from spatial intersections between expert-based range maps for amphibians (Jetz, McPherson and Guralnick, 2012; IUCN, 2021; Moura et al., 2024), birds (Jetz et al., 2012; Jetz, McPherson and Guralnick, 2012), mammals (Jetz, McPherson and Guralnick, 2012; IUCN, 2021; Marsh et al., 2022) and reptiles (Roll et al., 2017; Colston et al., 2020; IUCN, 2021), environmental, and socioeconomic layers using Lambert’s cylindrical equal-area grid of 110×110 km. This spatial resolution has been suggested to minimise commission errors related to the use of expert range maps (Hurlbert and Jetz, 2007). We used the following spatially based attributes: (i) range size, computed as the number of 110×110 km grid cells overlapped by the species’ range, (ii) latitude, based on the centroid of each species’ range map, (iii) elevation, as the average within-range elevation, and (iv) on-ground accessibility, measured as the average within-range time to travel to cities with more than 50,000 inhabitants (ETA50k). These four attributes were available for 33,037 (99.6%) species.

To evaluate endemism richness, we used the spatial intersections of species’ range maps across the 110×110 km grid cells (available in Moura et al., 2024) to compute the sum of the inverse range sizes of all species per taxonomic class within a grid cell (Kier and Barthlott, 2001), which we then used to calculate the median value for each species based on the grid cells they occupy. Endemism richness was available for 33,898 (96 %) species. We obtained the number of preserved specimens deposited in biological collections per species based on the Global Biodiversity Information Facility (GBIF) database — data available for 32,985 (99.8%) species. Specifically, we used the function occ_count in the rgbif R package (Chamberlain et al., 2023), creating search queries with valid species names plus their unique synonyms and setting the basisOfRecord argument to preserved specimens. After excluding species with missing data in the response or predictor variables, our final dataset had a total of 28,802 terrestrial vertebrates: 6,440 amphibians, 9,183 birds, 4,158 mammals, and 9,021 reptiles.
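The endemism richness calculation described above has two steps: sum the inverse range sizes of the species in each grid cell, then take the median of those cell values across the cells each species occupies. The Python sketch below illustrates both steps for a single taxonomic class. The species and cell identifiers are hypothetical, and range size is taken as the number of occupied cells, as in the text.

```python
from statistics import median


def endemism_richness(ranges):
    """ranges maps species -> set of occupied grid cell identifiers.

    Returns (cell_er, species_er):
      cell_er    -- sum of 1 / range size over the species present in a cell
      species_er -- median cell_er across the cells a species occupies
    """
    cell_er = {}
    for cells in ranges.values():
        weight = 1.0 / len(cells)  # inverse range size (in grid cells)
        for cell in cells:
            cell_er[cell] = cell_er.get(cell, 0.0) + weight

    species_er = {
        species: median(cell_er[cell] for cell in cells)
        for species, cells in ranges.items()
    }
    return cell_er, species_er


# Hypothetical ranges: sp_A spans two cells, sp_B is restricted to one.
ranges = {"sp_A": {1, 2}, "sp_B": {2}}
cell_er, species_er = endemism_richness(ranges)
print(cell_er, species_er)
```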

Assemblage-level Covariates

For assemblage-level analyses, we modelled the average number of synonyms per grid cell for each taxonomic class separately across three spatial grains (110, 220, and 440 km). As predictors, we used latitude, as well as median values per grid cell for elevation (Amatulli et al., 2018), ETA50K (Nelson et al., 2019), endemism richness, and year of description. For the number of preserved specimens per assemblage, we downloaded all data on preserved tetrapod species available on GBIF (GBIF.org, 2024) on 03 Apr 2024, selecting only records with available coordinates and not flagged as suspicious (potentially erroneous) by GBIF. In total, we downloaded 12,485,651 records; however, after data cleaning procedures (i.e., removing duplicate records), the number of observations was reduced to 6,288,920.

We computed three additional predictors for the assemblage-level analyses: species richness, representing the number of species whose range intersected each grid cell; median per capita gross domestic product at 5 arc-min resolution, weighted by the country area covering each grid cell (Kummu, Taka and Guillaume, 2018), which represents a proxy for financial resources potentially available for biodiversity research; and the number of biodiversity institutions — i.e., herbaria, museums, and universities — per grid cell, based on the dataset available in the R package CoordinateCleaner (Zizka et al., 2019).

Determinants of Synonym Count Variation

We modelled synonym counts per species (cross-species analyses) and the average number of synonyms (i.e., total number of synonyms divided by the total number of species) per grid cell (assemblage-level analyses) separately for amphibians, reptiles, birds, and mammals. Our two response variables were modelled as a function of biological, geographical, taxonomical, and socioeconomic-related factors (see Figure 1).

In the cross-species analyses, we modelled synonym counts through generalised linear models (GLMs) using a negative binomial distribution to account for overdispersion (Lindén and Mäntyniemi, 2011; Stoklosa, Blakey and Hui, 2022). We assessed whether phylogenetic regression models were necessary by examining the phylogenetic autocorrelation of GLM residuals, constructing Moran’s I correlograms with coefficients calculated at 14 distance classes (Revell, 2010). For this, we used the most recent phylogenies available for major tetrapod groups: amphibians (Jetz and Pyron, 2018), turtles and crocodiles (Colston et al., 2020), squamates (Tonini et al., 2016), birds (Jetz et al., 2012), and mammals (Upham, Esselstyn and Jetz, 2019). We selected 50 fully sampled phylogenies, computing Moran’s I correlations for each tree and then averaging the results (except for global models, where due to computational constraints, we limited calculations to five trees). For reptiles, we first combined the phylogenies of Tonini et al. (2016) and Colston et al. (2020) using the R function tree.merger in the package RRphylo v. 2.7.0 (Castiglione, Serio, et al., 2022), which preserves branch length information in the combined trees (Castiglione, Melchionna, et al., 2022). We inspected model residuals using the R package DHARMa v. 0.4.6 (Hartig, 2022) and assessed model fit by calculating Nagelkerke’s pseudo-R² with the R package performance v. 0.10.2 (Lüdecke et al., 2021). Analyses of phylogenetic autocorrelation were performed using the R packages phylobase (Bolker et al., 2020) and phylosignal (Keck et al., 2016).

In the assemblage-level analyses, we initially modelled the average number of synonyms per grid cell using non-spatial GLMs with a Gaussian error distribution and an identity link function. We then tested for spatial autocorrelation (SAC) in model residuals using Moran’s I coefficients based on 14 distance classes (Dormann et al., 2007; Revell, 2010). Upon detecting SAC in the residuals of the non-spatial GLMs (Supplementary Fig. S1), we proceeded with spatial GLMs. To account for SAC, we first calculated ‘residuals autocovariates’ (RAC), which were then included as covariates in the spatial GLMs, similar to an error-type SAR model (Meyer et al., 2015). Specifically, we used the autocov_dist function in the R package spdep v. 1.2.7 (Bivand, 2022) to compute RACs based on the queen’s contiguity neighbourhood structure (a list of neighbourhood cells for each grid cell) and the residuals from the non-spatial model (see the R-code for details). After fitting the spatial models, we re-evaluated SAC in the residuals, which showed a substantial reduction compared to the non-spatial models (Supplementary Fig. S2). We inspected model residuals using the package DHARMa and assessed model fit through the coefficient of determination R² statistic.
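To make the two quantities in this paragraph concrete, the Python sketch below computes a global Moran's I with binary neighbour weights and a simple residual autocovariate taken as the mean residual of each cell's neighbours. It is a simplified illustration under stated assumptions: the authors computed Moran's I across 14 distance classes and obtained autocovariates with autocov_dist in spdep, whose weighting options may differ from the plain neighbourhood mean used here. The four-cell transect and its values are hypothetical.

```python
def morans_i(values, neighbours):
    """Global Moran's I with binary weights.

    neighbours maps each cell index to the list of its neighbouring indices.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of cross-products over all (cell, neighbour) pairs.
    cross = sum(dev[i] * dev[j] for i, nbrs in neighbours.items() for j in nbrs)
    total_weight = sum(len(nbrs) for nbrs in neighbours.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)


def residual_autocovariate(residuals, neighbours):
    """Mean residual among each cell's neighbours (a simplified RAC)."""
    return [
        sum(residuals[j] for j in neighbours[i]) / len(neighbours[i])
        for i in range(len(residuals))
    ]


# Hypothetical transect of four cells, each adjacent to the next.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

smooth = morans_i([1.0, 2.0, 3.0, 4.0], neighbours)        # positive SAC
alternating = morans_i([1.0, -1.0, 1.0, -1.0], neighbours)  # negative SAC
rac = residual_autocovariate([1.0, -1.0, 1.0, -1.0], neighbours)
print(smooth, alternating, rac)
```

In a RAC-type model, the autocovariate would be added as one more predictor when refitting the model, after which Moran's I is recomputed on the new residuals.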

Continuous predictors were log₁₀-transformed if their skewness or kurtosis fell outside the range of -2 to +2 (George and Mallery, 2010), and then centred and scaled (z-transformed) to allow direct comparisons of effect sizes. We checked for multicollinearity among predictors using Variance Inflation Factors (VIFs), where strong multicollinearity is usually attributed to VIFs > 10, suggesting that variables should be removed from the analysis (Kutner et al., 2005). As none of our continuous variables had VIFs > 5, all were retained for subsequent analyses (Supplementary Tables S1–2). We modelled synonym counts globally as well as separately for each biogeographic realm (Dinerstein et al., 2017), excluding Oceania due to its low sample size (Supplementary Table S3). Species whose ranges overlapped with realms by more than 70% were assigned to realm-scale models, which ensures that only species with a strong association to a specific realm are included in realm-specific analyses.
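The transformation rule in this paragraph can be sketched in Python as follows. The sketch assumes moment-based estimators of skewness and excess kurtosis and strictly positive predictor values; statistical packages often apply small-sample corrections, so values near the ±2 threshold may be classified differently from the authors' R code. The example data are hypothetical.

```python
import math
from statistics import mean, stdev


def skewness_kurtosis(x):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(x)
    m = mean(x)
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def prepare_predictor(x, limit=2.0):
    """Log10-transform if skewness or kurtosis exceeds +/- limit, then z-score.

    Returns (z_values, was_logged). Assumes all values are > 0.
    """
    skew, kurt = skewness_kurtosis(x)
    was_logged = abs(skew) > limit or abs(kurt) > limit
    if was_logged:
        x = [math.log10(v) for v in x]
    centre, scale = mean(x), stdev(x)
    return [(v - centre) / scale for v in x], was_logged


# A strongly right-skewed predictor and a roughly symmetric one.
z_skewed, logged_skewed = prepare_predictor([1.0] * 9 + [1000.0])
z_even, logged_even = prepare_predictor([1.0, 2.0, 3.0, 4.0, 5.0])
print(logged_skewed, logged_even)
```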

Influence of Synonym Counts on Diversification Rates

We explored the potential impacts of taxonomic uncertainty on eco-evolutionary research through an exercise based on latitudinal variation in recent speciation rates, measured by the diversification rate statistic (Jetz et al., 2012) using the R package picante v. 1.8.2 (Kembel et al., 2010). Initially, we identified cases where the estimated number of synonyms (N_est.) differed significantly from the observed number of synonyms (N_obs.) based on global cross-species models. Species with N_est. > N_obs. (negative residuals) were classified as needing to “gain” synonyms, leading to lumping events. Conversely, species with N_est. < N_obs. (positive residuals) were flagged as needing to “lose” synonyms, prompting splitting events by elevating synonyms to full species status.
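For readers unfamiliar with the diversification rate (DR) statistic of Jetz et al. (2012), it is the inverse of the equal-splits measure: branch lengths on the path from a tip to the root are summed, with each successively deeper branch down-weighted by a further factor of 1/2. The Python sketch below illustrates the calculation on a toy tree. The authors computed it in R with picante; the tree encoding used here (a dictionary mapping each node to its parent and branch length) is an assumption made for the example.

```python
def dr_statistic(tip, parent):
    """Tip-level DR: inverse of the equal-splits measure.

    parent maps each node to (parent_node, branch_length); the root has
    no entry. The terminal branch has weight 1, the next 1/2, then 1/4, ...
    """
    equal_splits = 0.0
    node, depth = tip, 0
    while node in parent:
        node_parent, length = parent[node]
        equal_splits += length / (2 ** depth)
        node, depth = node_parent, depth + 1
    return 1.0 / equal_splits


# Toy ultrametric tree ((A:1,B:1):1,C:2), in arbitrary time units.
parent = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 1.0),
    "C": ("root", 2.0),
}

dr = {tip: dr_statistic(tip, parent) for tip in ("A", "B", "C")}
print(dr)
```

Tips A and B, which sit on shorter terminal branches, receive higher DR values than the isolated tip C, which is why lumping or splitting species changes the estimated rates of their close relatives.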

In the lumping exercise, we collapsed branches of 10%, 15%, and 20% of species with the highest negative residual values, recomputing diversification rates for each proportion across ‘new’ phylogenies and calculating delta values based on current phylogenetic knowledge. This exercise involves removing the selected species from the phylogeny, mimicking their lumping with their respective sister species. In the splitting exercise, we selected 10%, 15%, and 20% of species with the highest positive residual values and added different proportions of synonyms (20%, 30%, and 40% of synonyms randomly chosen from the pool of synonyms among selected species) to the most recent phylogenies, thus reflecting distinct future scenarios of synonymisation rates. These synonyms were placed as a monophyletic clade within the current valid species to which they belong. If more than one synonym was added for the same species, the first synonym was placed at a position sampled from a uniform distribution along the valid species’ branch, and subsequent synonyms were placed randomly on sampled branches descending from the most recent common ancestor between the valid species and its synonyms. This procedure simulated the uncertainty about the phylogenetic relationships of synonyms. Again, we calculated delta values based on current phylogenetic knowledge. We performed an additional analysis in the splitting exercise, simulating under the Yule model to test its effect on branch lengths and the estimation of diversification rates, compared with a uniform distribution.
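The first step of both exercises, ranking species by their model residuals and keeping a fixed proportion, can be sketched in Python as below. Several details are assumptions made for illustration: residuals are taken as observed minus estimated synonym counts, the proportion is applied to the total number of species, and the count is rounded to the nearest integer. The residual values are hypothetical.

```python
def select_for_lumping(residuals, proportion):
    """Species with the most negative residuals (estimated > observed)."""
    n_select = int(round(len(residuals) * proportion))
    negative = [sp for sp, r in residuals.items() if r < 0]
    ranked = sorted(negative, key=residuals.get)  # most negative first
    return ranked[:n_select]


def select_for_splitting(residuals, proportion):
    """Species with the largest positive residuals (estimated < observed)."""
    n_select = int(round(len(residuals) * proportion))
    positive = [sp for sp, r in residuals.items() if r > 0]
    ranked = sorted(positive, key=residuals.get, reverse=True)
    return ranked[:n_select]


# Hypothetical residuals for ten species.
values = [-3.0, -2.0, -0.5, -0.1, 0.2, 0.4, 1.0, 1.5, 2.5, 4.0]
residuals = {f"sp{i + 1}": v for i, v in enumerate(values)}

to_lump = select_for_lumping(residuals, 0.2)
to_split = select_for_splitting(residuals, 0.2)
print(to_lump, to_split)
```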

We bound species on phylogenies using the R packages phytools v. 1.2.0 (Revell, 2012) and ape v. 5.6.2 (Paradis and Schliep, 2019). Both lumping and splitting procedures were repeated for 100 trees. For each scenario, we computed the average species diversification rate across trees per taxonomic class and plotted these rates by latitude using the R package ggplot2 v. 3.4.0 (Wickham, 2016). This exercise allowed us to explore how speciation rates might be perceived under distinct scenarios of taxonomic knowledge.

References

Amatulli, G. et al. (2018) ‘A suite of global, cross-scale topographic variables for environmental and biodiversity modeling’, Scientific Data, 5(1), pp. 1–15.

Bánki, O. et al. (2024) Catalogue of Life Checklist (Version 2024-03-26), Catalogue of Life. Available at: https://doi.org/10.48580/dfz8d.

Bebber, D.P. et al. (2014) ‘Author inflation masks global capacity for species discovery in flowering plants’, New Phytologist, 201, pp. 700–706. Available at: https://doi.org/10.1111/nph.12522.

Bivand, R. (2022) ‘R Packages for Analyzing Spatial Data: A Comparative Case Study with Areal Data’, Geographical Analysis, 54(3), pp. 488–518. Available at: https://doi.org/10.1111/gean.12319.

Boero, F. (2001) ‘Light after dark: the partnership for enhancing expertise in taxonomy’, Trends in Ecology & Evolution, 16(5), p. 266.

Bolker, B. et al. (2020) ‘phylobase: Base Package for Phylogenetic Structures and Comparative Data’. Available at: https://cran.r-project.org/package=phylobase.

Castiglione, S., Serio, C., et al. (2022) ‘Fast production of large, time calibrated, informal supertrees with tree.merger’, Palaeontology, (e12588), pp. 1–11. Available at: https://doi.org/10.1111/pala.12588.

Castiglione, S., Melchionna, M., et al. (2022) ‘Human face‐off: a new method for mapping evolutionary rates on three‐dimensional digital models’, Palaeontology, 65(1), pp. 1–10. Available at: https://doi.org/10.1111/pala.12582.

Chamberlain, S. et al. (2023) ‘rgbif: Interface to the Global Biodiversity Information Facility API. R package version 3.7.8.’ Available at: https://cran.r-project.org/package=rgbif (Accessed: 13 November 2023).

Colston, T.J. et al. (2020) ‘Phylogenetic and spatial distribution of evolutionary diversification, isolation, and threat in turtles and crocodilians (non-avian archosauromorphs)’, BMC Evolutionary Biology, 20(81), pp. 1–16. Available at: https://doi.org/10.1186/s12862-020-01642-3.

Dinerstein, E. et al. (2017) ‘An Ecoregion-Based Approach to Protecting Half the Terrestrial Realm’, BioScience, 67(6), pp. 534–545. Available at: https://doi.org/10.1093/biosci/bix014.

Dormann, C.F. et al. (2007) ‘Methods to account for spatial autocorrelation in the analysis of species distributional data: A review’, Ecography, 30(5), pp. 609–628. Available at: https://doi.org/10.1111/j.2007.0906-7590.05171.x.

GBIF.org (2024) GBIF Occurrence Download. Available at: https://doi.org/10.15468/dl.k8c7ac.

George, D. and Mallery, P. (2010) SPSS for Windows Step by Step: A Simple Guide and Reference, 17.0 update. 10th edn. Boston: Allyn & Bacon.

Grieneisen, M.L. et al. (2014) ‘Biodiversity, Taxonomic Infrastructure, International Collaboration, and New Species Discovery’, BioScience, 64(4), pp. 322–332. Available at: https://doi.org/10.1093/biosci/biu035.

Hartig, F. (2022) ‘DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models’. Available at: https://cran.r-project.org/package=DHARMa.

Hurlbert, A.H. and Jetz, W. (2007) ‘Species richness, hotspots, and the scale dependence of range maps in ecology and conservation’, Proceedings of the National Academy of Sciences of the United States of America, 104(33), pp. 13384–13389. Available at: https://doi.org/10.1073/pnas.0704469104.

Jetz, W. et al. (2012) ‘The global diversity of birds in space and time’, Nature, 491(7424), pp. 444–448. Available at: https://doi.org/10.1038/nature11631.

Jetz, W. and Pyron, R.A. (2018) ‘The interplay of past diversification and evolutionary isolation with present imperilment across the amphibian tree of life’, Nature Ecology & Evolution, 2, pp. 850–858. Available at: https://doi.org/10.1038/s41559-018-0515-5.

Keck, F. et al. (2016) ‘Phylosignal: An R package to measure, test, and explore the phylogenetic signal’, Ecology and Evolution, 6(9), pp. 2774–2780. Available at: https://doi.org/10.1002/ece3.2051.

Kembel, S.W. et al. (2010) ‘Picante: R tools for integrating phylogenies and ecology’, Bioinformatics, 26(11), pp. 1463–1464. Available at: https://doi.org/10.1093/bioinformatics/btq166.

Kier, G. and Barthlott, W. (2001) ‘Measuring and mapping endemism and species richness: A new methodological approach and its application on the flora of Africa’, Biodiversity and Conservation, 10(9), pp. 1513–1529. Available at: https://doi.org/10.1023/A:1011812528849.

Kummu, M., Taka, M. and Guillaume, J.H.A. (2018) ‘Gridded global datasets for Gross Domestic Product and Human Development Index over 1990-2015’, Scientific Data, 5, pp. 1–15. Available at: https://doi.org/10.1038/sdata.2018.4.

Kutner, M.H. et al. (2005) Applied Linear Statistical Models. 5th edn. New York: McGraw-Hill Irwin.

Lindén, A. and Mäntyniemi, S. (2011) ‘Using the negative binomial distribution to model overdispersion in ecological count data’, Ecology, 92(7), pp. 1414–1421.

Lüdecke, D. et al. (2021) ‘performance: An R Package for Assessment, Comparison and Testing of Statistical Models’, Journal of Open Source Software, 6(60), p. 3139. Available at: https://doi.org/10.21105/joss.03139.

Meyer, C. et al. (2015) ‘Global priorities for an effective information basis of biodiversity distributions’, Nature Communications, 6, pp. 1–8. Available at: https://doi.org/10.1038/ncomms9221.

Moura, M.R. et al. (2024) ‘A phylogeny-informed characterisation of global tetrapod traits addresses data gaps and biases’, PLoS Biology, 22(7), p. e3002658. Available at: https://doi.org/10.1371/journal.pbio.3002658.

Moura, M.R. and Jetz, W. (2021) ‘Shortfalls and opportunities in terrestrial vertebrate species discovery’, Nature Ecology & Evolution, 5(5), pp. 631–639. Available at: https://doi.org/10.1038/s41559-021-01411-5.

Nelson, A. et al. (2019) ‘A suite of global accessibility indicators’, Scientific Data, 6(1), pp. 1–9. Available at: https://doi.org/10.1038/s41597-019-0265-5.

Oliveira, B.F. and Scheffers, B.R. (2019) ‘Vertical stratification influences global patterns of biodiversity’, Ecography, 42(2), pp. 249–258. Available at: https://doi.org/10.1111/ecog.03636.

Paradis, E. and Schliep, K. (2019) ‘ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R’, Bioinformatics. Edited by R. Schwartz, 35(3), pp. 526–528. Available at: https://doi.org/10.1093/bioinformatics/bty633.

Poulin, R. and Presswell, B. (2016) ‘Taxonomic Quality of Species Descriptions Varies over Time and with the Number of Authors, but Unevenly among Parasitic Taxa’, Systematic Biology, 65(6), pp. 1107–1116. Available at: https://doi.org/10.1093/sysbio/syw053.

R Core Team (2023) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.r-project.org/.

Revell, L.J. (2010) ‘Phylogenetic signal and linear regression on species data’, Methods in Ecology and Evolution, 1, pp. 319–329. Available at: https://doi.org/10.1111/j.2041-210X.2010.00044.x.

Revell, L.J. (2012) ‘phytools: An R package for phylogenetic comparative biology (and other things)’, Methods in Ecology and Evolution, 3(2), pp. 217–223. Available at: https://doi.org/10.1111/j.2041-210X.2011.00169.x.

Robinson, D. (2020) ‘fuzzyjoin: Join Tables Together on Inexact Matching. R package version 0.1.6.’ Available at: https://cran.r-project.org/package=fuzzyjoin.

Rodrigues, A.S.L. et al. (2010) ‘A global assessment of amphibian taxonomic effort and expertise’, BioScience, 60(10), pp. 798–806. Available at: https://doi.org/10.1525/bio.2010.60.10.6.

Stoklosa, J., Blakey, R.V. and Hui, F.K.C. (2022) ‘An Overview of Modern Applications of Negative Binomial Modelling in Ecology and Biodiversity’, Diversity, 14(5), p. 320. Available at: https://doi.org/10.3390/d14050320.

Tonini, J.F.R. et al. (2016) ‘Fully-sampled phylogenies of squamates reveal evolutionary patterns in threat status’, Biological Conservation, 204, pp. 23–31. Available at: https://doi.org/10.1016/j.biocon.2016.03.039.

Uetz, P. and Hošek, J. (2021) The Reptile Database, In O. Bánki, Y. Roskov, M. Döring, G. Ower, D. R. Hernández Robles, C. A. Plata Corredor, T. Stjernegaard Jeppesen, A. Örn, L. Vandepitte, D. Hobern, P. Schalk, R. E. DeWalt, K. Ma, J. Miller, T. Orrell, R. Aalbu, J. Abbott, R. Adlard, C. Aedo, et al. Cat. Available at: https://doi.org/10.48580/dfvll-37s.

Upham, N.S., Esselstyn, J.A. and Jetz, W. (2019) ‘Inferring the mammal tree: Species-level sets of phylogenies for questions in ecology, evolution, and conservation’, PLoS Biology, 17(12), pp. 1–44. Available at: https://doi.org/10.1371/journal.pbio.3000494.

Wickham, H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Zizka, A. et al. (2019) ‘CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases’, Methods in Ecology and Evolution, 10(5), pp. 744–751. Available at: https://doi.org/10.1111/2041-210X.13152.
