Skip to main content

Macroecological correlates of Darwinian shortfalls across terrestrial vertebrates

Cite this dataset

Guedes, Jhonny J. M.; Diniz-Filho, José Alexandre F.; Moura, Mario Ribeiro (2024). Macroecological correlates of Darwinian shortfalls across terrestrial vertebrates [Dataset]. Dryad.


Most described species have not been explicitly included in phylogenetic trees—a problem named the Darwinian shortfall—due to a lack of molecular and/or morphological data, thus hampering the explicit incorporation of evolution into large-scale biodiversity analyses. We investigate potential drivers of the Darwinian shortfall in tetrapods, a group where at least one-third of described species still lack phylogenetic data, thus necessitating the imputation of their evolutionary relationships in fully-sampled phylogenies. We show that the number of preserved specimens in scientific collections is the main driver of phylogenetic knowledge accumulation, highlighting the major role of biological collections in unveiling novel biodiversity data and the importance of continued sampling efforts to reduce knowledge gaps. Additionally, large-bodied and wide-ranged species, as well as terrestrial and aquatic amphibians and reptiles, are phylogenetically better known. Therefore, future efforts should prioritize phylogenetic research on organisms that are narrow-ranged, small-bodied, and underrepresented in scientific collections, such as fossorial species. Addressing the Darwinian shortfall will be imperative for advancing our understanding of evolutionary drivers shaping biodiversity patterns and implementing comprehensive conservation strategies.

README: README: Macroecological correlates of Darwinian shortfalls across terrestrial vertebrates

There are five main documents associated with this paper:

(1) our raw data (AdditionalVars.csv); (2) marine taxa as per the IUCN (marine_spp_IUCN.csv); (3) phylogenetic trees for amphibians, birds, mammals, and reptiles stored in a zip file (; (4) outputs from phylogenetic autocorrelation analyses (; and (5) the R code (Rcode_DarwinianShortfalls.R).

Description of the data and file structure

(1) AdditionalVars.csv; contains 3 columns, each one explained below:

  • Scientific.Name: scientific name of each species included in the most fully-sampled phylogenies for tetrapods.

  • sequenced: Informs if there is any genetic data available for each species (data obtained from Smid 2022; DOI: 10.1111/jbi.14491)

  • GBIF_occ: Total number of preserved specimens housed in scientific collection for each species based on the Global Biodiversity Information Facility (GBIF).

(2marine_spp_IUCN.csv; this file contains 2 columns (see below) and represents species classified as marine by the International Union for Conservation of Nature (IUCN). This data is used to remove marine tetrapods, keeping only terrestrial vertebrates in subsequent analyses. 

  1. scientificName: scientific name of each tetrapod species classified as marine by the IUCN.
  2. authority: authors of species description.

(; contains 4 files, which represent 100 fully-sampled phylogenies for amphibians, birds, mammals and reptiles. All species in each phylogenies matches the species in "Scientific.Name". These trees are needed for checking phylogenetic autocorrelation in model residuals and for computing Darwinian deficits across taxonomic families and classes. Data for amphibians, birds, and mammals are available on nexus format, while data for reptiles is stored on Rdata format. Nexus files are primarily designed for use with specialised software, but they are plain text files and can be opened and edited using standard text editors. Here, we open and explore both nexus and Rdata files with R software using the R code (RcodeDarwinianShortfalls.R) provided along with the other files.

(; contains 28 Rdata files, which represent the results of the phylogenetic autocorrelation analyses across each class-realm combination. Since these analyses may take too long to run on personal computers, we have made these outputs available so that readers can load them to understand the data structure and recreate the related plots. Rdata files can be read with the software R.

(5Rcode_DarwinianShortfalls.R; this file contains the R code used to perform all analyses of our work as well as for creating the images on the main text and supplementary material. The script in commented by the authors and intuitively structured, including needed packages and their respective versions.


All analyses were performed using the software R version 4.3.1. 

The R-code has 7 steps:

1. Load and understand the data, then explore the response and predictor variables.

2. Make barplots showing the proportion of phylogenetically imputed species per family.

3. Prepare variables that will be used, check for multicollinearity and standardise.

4. Model the binary response using generalised linear mixed-effects models (GLMMs).

5. Check whether GLMM residuals show phylogenetic autocorrelation.

6. Create a plot with model coefficients and CI intervals.

7. Compute Darwinian deficits across taxonomic families and classes.

Usage notes

Software R is required to open RcodeDarwinianShortfalls.R as well as for reading phylogenetic trees available in the file (note that you will need a ZIP extractor to extract the phylogenies first) and the Rdata files within (you also need to extract the files first to your working directory/folder).

Microsoft Excel can be used to view AdditionalVars.csv and marine_spp_IUCN.csv


(a)   Phylogenetic assessment and species-level covariates

We firstly identified which tetrapod species had their evolutionary relationships imputed in fully-sampled phylogenies available for the group (1–5), thus creating a binary response variable for 33,281 species included in such phylogenies. We excluded 228 marine species (according to the International Union for Conservation of Nature — IUCN), keeping only terrestrial vertebrates (n = 33,053) in subsequent analyses. We then selected eight putative predictors to investigate how species’ biology, geography and biodiversity appeal, and socioeconomic-related factors affect the probability of species being imputed on phylogenies. Many predictors used here comes from the TetrapodTraits, a recent database for the world’s tetrapod including standardised species-level attributes (6). Below, we outline the attributes used as predictors, along with brief explanations of how they were computed.

The year of species description was based on the original description publication date and was available for all but one species (Indotyphlops pushpakumara, an undescribed species included in Tonini’s phylogeny, 4). For body size, we used body mass information for birds, mammals, and reptiles as data coverage was on average higher than 95% (n = 24,761 out of 25,815 species). Conversely, body mass information was available for only 21.4% (n = 1,547) amphibian species, and therefore, we used body length, which coverage represented 97.9% (n = 7,083) species. Information on microhabitat use was available for 31,501 (95.3%) species, and it was converted into a continuous metric of verticality (7), with species being scored as: 0 = strictly fossorial, 0.25 = fossorial and aquatic/terrestrial, or fossorial and aquatic and terrestrial, 0.5 = aquatic/terrestrial, or fossorial and arboreal, or fossorial and aquatic/terrestrial and arboreal, 0.75 = terrestrial/aquatic and arboreal, or terrestrial and aquatic and arboreal, or terrestrial and aerial, and 1 = strictly arboreal or aerial. Species-specific sources on body size and microhabitat are available in (6).

Some TetrapodTraits attributes were derived as within-range predictors based on expert-based range maps for amphibians (8–10), birds (2,10), mammals (9–11), and reptiles (1,9,12) overlaid onto a 110×110 km cylindrical equal-area grid cell scheme. This grain size minimizes the false presences related to the use of expert range maps (13). Species range size was represented as the number of 110-km grid cells overlapped by range maps. Two other attributes corresponded to within-range averages of raster layers aggregated to the resolution of the grid cell scheme, namely: elevation, which was based on a 1km topographic layer (14) and roughly captures broad elevational patterns in a continuous format, without the discretization of species into lowland vs. highland areas, and human density (as inhabitants per km2 at year 2017), derived from the HYDE 3.2 database at a spatial resolution of 5 arc-min (15). Additionally, we obtained endemism richness, which represents a proxy for range rarity. This metric was computed as the sum of the inverse range sizes of all species per taxonomic class in a grid-cell (16), which we then used to calculate the median endemism richness for each species based on the cells they occur in.

We extracted the number of preserved specimens per species deposited in biological collections worldwide using the function occ_count in the rgbif R package (17). For that, we created search queries containing valid species name plus their unique synonyms (i.e., invalid names that can be tracked back to a single valid name) and setting the basisOfRecord argument to preserved specimen. Synonym information was obtained from the IUCN taxonomy backbone using the rl_synonyms function in the rredlist R package (18). Data on these—and the spatially-based—predictors were available for more than 99.9% species. After excluding species with missing data in a least one predictor, our final dataset included 30,321 species: 6,526 amphibians, 9,369 birds, 5,052 mammals, and 9,374 reptiles.


(b)  Statistical analyses

We modelled the probability of a species being phylogenetically imputed (binary response variable: 0 = not imputed, 1 = imputed) as a function of eight putative predictors (fixed effects; see Box 1) while accounting for taxonomic relatedness (random effects) using generalised linear mixed-effects models (GLMM; 19) with a binomial error distribution and a logit link function (20). We included taxonomic family as a random variable in our models to minimize dependence issues among species. Families with less than three species (supplementary material, table S1) were removed to reduce instability in model estimates (21), which decreased to 30,178 the number of species to be modelled.

We evaluated whether phylogenetic regression models were needed by examining the phylogenetic autocorrelation of GLMM residuals through Moran’s I correlograms computed across 14 distance classes (22). The phylogenetic correlograms were based on averaged results from 50 fully sampled phylogenies for each taxonomic class (1–5). For reptiles, we firstly constructed supertrees by combining Tonini’s and Colston’s phylogenies (1,4) using the function tree.merger in the R package RRphylo (23), which preserves branch length information in the combined trees (24). For the global models, we used only five trees due to computational limitations. Analyses of phylogenetic autocorrelation were performed using the R packages phylobase (25) and phylosignal (26).

Prior to constructing the GLMMs, continuous predictors were log10-transformed if skewness or kurtosis were outside of the range of -2 and +2 (27), then centred and scaled (z-transformed) to allow direct comparisons of their effect sizes. We checked for multicollinearity among predictors using the Variation Inflation Factors (VIF), where strong multicollinearity is usually attributed to VIFs > 5, indicating that variables should be removed from the analysis (28). Since none of our continuous variables had VIF > 4, we kept them all in subsequent analyses (supplementary material, table S2).

We modelled our binary response variable separately for amphibians, birds, mammals, and reptiles. Models were constructed globally as well as separately for each biogeographic realm (29), except for Oceania due to its low sample size (supplementary material, table S3). Species whose range overlapped realms by >70% were assigned to realm-scale models. We inspected model fit using the R package DHARMa (30) and assessed explained variation by calculating the pseudo-R2 with the R package performance (31). We used the package lme4 (32) for fitting the mixed-effect models and usdm (33) for computing VIF values. All analyses were performed using R version 4.3.1 (34). See data accessibility for raw data and R-code.

Lastly, we computed a measure of “Darwinian deficit” (35) per family and class to quantify the relative contribution of phylogenetically imputed relationships in representing the accumulated evolutionary history across taxa. This measure is based on Faith’s phylogenetic diversity (PD) metric (36) and was calculated as: PD imputed species / (PD imputed species + PD non-imputed species). The Darwinian deficit ranges from 0 to 1 and informs the proportion of total PD (i.e., sum of branch lengths) that is attributed to imputed species in a sample (e.g., family). We obtained the Darwinian deficit per family across 100 fully-sampled phylogenies for each taxonomic class, using only families with at least one imputed species. We then inspected whether average values per family was influenced by the respective species richness.



1.        Colston TJ, Kulkarni P, Jetz W, Pyron RA. Phylogenetic and spatial distribution of evolutionary diversification, isolation, and threat in turtles and crocodilians (non-avian archosauromorphs). BMC Evol Biol. 2020;20(81):1–16.

2.        Jetz W, Thomas GH, Joy JB, Hartmann K, Mooers AO. The global diversity of birds in space and time. Nature. 2012;491(7424):444–8.

3.        Jetz W, Pyron RA. The interplay of past diversification and evolutionary isolation with present imperilment across the amphibian tree of life. Nat Ecol Evol. 2018;2:850–8.

4.        Tonini JFR, Beard KH, Ferreira RB, Jetz W, Pyron RA. Fully-sampled phylogenies of squamates reveal evolutionary patterns in threat status. Biol Conserv. 2016;204:23–31.

5.        Upham NS, Esselstyn JA, Jetz W. Inferring the mammal tree: Species-level sets of phylogenies for questions in ecology, evolution, and conservation. PLoS Biol. 2019;17(12):1–44.

6.        Moura MR, Ceron K, Guedes JJM, Chen-Zhao R, Sica Y V, Hart J, et al. A phylogeny-informed characterisation of global tetrapod traits addresses data gaps and biases. PLoS Biol. 2024;In press.

7.        Oliveira BF, Scheffers BR. Vertical stratification influences global patterns of biodiversity. Ecography (Cop). 2019;42(2):249–249.

8.        González-del-Pliego P, Freckleton RP, Edwards DP, Koo MS, Scheffers BR, Pyron RA, et al. Phylogenetic and Trait-Based Prediction of Extinction Risk for Data-Deficient Amphibians. Curr Biol. 2019;29(9):1557–63.

9.        IUCN. The IUCN Red List of Threatened Species. Version 2021.3 [Internet]. 2021 [cited 2019 Mar 19]. Available from:

10.      Jetz W, McPherson JM, Guralnick RP. Integrating biodiversity distribution knowledge: Toward a global map of life. Trends Ecol Evol. 2012;27(3):151–9.

11.      Marsh CJ, Sica Y V., Burgin CJ, Dorman WA, Anderson RC, del Toro Mijares I, et al. Expert range maps of global mammal distributions harmonised to three taxonomic authorities. J Biogeogr. 2022;49(5):979–92.

12.      Roll U, Feldman A, Novosolov M, Allison A, Bauer AM, Bernard R, et al. The global distribution of tetrapods reveals a need for targeted reptile conservation. Nat Ecol Evol. 2017;1:1677–1682.

13.      Hurlbert AH, Jetz W. Species richness, hotspots, and the scale dependence of range maps in ecology and conservation. Proc Natl Acad Sci U S A. 2007;104(33):13384–9.

14.      Amatulli G, Domisch S, Tuanmu MN, Parmentier B, Ranipeta A, Malczyk J, et al. A suite of global, cross-scale topographic variables for environmental and biodiversity modeling. Sci Data. 2018;5(1):1–15.

15.      Goldewijk KK, Beusen A, Doelman J, Stehfest E. Anthropogenic land use estimates for the Holocene - HYDE 3.2. Earth Syst Sci Data. 2017;9(2):927–53.

16.      Kier G, Barthlott W. Measuring and mapping endemism and species richness: A new methodological approach and its application on the flora of Africa. Biodivers Conserv. 2001;10(9):1513–29.

17.      Chamberlain S, Barve V, Mcglinn D, Oldoni D, Desmet P, Geffert L, et al. rgbif: Interface to the Global Biodiversity Information Facility API. R package version 3.7.8. [Internet]. 2023 [cited 2023 Nov 13]. Available from:

18.      Chamberlain S. rredlist: “IUCN” Red List Client. R package version 0.7.0. [Internet]. 2020. p. 23. Available from:

19.      Zuur AF, Ieno EN, Walker NJ, Saveliev AA, Smith GM. Statistics for Biology and Health. Springer; 2009. 574 p.

20.      Bolker BM, Brooks ME, Clark CJ, Geange SW, Poulsen JR, Stevens MHH, et al. Generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol Evol. 2009;24(3):127–35.

21.      Harrison XA, Donaldson L, Correa-Cano ME, Evans J, Fisher DN, Goodwin CED, et al. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ. 2018;2018(5):1–32.

22.      Revell LJ. Phylogenetic signal and linear regression on species data. Methods Ecol Evol. 2010;1:319–29.

23.      Castiglione S, Melchionna M, Profico A, Sansalone G, Modafferi M, Mondanaro A, et al. Human face‐off: a new method for mapping evolutionary rates on three‐dimensional digital models. Palaeontology. 2022;65(1):1–10.

24.      Castiglione S, Serio C, Mondanaro A, Melchionna M, Raia P. Fast production of large, time calibrated, informal supertrees with tree.merger. Palaeontology. 2022;(e12588):1–11.

25.      Bolker B, Butler M, Cowan P, Vienne D de, Eddelbuettel D, Holder M, et al. phylobase: Base Package for Phylogenetic Structures and Comparative Data [Internet]. 2020. Available from:

26.      Keck F, Rimet F, Bouchez A, Franc A. Phylosignal: An R package to measure, test, and explore the phylogenetic signal. Ecol Evol. 2016;6(9):2774–80.

27.      George D, Mallery P. SPSS for Windows Step by Step: A Simple Guide and Reference, 17.0 update. 10th ed. Boston: Allyn & Bacon; 2010. 386 p.

28.      Kutner MH, Nachtsheim CJ, Neter J, Li W. Applied Linear Statistical Models. 5th ed. New York: McGraw-Hill Irwin; 2005. 1396 p.

29.      Dinerstein E, Olson D, Joshi A, Vynne C, Burgess ND, Wikramanayake E, et al. An Ecoregion-Based Approach to Protecting Half the Terrestrial Realm. Bioscience. 2017 Jun;67(6):534–45.

30.      Hartig F. DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models [Internet]. 2022. Available from:

31.      Lüdecke D, Ben-Shachar M, Patil I, Waggoner P, Makowski D. performance: An R Package for Assessment, Comparison and Testing of Statistical Models. J Open Source Softw. 2021;6(60):3139.

32.      Bates D, Mächler M, Bolker BM, Walker SC. Fitting linear mixed-effects models using lme4. J Stat Softw. 2015;67(1).

33.      Naimi B, Hamm NAS, Groen TA, Skidmore AK, Toxopeus AG. Where is positional uncertainty a problem for species distribution modelling? Ecography (Cop). 2014;37(2):191–203.

34.      R Core Team. R Foundation for Statistical Computing, Vienna, Austria. 2023. R: A Language and Environment for Statistical Computing. Available from:

35.      Nakamura G, Richter A, Soares BE. FishPhyloMaker: An R package to generate phylogenies for ray-finned fishes. Ecol Inform. 2021;66:101481.

36.      Faith DP. Conservation evaluation and phylogenetic diversity. Biol Conserv. 1992;61(1):1–10.


Coordenação de Aperfeicoamento de Pessoal de Nível Superior, Award: proc. 88887.478942/2020-00

Fundação de Amparo à Pesquisa do Estado de São Paulo, Award: 2021/11840-6 & 2022/12231-6

National Council for Scientific and Technological Development, Award: 465610/2014-5

Fundação de Apoio a Pesquisa do Estado de Goiás, Award: 201810267000023