Data from: Future palaeontologists will detect current mammal latitudinal biodiversity gradient

Galván, Sofía 1 ; Gamboa, Sara1; Chiarenza, Alfio Alessandro2; Rotatori, Filippo Maria3; Oliver, Adriana1; Varela, Sara1

Published Jul 04, 2025 on Dryad. https://doi.org/10.5061/dryad.612jm64db

Data files

Jul 04, 2025 version files 291.52 KB

Fossil_bias_LDG_script_independent.R

60.40 KB
Fossil_bias_LDG_script_quaternary.R

108.16 KB
Fossil_bias_LDG_script_replicates.R

40.52 KB
Fossil_bias_LDG_script.R

79.01 KB
README.md

3.44 KB

Abstract

Aim: Fossil data provide crucial insights into past biogeographic and macroecological patterns. However, geological, biological, and sampling biases can potentially compromise genuine biodiversity inferences. Here, we tested whether fossil biases may hinder the accurate retrieval of the Latitudinal Biodiversity Gradient (LBG).

Location: Global.

Time period: Contemporary.

Major taxa studied: Mammals.

Methods: We implemented a filtering process to current mammal distribution maps, simulating one geological, two biological, and three sampling sources of bias. Namely, distribution maps were downgraded to regions with sediments, species preservation was modulated by their range size and body size, and sampling was applied to locations with a fossil record. We applied the filters sequentially to mimic a process of progressive fossilisation, considering three preservation rates and removing up to 98.8% of the original species. We also applied filters independently to assess their individual effect. Lastly, we quantified the richness loss, the change in the slope between latitude and richness, and the change in richness maxima throughout the filters.

Results: Results indicate that the applied filters collectively and distinctly influence the detection and robustness of the LBG signal. However, although the slope of the richness gradient diminishes progressively (especially for filters affecting species by their body size or taxonomic group), a LBG signal is detected across all the filters. Equally, despite the critical species loss, richness maxima remain around the equator.

Main conclusions: We demonstrated that strongly incomplete or biased samples can still recover accurate large-scale biogeographic patterns such as the LBG. Our results show an optimistic scenario in which, although the LBG intensity is sensitive to the uneven loss of information in biodiversity data, a detectable signal can be retrieved for all scenarios.

https://doi.org/10.5061/dryad.612jm64db

Description of the data and file structure

Code and data for running the analyses in the article "Future palaeontologists will detect current mammal latitudinal biodiversity gradient". This repository includes the species presence/absence databases, the fossil ocurrences database, and the R scripts used for processing and analysing.

Files and variables

"Fossil_bias_LDG_script.R" : R code for running the main text consecutive bias-simulating workflow (one replicate, Time-unlimited approach). It includes the code for processing the data, running analyses, and generating the figures.
"Fossil_bias_LDG_script_replicates.R": Code for running the consecutive bias-simulating workflow for the remaining replicates (Time-unlimited approach). It includes the code for processing the data, running analyses, and generating the figures.
"Fossil_bias_LDG_script_independent.R": Code for running the independent bias-simulating workflow (Time-unlimited approach). It includes the code for processing the data, running analyses, and generating the figures.
"Fossil_bias_LDG_script_quaternary.R": Code for running the Period-limited bias-simulating workflow. It includes the code for processing the data, running analyses, and generating the figures.

The following files are under Zenodo supplemental information in order to align with the license requirements

"Fresh_mam_database.rds": Geographical database for freshwater mammals from IUCN. It is an R object indicating the presence (1) or absence (0) of each species in each of the pixel. Rows correspond to map pixels, while columns correspond to species (one column per species). The database also contains two extra columns indicating the landmass and climate zone to which each pixel belongs, but these variables were not used in the present study.
"Terr_mam_database.rds": Geographical database for terrestrial mammals from IUCN. It is an R object indicating the presence (1) or absence (0) of each species in each of the pixel. Rows correspond to map pixels, while columns correspond to species (one column per species).
"pbdb_mamfossildata.rds": Paleobiology Database downloaded mammal data from https://paleobiodb.org/#/. From the originally provided columns, we used "Class" and "family" columns for to perform the taxonomic modifications, and "lng" and "lat" to create the map of fossil occurrences.

Code/software

All analyses were conducted using R version 4.2.2.

Implemented packages: devtools, dplyr, ggplot2, lepidochroma, paletteer, quantreg, raster, rgdal, rgeos, rnaturalearth, sf, sp, spData, terra, tidyverse.

Access information

Data was derived from the following sources:

IUCN; https://www.iucnredlist.org/resources/spatial-data-download
WorldClim; https://worldclim.org/data/worldclim21.html
Unconsolidated sediments (UcS) file; https://doi.org/10.1594/PANGAEA.884822
Fossil record data; https://paleobiodb.org/#/
COMBINE; https://doi.org/10.6084/m9.figshare.13028255.v4

All analyses were conducted using R version 4.2.2 (R Core Team, 2022), and a list of implemented packages can be found in Table S9.

Mammal distribution data

We obtained current species range maps from the International Union for Conservation of Nature website (IUCN; https://www.iucnredlist.org/resources/spatial-data-download). Terrestrial mammal data were downloaded on 24^th January 2022 and freshwater (inland water) mammal data on 21^st September 2022. Range shapefiles were loaded in R using the ‘rgdal’ package version 1.5-28 (Bivand, Keitt, & Rowlingson, 2021). We excluded polygons associated with distributional data of Delphinidae, Iniidae, Lipotidae, Phocidae, Phocoenidae, Platanistidae, and Trichechidae families, due to their predominantly aquatic habits. We also excluded species range polygons with presence values 3 (“possibly extant”) and 6 (“presence uncertain”), as well as with origin values 3 (“introduced”) and 4 (“vagrant”), avoiding highly uncertain records and keeping the natural range of each species (Miraldo et al., 2016). As a result, 5,739 species were initially considered for the study (Table 1). We rasterised each range information at 0.5º using the ‘terra’ package version 1.5-21 (Hijmans, 2022).

Climate data

As the tropics have been widely renowned as the most species-rich region (Hillebrand, 2004), we decided to divide the land based on Köppen-Geiger climate classification system (Beck et al., 2018a) to analyse the effect of different sources of bias on mammal richness across the globe. To do so, we used monthly precipitation and temperature current data from WorldClim (Fick & Hijmans, 2017) and reclassified them into Köppen-Geiger five main climate zones (tropical, arid, temperate, cold, and polar) following Beck et al. (2018b). To do so, we used the “KoppenGeiger” MATLAB function (Beck et al., 2018a) adapted to R (Galván, Gamboa, & Varela, 2023).

Filters for geological bias sources (“geological filters”)

To simulate physical sources of bias in the deposition of potential fossils, we implemented an unconsolidated sediments (UcS) raster file (Börker, Hartmann, Amann, & Romero-Mujalli, 2018). Rather than a precise predictor of sedimentary rock formation, fossil deposition, or future rock exposure probabilities, this map serves as an initial geographic constraint for fossil formation. This map divides the sedimentary surface into 39 classes (Table S1). Since not all sedimentary environments are equally conductive to fossilisation, we excluded “Ice and Glaciers” and “Glacial” classes (including proglacial, till, and undifferentiated), because preservation potential is probably extremely low in these environments (A. Castillo, personal communication, April 2023). We reclassified the resulting map into a binary raster, where 1 represents current UcS and 0 represents other geological materials, resampling it to 0.5º resolution. To do so, we used the classify and resample functions of the ‘terra’ package version 1.5-21 (Hijmans, 2022). We then filtered the raw biodiversity data (referred as “Original” scenario) using this layer (referred as “Sediments” filter), creating an optimistic scenario by retaining all pixels with UcS classes. We acknowledge that some categories with lower fossilisation potential were included (“glacio-fluvial”, “glacio-lacustrine”, and “glacio-marine”), because these probably are the ones with the highest preservation potential inside glacial sediments. We also included “Evaporites” class because, although mammal preservation here is likely lower, some findings have been registered (Bennett et al., 2021). Similarly, we included “Water” class because lakes and other water bodies may act as sedimentary repositories over time (Steadman et al., 2007).

Depositional environments not only influence the fossilisation potential, but also affect how samples are recovered, ranging from well-preserved articulated fossils to dissociated or abraded ones (Holland, 2016). Additionally, some current sedimentary configuration, such as ephemeral sediments, may not persist over longer time frames (Holland, 2016; Nyberg & Howell, 2015). Despite these complexities, for the sake of simplicity, we assumed that species would be recoverable and identifiable in sedimentary basins. Equally, other factors such as the sediment age and grain size may further influence the fossil information being retrieved. However, due to limitations in the global availability of these data, we opted for exclusively using UcS presence information (Börker et al., 2018).

Filters for biological bias sources (“biological filters”)

We applied two filters to simulate a biological bias against species with small body sizes and narrow geographic ranges (referred as “Body-size” and “Range-size” filters, respectively). Range-size filter also accounts for variability in abundance, owing to the established occupancy-abundance relationship (Jernvall & Fortelius, 2004). Mammal body size information was obtained from the COMBINE database (Soria, Pacifici, Di Marco, Stephen, & Rondinini, 2021; downloaded on 15^th June 2022). To determine range sizes, we calculated the area of all the pixels occupied by each species, accounting for pixel size variation (cellSize function in ‘terra’ package 1.5-21; Hijmans, 2022). For analytical purposes, we only included the species found in both IUCN range maps and COMBINE database (5,701 species; 99.33% of original species). Two more species (Nyctophilus howensis and Pteropus rodricensis), whose distributions fell outside land limits of WorldClim data, were excluded (Table 1). We sampled species weighting them by their body size and range size, giving a higher sampling probability to those species with larger body size/range size. In this sense, body size and range size values act as probability weights for the sampling procedure (sample function of the ‘base’ package version 4.2.2; R Core Team, 2022).

We considered these biological filters just affecting species’ incorporation into the fossil record, although we are aware that they can be important factors along all the fossilisation and recovery processes. For instance, smaller individuals are generally more susceptible to destructive taphonomic processes, although they are also more readily buried (Eberth, Rogers, & Fiorillo, 2007). However, we decided to maintain the same assumption as previously, where these filters only affect the inputs to the fossil record. In other words, once a species is included in the record based on its range size or body size (preserving all pixels where that species is present), we assume it will be entirely recovered and recognisable if it passes through subsequent filters. While this approach may overlook some complexities of the fossilisation process, it allows us to isolate the effects of specific filters and assess their impact on the detection of the LBG.

Filters for sampling bias sources (“Sampling filters”)

To assess sampling-sourced bias, we first created a “Time-unlimited” geographical filter by selecting all the locations where fossil occurrences were found (referred as “Sampling-spatial” filter). We downloaded all “Animalia” records from the PBDB (https://paleobiodb.org/#/, 12^th January 2023), and extracted “Mammalia” fossil and coordinates information. We then calculated the number of occurrences per geographic point, rasterised this data at 0.5º, and reclassified it into a presence/absence map of fossil occurrences. To achieve this, we used the rasterize and classify functions of the ‘terra’ package version 1.5-21 (Hijmans, 2022). Lastly, we applied this filter to the already geologically- and biologically-filtered data, preserving exclusively the pixels containing fossil information. We assumed that the prior filters (geological and biological) already captured the conditions required for fossil preservation. In this sense, we selected the number of mammal occurrences in the PBDB as a control for collecting effort (Upchurch, Mannion, Benson, Butler, & Carrano, 2011), and did not select any particular geological stage to reflect all current distribution of sampling effort.

We incorporated two additional filters based on taxonomy in order to reflect the unequal recovery and identification of taxonomic groups at a fossil site (Benton, 2009). We focused on the family level and calculated the number of fossil occurrences per family in the PBDB (i.e., we assume that those families with higher PBDB occurrences are receiving a greater interest; Table S2). For the first taxonomy-based simulation (referred as “Taxonomic whole-families” filter), we selected the families with the highest number of observations, retaining all their species. For the second approach (referred as “Taxonomic-weighted” filter), we sampled a portion of the species weighting by the number of observations of the family they belong to, using the sample function of the ‘base’ package version 4.2.2 (R Core Team, 2022). In this sense, for the Taxonomic whole-families filter, we mimic a preference towards some families, being completely sampled, and other families with no chances of being sampled (creating advantageous and neglected groups; Whitaker & Kimmig, 2020). For the Taxonomic-weighted filter we permitted a more flexible scenario, where every species has a chance of being sampled depending on the number of observations of the family they belong to (i.e., more observations, more chances of being sampled). Unlike previous steps, both taxonomic filters were independently applied to the previous Sampling-spatial scenario, allowing us to differentiate between both approaches. Due to taxonomic inconsistencies, we modified some PBDB taxonomic groups to align them with those in COMBINE database (Methods S1). Sampling filters of the Time-unlimited approach are hereafter referred exclusively as Sampling-spatial, Taxonomic-weighted, and Taxonomic whole-families.

Additionally, to test the potential effect of erosion and burial processes not controlled by the Time-unlimited sampling approach, we carried out a complementary analysis by including period-specific sampling filters. To do so, we repeated the filtering process by just selecting Quaternary fossil and coordinates information (2.6 million years to the present). Reference to these filters is followed by the specification “Period-limited” (Sampling-spatial (Period-limited), Taxonomic-weighted (Period-limited), and Taxonomic whole-families (Period-limited)).

Filtering process and estimation of global statistics and LBG signal

In general, combined filters may reduce the amount of available biodiversity information. For example, geological and biological filters may act together, as certain environments, like fine-grained and poorly consolidated sediments, are more conducive to the preservation of small/delicate fossils, compared to older coarse-grained sequences subjected to alteration (Smith & Mcgowan, 2011). Equally, sampling and biological filters may intersect, as certain fossil sites, such as specific macro or micro fossil sites, receive more attention (Eberth, Shannon, & Noland, 2007). To address these complexities, we applied each filter consecutively, starting with the Original scenario (5,699 species) and following this order: original, geological, biological, and sampling filters. Exception were the Taxonomic-weighted and the Taxonomic whole-families filters, which were independently applied to the Sampling-spatial filter (Figure 1). However, we also conducted an independent test in which each filter was directly applied to the Original scenario to inspect its individual effect.

To account for differential intensity on preservation potential, we assumed 75%, 50%, and 25% species sampling rates based on previous studies (Darroch et al., 2021; Fraser, 2017; Fraser, Hassall, Gorelick, & Rybczynski, 2014). We aimed to test Fraser’s results (2017), where medium (50%) and high (75%) preservation rates reliably estimated North American mammal LBG, but started to fail for lower rates (25%). Each percentage (75%, 50% and 25%) was applied to the remaining data from previous filters, so the final number of species can be as low as 1,514, 510 and 68 species for Taxonomic-weighted filter (representing 26.6%, 8.9% and 1.1% of original richness; Table 1). As the filtering process may sample different species each time, we ran reproducible workflows with 10 random seeds (288, 1496, 1795, 960, 1893, 1610, 717, 774, 944, and 1942). The workflow for seed = 288 is shown in the main text; results for the remaining seeds can be found in Table S3 and S6.

All final raster data were projected into the Mollweide equal area projection using the project function of the ‘terra’ package version 1.5-21 (Hijmans, 2022). We measured mammal species richness across all filters, both globally and within Köppen-Geiger climate zones. We also calculated species richness per pixel, generating richness maps, and the location (latitude) of richness maxima throughout the filters. Given that we implemented both pixel-focused (Sediments and Sampling-spatial) and species-focused (Range-size, Body-size, and both Taxonomic) filters, we measured area loss (as the number of pixels) after the Sediments and Sampling-spatial filters, and species richness loss after all filters.

Measuring the slope of the linear regression between latitude and richness has been widely used as a proxy for LBG strength (Fraser, 2017; Hillebrand, 2004; Kinlock et al., 2018; Marcot, Fox, & Niebuhr, 2016). However, based on the observation of an upper boundary pattern that limits the maximum richness per pixel at each latitude (Figure S1); we performed 0.9 quantile regressions to predict the “maximum” relationship (90% percentile) between latitude and richness. To do so, we transformed richness maps into spatial vector objects (‘sf’ package version 1.0-8; Pebesma, 2018), extracted richness values (total number of species) per raster pixel, and performed 0.9 quantile regressions between absolute latitude and richness (rq function of the ‘quantreg’ package version 5.95; Koenker, 2022). We applied 100 bootstrap replicates, randomly sampling 50% of the points and calculating the median and 95% confidence intervals of the slope estimates for each filter.