Preservation biases in the fossil record distort species ecological niche and distribution models

Bellve, Andre 1 ; Syverson, Val 2 ; Blois, Jessica 2 ; Jarzyna, Marta 3

Research facility: The Ohio State University

Published Dec 11, 2025 on Dryad. https://doi.org/10.5061/dryad.mpg4f4rbm

Data files

Dec 11, 2025 version files 5.51 GB

Abstract

Ecological niche models (ENMs) increasingly leverage the fossil record to understand species’ environmental associations and predict their geographic distributions. However, fossils do not occur uniformly through time and space, which can compromise the robustness of ENMs and thus affect ecological conclusions. Here, we assessed how preservation biases in the fossil record impact our ability to reconstruct ecological niches and distributions of North American small mammals during the late Quaternary. First, using small mammal fossil occurrences and associated depositional environment data, we quantified preservation potential (the likelihood that a given environment supports fossil preservation) and the preservation niche (environmental correlates of preservation) for three late Quaternary time periods (the Last Glacial Maximum, the deglacial period, and the Holocene). Second, we imposed the calculated preservation potential on simulated distributions of six virtual species to evaluate its impact on reconstructing species niches and geographic distributions through time. We found that preservation potential was highest in the Holocene and lowest during the deglacial, with the differences driven by variations in climate and the prevalence of Holocene archaeological sites. In all intervals, warm, wet, and highly seasonal environments exhibited low preservation potential. These spatial and temporal differences in preservation potential significantly influenced niche reconstructions and geographic predictions, particularly impeding model quality when species niches extended beyond the preservation niche. We warn that such distortions can lead to erroneous ecological inferences, including inaccurate predictions of species responses to environmental changes and mischaracterizations of community assembly processes. We propose that our approach to modelling preservation potential can be applied across different regions, time periods, and taxonomic groups to help correct distortions caused by sampling biases through weighted background point selection that reflects these processes. Ultimately, this framework enhances the ability to disentangle true ecological patterns from preservation artifacts, improving the reliability of fossil-based ecological and evolutionary inferences.

Dataset DOI: 10.5061/dryad.mpg4f4rbm

Description of the data and file structure

Files and variables

File: preservation_biases.zip

Description:

This zip file contains an RProject for our associated manuscript. The code contained in the scripts folder are sequential and prepare the necessary environmental data for fitting MaxEnt and PCA models to occurrence data of mammals, virtual species and fossil occurrences. Subsequently, we use these outputs and combine them in a 'virtual ecology' framework to explore how preservation potential can distort ecological niche models.

The primary outputs of our project are stored in the "./output/" folder, and are primarily split between our preservation potential layers (see the "predictor_data" and "taphonomy_maps" folders), the creation of real species ENMS (see the "occurrence_data" and "species_enm" folders), and the virtual species we created for our assessment of the impact of preservation biases (see the "virtual_species" folder). Within the "./output/virtual_species/" folder, the "true_niche", "abundance", "scaled_abundances", and "biased_maps" folders contain heat maps for each virtual species at each of our time periods - these formed the primary data for our study of the impacts of preservation potential on ecological niche modelling. The "niche_reconstruction" and "prediction_accuracy" folders contain the results of these analyses.

Additionally, we have included our three preservation potential layers outside of the zip file for ease of access: '01_LGM_taphonomy.tif', '02_DG_taphonomy.tif', '03_HOL_taphonomy.tif', referring to the Last Glacial Maximum (16,000 y.b.p.), the deglacial (13,500 y.b.p.), and the Holocene (6,000 y.b.p.), respectively. These are stored in the World Geodetic System 1984, and can be read into programs such as R with the appropriate geospatial data packages (e.g., terra).

DATA

The data folder has two sub-folders. The first contains the fossil occurrence data we used for our preservation bias layers (./data/fossil_data/fossil_occ_2024_02_27.csv). The fossil occurrence data is 2544 × 12: ‘siteid’ is the unique identifier for each fossil collection cite; ‘lat’/’long’ are the WGS84 coordinates for the occurrences; ‘collection’ indicates the unique ‘collection unit’ for each site; ‘depositionalenvironment’ is the Neotoma2 DB identifier for the depositional environment, while ‘depenvtTop’ is our aggregated classification of the depositional environments, as described in our manuscript; the ‘ageboundyounger’ and ‘ageboundolder’ correspond to the minimum and maximum estimated ages, in years before present (y.b.p.), for the collection units; ‘LGM’ (the Last Glacial Maximum), ‘DG’ (the deglacial period) and ‘HOL’ (the Holocene) are logical columns indicating whether a particular occurrence was included in one of our time period models; ‘Nintervals’ is numeric showing the number of time periods one specific occurrence was included in.

The second folder contains the mammalian density data. The species_densities.csv is 26 ×14; ‘order’, ‘genus’, and ‘species’ show the taxonomy for each entry, while ‘spp_code’ ascribes a six-letter species code using the first three letters of each species Latin binomial; ‘habitat’ provides a descriptor of the kinds of habitats the species commonly occupies; ‘min_density’, ‘mean_density’, and ‘max_density’ are the minimum, mean and maximum densities reported for the species by the attributed study; ‘mean_variation’ indicates the variation around the mean reported by the study, while ‘variation_type’ describes the type of variation reported; ‘sample_size’ reports the number of surveys undertaken by the study to estimate the densities; ‘notes’ includes notes made by us about the study in question that could affect the density estimates; ‘source’ names the title of the manuscript from which the densities were sourced and ‘hyperlink’ list the URL for the article. Note, any blank cells indicate that no data was available for these entries. We chose to leave these cells empty, rather than specifying "n/a" or a similar value, as R will automatically identify these values as NAs if left blank.

The ./output/virtual_species/enms/ directory contains subfolders organising the RDS files for all the dismo::maxent models we fitted for our virtual species investigation. They are categorised by the observation period > sampling type (biased or unmodified) > species codes > and sample size. Individual RDS files are labelled with their replicate number. As the dismo::maxent function outputs hundreds of files for each individual model, we will not exhaustively document them here. Instead, please see the dismo R packages vignette for a detailed description of the files and their contents.

The ./output/predictions/ directory contains the individual cross-validated model predictions we produced for each of our treatment groups. Each of these is a .tif file showing the probability of occurrence for the respective group with values ranging from 0 - 1. These rasters were subsequently summarised for our project.

In addition, our analysis requires environmental data sourced from GBIF, CHELSA-TRaCE21K and NaturalEarth, which are protected by copyright and so are not included in our repository. All data are freely available for their respective sources.

SCRIPTS/OUTPUT

All scripts were designed to be run sequentially, often depending on the output of prior scripts to function. Functions we created for our project are stored in the './scripts/functions/' directory, and are described in the following sections where they were used.

01_env_data_prep

This script processes the environmental raster files from CHELSA-TRaCE21K by cropping them to the extent of covering North America and masking out both contemporary glaciers and waterbodies. All environmental layers were sourced from CHELSA-TraCE21K (Karger et al. 2023), except slope, which was derived from the CHELSA-TraCE21K digital elevation model based on the surrounding eight grid cells. We used the NaturalEarth North American historic lakes shapefile (v5.0.0; naturalearthdata.com) to mask out all major water bodies for the present and the Holocene. In addition, it creates rasters that break up North America by geopolitical boundaries and creates our distance to road raster to account for sampling biases. These files are copyrighted by the authors but can be freely accessed from their website (see 'Access information' below)

02_mammal_occ_gbif

This script downloads and prepares the GBIF modern mammal occurrence download. The output of this script is stored in ‘./output/occurrence_data/gbif_mammal_occ_filtered.rds’, which contains four columns: ‘species’ – the Latin binomial species name; ‘decimal_latitude’ and ‘decimal_longitude’ – the WGS84 coordinates of the species occurrence; the ‘basis_of_record’ which indicates what kind of source the observation was ascribed to following GBIF conventions.

03_enm_fitting

This script fits ENM to the GBIF occurrences downloaded in 02_mammal_occ_gbif.R. This script requires the ‘./scripts/functions/response_curve’ function written by us to take the output from the MaxEnt models created and produce marginal response curves for each environmental variable for each species. In addition, it makes the ‘./output/species_enms/model_evaluation.csv’ file, which evaluates the performance of the ENM models for each species, listing the test area under the reciever-operator curve ('test_auc') measuring the models performance in predicting new data, along with the permutational importance of each environmental variable for that model ("permutational_importance"). In addition, it produces the heat maps for these real species which were subsequently used to create our virtual species analouges (see ‘./output/species_enms/heatmaps/’) - these maps are stored as .tif files with the maps in the WGS 84 (EPSG:4326) projection.

04_virtual_spp_creation

This script defines the functions for the virtual species analogs of the species ENMs fitted in 03_enm_fitting. This script directly stores no output, but rather the entire script is sourced directly by the preceding script (05_virtual_paleo_distributions.R).

05_virtual_paleo_distributions

This script takes the response definitions created in the above script and creates heat maps for each of our virtual species at three time periods during the late Quaternary and then converts these maps into abundance and relative abundance layers. It requires a modified version of the ‘virtualspecies::genSpFromFun’ function created by us (‘./scripts/functions/gen_sp_fun.R’) and the ‘…/populate_rast’ function to create abundance rasters. The abundance layers are stored in ‘./output/virtual_species/abundances/’ and scaled abundances are stored in ‘./output/virtual_species/scaled_abundances/’.

Subsequently, this script samples from these true relative abundance layers and fits maximum entropy models. All standard ‘dismo’ MaxEnt files are stored in ‘./output/virtual_species/true_niche/maxent_files/’. Note, the dismo::maxent() function produces a large number of files, and are not individually used in our analyses, and so are not detailed here. However, descriptions of these files can be found in the dismo package vignette. We have supplied these files to aid in reproducibility. All files are organised by their time period, and then their species name. These models were subsequently used to create heat maps of where these species would have lived in the past, which are stored in ‘./output/virtual_species/true_niche/enm_distributions/’ as .tif files in WGS 84.

06_taphonomy_models

This script takes the aforementioned fossil occurrence data, cleans it, and then spatially thins occurrences before fitting ENMs to these data. It relies on the output of script ‘01_env_data_prep’, which prepares the environmental and spatial bias data. The cleaned presence and newly produced absence data are stored in ‘./output/predictor_data/pres_abs_env_data.rds’. This object is a large nested list, with the primary list elements organised by time period. Within each of these elements are the environmental variable values, the ID of the rasters cell ("cell"), the cells longitude/ latitude ("x" and "y", respectively), the type of observation ("type"), the depositional environment of the presence observations (see Table S4 in the supplementary information for details), and the time period they relate too ("epoch"). These are the data that were used to create our preservation potential layers, which are stored in './output/taphonomy_maps/' as individual .tifs named by their time period codes, following the naming conventions described above.

07_occurrence_sampling

This script creates biased versions of the scaled abundance layers of each virtual species × time period . Then, it samples both the unmodified and biased layers repeatedly to get replicate occurrence samples for each data layer × species × time period × sample size. It requires two functions created by us: (1) ‘sample_occ’ (./scripts/functions/sample_occ.R), which samples occurrences and creates associated background points, and (2) ‘nested_extract’ (…/nested_extract.R), which extracts the environmental covariates from the associated layers for each sampled occurrence. The output of this script is a nested list (‘./output/virtual_species/occurrence_samples/env_occ_all.rds’), which contains all the sampled presences/background points for each replicate - it follows the same structure as the "pres_abs_env_data.rds" described in "06_taphonomy_models", although with an additional split to distinguish between the "unmodified" and "biased" occurrence samples.

08_sample_enms_parallel

This script uses the sampled data described above (07_occurrence_sampling) and fits ENMs to these data in parallel, and was designed to be run on the Ohio State Supercomputer Pitzer cluster. All dismo MaxEnt models are stored in ‘./output/virtual_species/enms/’ in nested folders denoting their time period > sampling type > virtual species > sample size > and replicates saved as individual RDS files (e.g., rep_*.rds). Details of the MaxEnt model object structure and contents are outlined in the dismo::maxent() help file and the R packages associated vignette.

09_niche_reconstruction

This script takes the true and sampled data described in 05_virtual_paleo_distributions and 07_occurrence_sampling, respectively, and assesses niche reconstruction. First, PCAs are fitted to the true data, then sampled data are superimposed, and their centroids are calculated. All summarised centroid data are stored as RDS objects in ‘./output/virtual_species/niche_reconstruction/centroids/’ as list objects, use the following naming convention: 'time period''sampling type''virtual species code'_'sample size'_centroid_ls.rds with each list element containing individual replicates.

10_prediction_accuracy

This script uses the products of 05_virtual_paleo_distributions, where the true MaxEnt ENMs are fitted, and 08_sample_enms_parallel, where each replicated MaxEnt ENM trained on sample data is created. It iterates through the true MaxEnt ENMs and samples 100 cells for each value between 0 – 1 at 0.01 increments in the layer. Our biased/unmodified models were then used to predict the habitat suitability for these sampled cells and the error measured.

Code/software

Much of our output files/models were stored as .RDS objects which can be read in R, the statistical computing program.

All data manipulation and analyses were performed in R (v4.4.0; R Core Team 2014) using an RStudio interface (v2024.04.2 Build 764 “Chocolate Cosmos”; Rstudio Team 2020). Data manipulations were carried out with dplyr (version 1.1.2; Wickham et al. 2023b), tidyr (version 1.3.0; Wickham et al. 2023a) , and stringr (version 1.5.0; Wickham 2023). Geospatial data manipulations were performed with terra (version 1.7-78; Hijmans et al. 2022), tidyterra (version 0.6.0; Hernangómez 2023), and sf (version 1.0-16; Pebesma 2018). MaxEnt algorithm models were fitted using dismo (version 1.3-14; Hijmans et al. 2017), while niche comparisons were performed with stats (version 4.3.0; R Core Team 2014). Virtual species were created using virtualspecies (version 1.6; Leroy et al. 2016) with a modified version of virtualspecies::generateSpFromFun utilize terra (see code repository for details). Parallelization was implemented using foreach (version 1.5.2; Weston and Calaway 2015) and doSNOW (version 1.0.20; Calaway and Weston 2017). All base visualization were created with ggplot2 (version 3.5.1; Wickham 2011), ggdensity (version 1.0.0; Otto and Kahle 2022), tidyterra and viridis (version 0.6.5). All analyses were conducted at the Ohio Supercomputer Center (1987) on the ‘Owens’ cluster.

Access information

Other publicly accessible locations of the data:

Environmental Rasters: https://chelsa-climate.org/chelsa-trace21k/
Paleo and historic waterbodies: https://www.naturalearthdata.com/downloads/10m-physical-vectors/10m-lakes/
Mammalian Occurrences: https://www.gbif.org/

Data was derived from the following sources:

Fossil Occurrences: https://www.neotomadb.org/