Data from: Silent past: Biogeographic gaps in the Cenozoic fossil archive

Matamala-Pagès, Marta1; Hagen, Oskar2; Castro-Insua, Adrián1; Oliver, Adriana1; Méndez-Quintas, Eduardo1; Sotelo, Graciela1; Rey-Rodríguez, Iván1; Gamboa, Sara1; Galván, Sofía1; Varela, Sara1

Published Oct 17, 2025 on Dryad. https://doi.org/10.5061/dryad.34tmpg4wk

Data files

Oct 17, 2025 version files (724.69 MB total)

Fossil-Potential-Mapping.zip (724.67 MB)
README.md (15.34 KB)

Abstract

This dataset accompanies an analysis of fossil information loss across the Cenozoic and integrates palaeoclimatic reconstructions, lithological sedimentary data, and fossil occurrences. The dataset includes processed climatic variables (temperature and precipitation) derived from the HadCM3 model at 14 geological intervals from 66 to 0 Ma, categorized into Köppen-Geiger climate zones. Lithological data were extracted from a generalized global geological map (Chorlton, 2007) to isolate sedimentary formations with fossil preservation potential. Fossil occurrence records were obtained from the Paleobiology Database (PBDB) as of February 2025. The R script provided merges and analyses these data layers to assess the spatial-temporal overlap between climate zones, sedimentary coverage, and fossil distribution. Outputs include estimates of information loss in the fossil record due to the absence of suitable depositional environments within specific climate zones. This dataset facilitates reproducibility of the analyses and supports reuse for studies involving palaeoclimatic modeling, fossil preservation biases, or Cenozoic biodiversity patterns. No ethical or legal restrictions apply to the use of these data.

Dataset DOI: 10.5061/dryad.34tmpg4wk

Description of the data and file structure

This dataset was generated as part of the analyses presented in “Silent Past: Biogeographic Gaps in the Cenozoic Fossil Archive” (Palaeogeography, Palaeoclimatology, Palaeoecology, 2025). The data compilation integrates paleoclimatic model outputs, sedimentary basin reconstructions, and fossil occurrence data to explore the spatial and environmental representativeness of the Cenozoic fossil record.

Specifically, the dataset combines:

Paleoclimate simulations (temperature and precipitation) from the PALEOMAP/Scotese reconstructions across 14 time slices (0–66 Ma).
Reconstructed sedimentary polygons, representing areas with potential fossil preservation for each time interval.
Fossil occurrence data compiled from the Paleobiology Database (PBDB), spatially rotated to their paleocoordinates using GPlates.

The data processing workflow includes raster-based climate averaging, spatial extraction of climatic values from sedimentary and fossil datasets, and biome classification using Köppen–Geiger criteria. The resulting datasets provide harmonized climatic, sedimentary, and biotic information through the Cenozoic, enabling the quantification of geographic and environmental sampling biases in the fossil archive.
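
The spatial extraction step can be illustrated with a minimal sketch, assuming the `terra` package is installed. The toy raster, the layer name `temp`, and the helper `extract_at_points()` are hypothetical stand-ins and are not taken from `script.R`; the real workflow would read the NetCDF time slices instead.

```r
# Hypothetical helper: look up raster values at longitude/latitude points.
extract_at_points <- function(r, lon, lat) {
  xy <- cbind(lon, lat)                    # x (longitude) first, then y (latitude)
  vals <- as.data.frame(terra::extract(r, xy))
  vals[[names(r)[1]]]                      # values of the first layer, one per point
}

if (requireNamespace("terra", quietly = TRUE)) {
  # Toy 2 x 2 global raster standing in for one palaeoclimate time slice.
  # Cells are filled row by row, starting at the top-left corner.
  toy <- terra::rast(nrows = 2, ncols = 2, xmin = -180, xmax = 180,
                     ymin = -90, ymax = 90, vals = c(1, 2, 3, 4))
  names(toy) <- "temp"

  # One point in the north-west cell and one in the south-east cell.
  toy_vals <- extract_at_points(toy, lon = c(-90, 90), lat = c(45, -45))
}
```

The same call pattern would apply to rotated fossil coordinates, provided longitude and latitude are passed in that order.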

Files and variables

File: Fossil-Potential-Mapping.zip

Description: The dataset is organized in a single folder named Fossil-Potential-Mapping, which contains all scripts, data, and derived files used to reproduce the analyses of “Silent Past: Biogeographic Gaps in the Cenozoic Fossil Archive”.

1. Top-level contents

File / Folder	Description
script.R	Main R script performing the full data processing workflow: loading input data, rotating palaeocoordinates, extracting climatic values, and generating summary statistics and plots.
support_functions.R	Secondary R script containing custom helper functions (e.g., raster extraction, palaeocoordinate rotation, and data cleaning utilities) called by the main script.
data/	Directory containing all input datasets used in the analyses, divided into thematic subfolders.

2. `data/` subdirectories

a) `fossils/`

Contains fossil occurrence data from the Paleobiology Database (PBDB).

File	Description
pbdb_data_27-02-25.csv	Raw PBDB data download (as of February 27, 2025) including all Cenozoic occurrences used in the study.
pbdb_data_27-02-25processed.csv	Cleaned and subsetted PBDB dataset containing only relevant columns for analysis (taxonomic group, coordinates, geological age, collection number, etc.).
fossil_data_rotated.csv	Fossil dataset with rotated palaeocoordinates for each occurrence, as reconstructed using GPlates rotations.

Main variables:

Variable	Description	Units / Notes
`collection_no`	PBDB collection identifier	—
`accepted_name`	Accepted taxonomic name	—
`max_ma`, `min_ma`	Maximum and minimum estimated ages	Ma
`paleolat`, `paleolng`	Palaeocoordinates (rotated)	degrees
`modern_lat`, `modern_lng`	Modern coordinates	degrees
`environment`	Depositional environment (PBDB field)	categorical
`phylum`, `class`, `order`, `family`	Taxonomic classification	—

Missing values are coded as blank cells or NA when unavailable.
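
Because missing values may appear either as blank cells or as NA, it can help to declare both when reading the CSV files. The following is a minimal sketch in base R; the helper name `read_fossil_csv()` is hypothetical, and the toy rows are invented purely to show the behaviour.

```r
# Hypothetical helper: read a fossil CSV, treating blank cells and "NA" as missing.
read_fossil_csv <- function(path) {
  read.csv(path, na.strings = c("", "NA"), stringsAsFactors = FALSE)
}

# Toy file using column names from the table above; the values are invented.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("collection_no,accepted_name,max_ma,min_ma,environment",
             "1,Equus,5.3,2.6,fluvial",
             "2,Hipparion,11.6,,",
             "3,NA,23.0,16.0,lacustrine"), demo_path)

fossils <- read_fossil_csv(demo_path)
colSums(is.na(fossils))   # number of missing values per column
```

With the real data, the path would point to one of the files in `data/fossils/`.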

b) `possible_fossil_reconstructed_dissolved/`

Contains shapefiles of reconstructed sedimentary polygons for each Cenozoic time bin.

File pattern	Description
`possible_fossil_recons_[Ma]_dissolved.shp` (and associated .dbf, .shx, .prj files)	Polygon shapefiles representing areas with potential fossil preservation for each time bin (0–69 Ma; analyses restricted to ≤66 Ma).

Attributes within shapefiles:

Variable	Description
`ID`	Polygon identifier
`Age_Ma`	Midpoint age of reconstruction
`Area_km2`	Polygon area in km²
`Lithology`	Generalized lithological type (if present)
`Preservation_potential`	Qualitative estimate of fossil preservation potential
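
A single time bin could be loaded along the following lines. This is a sketch under two assumptions: that `[Ma]` in the file pattern is a whole number, and that the `sf` package is installed. The helper names `sediment_shp_path()` and `read_sediment_polygons()` are hypothetical.

```r
# Hypothetical helper: build the shapefile path for one time bin.
# Assumes [Ma] in the file pattern is a whole number.
sediment_shp_path <- function(age_ma,
                              dir = file.path("data",
                                              "possible_fossil_reconstructed_dissolved")) {
  file.path(dir, sprintf("possible_fossil_recons_%d_dissolved.shp",
                         as.integer(age_ma)))
}

# Read the polygons for one bin; returns NULL if the file or sf is unavailable.
read_sediment_polygons <- function(age_ma, ...) {
  path <- sediment_shp_path(age_ma, ...)
  if (!file.exists(path) || !requireNamespace("sf", quietly = TRUE)) {
    return(NULL)
  }
  sf::st_read(path, quiet = TRUE)
}

polys <- read_sediment_polygons(10)   # NULL unless run inside the dataset folder
```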

c) `scotese/`

Contains palaeoclimate model outputs and metadata from Scotese & Getech Cenozoic reconstructions.

File	Description
bathymetry_Scotese_getech_Ceno.tar.nc	NetCDF file with Cenozoic palaeobathymetry reconstructions.
Ceno_temp_precip_Ceno.tar.nc	NetCDF file with gridded temperature and precipitation data.
Scotese_temp_precip_Ceno.tar.nc	Equivalent climatic archive (used for raster extraction in the script).
Expts_Scotese_Getech.xlsx	Metadata on simulation experiments, including time slice IDs, boundary conditions, and CO₂ levels.
format_script_atmos_vars_Ceno_Getech_Scotese_runs.txt	Log file describing the formatting of atmospheric variables.
fossil_climate_data.csv	Table linking fossil occurrences to their extracted climatic values (mean annual temperature, annual precipitation, etc.).
formatted_data/	Folder containing climate raster outputs (temperature, precipitation, masks, orography) for each time bin (tfkea–tfkeo). Generated after executing `untar("scotese_temp_prec_ceno.tar.nc")` in `script.R`.

The .tar.nc files (bathymetry_Scotese_getech_Ceno.tar.nc, Ceno_temp_precip_Ceno.tar.nc, and Scotese_temp_precip_Ceno.tar.nc) must be untarred prior to use. Each archive, when extracted, produces a folder structure containing individual NetCDF files for each Cenozoic time slice. These unpacked files are required inputs for the climate data extraction steps in script.R.
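
One way to unpack the archives from within R is sketched below, using `utils::untar()`. The helper name `extract_climate_archives()` is hypothetical, and extracting into the same folder as the archives is an assumption; `script.R` may organise the output differently.

```r
# Hypothetical helper: unpack every .tar.nc archive found in a folder.
extract_climate_archives <- function(scotese_dir, exdir = scotese_dir) {
  archives <- list.files(scotese_dir, pattern = "\\.tar\\.nc$", full.names = TRUE)
  for (archive in archives) {
    utils::untar(archive, exdir = exdir)
  }
  invisible(archives)   # return the archive paths without printing them
}

# Usage, assuming the working directory is Fossil-Potential-Mapping:
# extract_climate_archives(file.path("data", "scotese"))
```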

Main variables (in fossil_climate_data.csv):

Variable	Description	Units
`id`	Internal fossil record identifier	—
`LIDNUM`	Original Paleobiology Database collection number	—
`LAT`, `LONG`	Modern latitude and longitude of fossil occurrence	degrees
`MAX_AGE`, `MIN_AGE`	Maximum and minimum estimated geological ages of the fossil occurrence	Ma
`mid_point`	Midpoint between `MAX_AGE` and `MIN_AGE`, used as representative age	Ma
`LONG_rotated`, `LAT_rotated`	Paleocoordinates (rotated to paleogeographic position using GPlates)	degrees
`temp`	Extracted mean annual temperature at fossil location	°C
`prec`	Extracted mean annual precipitation at fossil location	mm/year
`Period`	Geological period or time bin corresponding to the reconstruction	categorical

Missing climatic or biome data are denoted as NA.
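
As a worked illustration of how `mid_point` relates to `MAX_AGE` and `MIN_AGE`, and of how NA climate values might be filtered, consider the sketch below. The data frame is a toy example with invented values; only the column names come from the table above.

```r
# Toy records using column names from the table above; the values are invented.
fossil_climate <- data.frame(
  id      = 1:3,
  MAX_AGE = c(5.3, 23.0, 66.0),
  MIN_AGE = c(2.6, 16.0, 56.0),
  temp    = c(14.2, NA, 21.5),
  prec    = c(820, 1100, NA)
)

# Representative age: midpoint between the maximum and minimum ages (Ma).
fossil_climate$mid_point <- (fossil_climate$MAX_AGE + fossil_climate$MIN_AGE) / 2

# Keep only records with both climate variables available.
complete_climate <- fossil_climate[!is.na(fossil_climate$temp) &
                                     !is.na(fossil_climate$prec), ]
```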

3. Missing values

Across all files, missing data are represented consistently as:

NA (standard R convention)
or blank cells (in CSVs exported from R when data were unavailable).

Code/software

All analyses were performed in the R programming language (version 4.3.3), using freely available packages and open-source data formats.

1. Required software

R (≥4.3) — main environment for running the scripts.
RStudio (optional, for interactive use).
GPlates (version 2.3 or later) — used externally to rotate fossil coordinates to paleocoordinates.

All scripts and data are compatible with standard open formats (.csv, .shp, .nc, .xlsx).

2. R packages

The following R packages were used to execute the analysis pipeline:

Package	Purpose
tidyverse	Data manipulation and plotting (readr, dplyr, ggplot2, tidyr, etc.)
raster / terra	Reading and processing gridded climate and elevation data
sf	Handling and processing shapefiles and spatial geometries
sp	Legacy spatial support (for backward compatibility)
ncdf4	Reading NetCDF paleoclimate files
dplyr	Data wrangling and summarization
ggplot2	Visualization of climate–fossil relationships
RColorBrewer / viridis	Color palettes for plots
data.table	Efficient large data handling
tidyr	Data reshaping and cleaning
stringr	String manipulation for file naming
lubridate	Handling of time and date metadata
rnaturalearth	Adding base maps for modern reference outlines

All required packages can be installed using:

install.packages(c("tidyverse", "sf", "sp", "raster", "terra", "ncdf4", "RColorBrewer",
                   "viridis", "data.table", "tidyr", "stringr", "lubridate", "rnaturalearth"))

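To avoid reinstalling packages that are already present, one could first check which ones are missing. This is a minimal sketch; the helper name `missing_packages()` is hypothetical, and the list omits dplyr, ggplot2, tidyr, stringr, and lubridate on the assumption that they are installed together with tidyverse.

```r
# Packages named in the table above (tidyverse components omitted).
pkgs <- c("tidyverse", "sf", "sp", "raster", "terra", "ncdf4", "RColorBrewer",
          "viridis", "data.table", "rnaturalearth")

# Hypothetical helper: return the packages that are not yet installed.
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install
}
```
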
3. Included code files

File	Description
script.R	Main workflow script that loads data, extracts paleoclimate values for fossil and sedimentary polygons, classifies biomes, and generates summary statistics and plots.
support_functions.R	Supplementary script containing all custom functions called within `script.R` (e.g., coordinate rotation helpers, raster extraction, climate–biome classification).

4. Workflow overview

Input loading – The script reads fossil occurrence data (from data/fossils/) and climate/sediment data (from data/scotese/ and data/possible_fossil_reconstructed_dissolved/).
Coordinate rotation – Paleocoordinates already rotated externally in GPlates are imported from fossil_data_rotated.csv.
Climate extraction – Using raster/terra, temperature and precipitation values from Scotese paleoclimate rasters are extracted at each fossil and sediment location.
Biome classification – The script applies Köppen–Geiger rules to classify paleobiomes based on mean annual temperature (MAT) and mean annual precipitation (MAP).
Output generation – Results are exported as CSV files (e.g., fossil_climate_data.csv) and figures showing climate–fossil relationships and sampling bias analyses.
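
The biome classification step can be illustrated in part. The sketch below covers only the arid (B) test of the Köppen–Geiger scheme, using the commonly cited threshold from Peel et al. (2007) for the case where precipitation is not concentrated in one season. It is not the rule set implemented in `support_functions.R`, and the function name `classify_aridity()` is hypothetical.

```r
# Hypothetical helper: arid (B) test only, from MAT (deg C) and MAP (mm/year).
# Assumes rainfall is not concentrated in summer or winter.
classify_aridity <- function(mat, map) {
  p_threshold <- 2 * mat + 14
  ifelse(map < 5 * p_threshold, "BW",           # desert
         ifelse(map < 10 * p_threshold, "BS",   # steppe
                "non-arid"))
}

arid_class <- classify_aridity(mat = c(20, 20, 10, NA),
                               map = c(100, 400, 1000, 500))
```

Records with a missing MAT or MAP are returned as NA, in line with the missing-value convention used in the data files.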

5. Reproducibility

All analyses can be reproduced by:

Opening script.R in R or RStudio.
Setting the working directory to the folder Fossil-Potential-Mapping.
Running the script from start to finish.
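
Before running the full workflow, it may be worth confirming that the working directory has the expected layout. The following sketch checks for the top-level items described above; the helper name `check_layout()` is hypothetical.

```r
# Hypothetical helper: report which expected files or folders are missing.
check_layout <- function(root = ".") {
  expected <- c("script.R", "support_functions.R",
                file.path("data", "fossils"),
                file.path("data", "possible_fossil_reconstructed_dissolved"),
                file.path("data", "scotese"))
  expected[!file.exists(file.path(root, expected))]
}

missing_paths <- check_layout(".")
if (length(missing_paths) > 0) {
  message("Missing from working directory: ",
          paste(missing_paths, collapse = ", "))
}
# source("script.R")   # uncomment to run once nothing is reported missing
```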