Data from: Silent past: Biogeographic gaps in the Cenozoic fossil archive
Data files
Oct 17, 2025 version files (724.69 MB):
- Fossil-Potential-Mapping.zip (724.67 MB)
- README.md (15.34 KB)
Abstract
This dataset accompanies an analysis of fossil information loss across the Cenozoic and integrates palaeoclimatic reconstructions, lithological sedimentary data, and fossil occurrences. The dataset includes processed climatic variables (temperature and precipitation) derived from the HadCM3 model at 14 geological intervals from 66 to 0 Ma, categorized into Köppen-Geiger climate zones. Lithological data were extracted from a generalized global geological map (Chorlton, 2007) to isolate sedimentary formations with fossil preservation potential. Fossil occurrence records were obtained from the Paleobiology Database (PBDB) as of February 2025. The R script provided merges and analyses these data layers to assess the spatial-temporal overlap between climate zones, sedimentary coverage, and fossil distribution. Outputs include estimates of information loss in the fossil record due to the absence of suitable depositional environments within specific climate zones. This dataset facilitates reproducibility of the analyses and supports reuse for studies involving palaeoclimatic modeling, fossil preservation biases, or Cenozoic biodiversity patterns. No ethical or legal restrictions apply to the use of these data.
Dataset DOI: 10.5061/dryad.34tmpg4wk
Description of the data and file structure
This dataset was generated as part of the analyses presented in “Silent Past: Biogeographic Gaps in the Cenozoic Fossil Archive” (Palaeogeography, Palaeoclimatology, Palaeoecology, 2025). The data compilation integrates paleoclimatic model outputs, sedimentary basin reconstructions, and fossil occurrence data to explore the spatial and environmental representativeness of the Cenozoic fossil record.
Specifically, the dataset combines:
- Paleoclimate simulations (temperature and precipitation) from the PALEOMAP/Scotese reconstructions across 14 time slices (0–66 Ma).
- Reconstructed sedimentary polygons, representing areas with potential fossil preservation for each time interval.
- Fossil occurrence data compiled from the Paleobiology Database (PBDB), spatially rotated to their paleocoordinates using GPlates.
The data processing workflow includes raster-based climate averaging, spatial extraction of climatic values from sedimentary and fossil datasets, and biome classification using Köppen–Geiger criteria. The resulting datasets provide harmonized climatic, sedimentary, and biotic information through the Cenozoic, enabling the quantification of geographic and environmental sampling biases in the fossil archive.
Files and variables
File: Fossil-Potential-Mapping.zip
Description: The dataset is organized in a single folder named Fossil-Potential-Mapping, which contains all scripts, data, and derived files used to reproduce the analyses of “Silent Past: Biogeographic Gaps in the Cenozoic Fossil Archive”.
1. Top-level contents
| File / Folder | Description |
|---|---|
| script.R | Main R script performing the full data processing workflow: loading input data, rotating palaeocoordinates, extracting climatic values, and generating summary statistics and plots. |
| support_functions.R | Secondary R script containing custom helper functions (e.g., raster extraction, palaeocoordinate rotation, and data cleaning utilities) called by the main script. |
| data/ | Directory containing all input datasets used in the analyses, divided into thematic subfolders. |
2. data/ subdirectories
a) fossils/
Contains fossil occurrence data from the Paleobiology Database (PBDB).
| File | Description |
|---|---|
| pbdb_data_27-02-25.csv | Raw PBDB data download (as of February 27, 2025) including all Cenozoic occurrences used in the study. |
| pbdb_data_27-02-25processed.csv | Cleaned and subsetted PBDB dataset containing only relevant columns for analysis (taxonomic group, coordinates, geological age, collection number, etc.). |
| fossil_data_rotated.csv | Fossil dataset with rotated palaeocoordinates for each occurrence, as reconstructed using GPlates rotations. |
Main variables:
| Variable | Description | Units / Notes |
|---|---|---|
| collection_no | PBDB collection identifier | - |
| accepted_name | Accepted taxonomic name | - |
| max_ma, min_ma | Maximum and minimum estimated ages | Ma |
| paleolat, paleolng | Palaeocoordinates (rotated) | degrees |
| modern_lat, modern_lng | Modern coordinates | degrees |
| environment | Depositional environment (PBDB field) | categorical |
| phylum, class, order, family | Taxonomic classification | - |
Missing values are coded as blank cells or NA when unavailable.
b) possible_fossil_reconstructed_dissolved/
Contains shapefiles of reconstructed sedimentary polygons for each Cenozoic time bin.
| File pattern | Description |
|---|---|
| possible_fossil_recons_[Ma]_dissolved.shp (and associated .dbf, .shx, .prj files) | Polygon shapefiles representing areas with potential fossil preservation for each time bin (0–69 Ma; analyses restricted to ≤66 Ma). |
Attributes within shapefiles:
| Variable | Description |
|---|---|
| ID | Polygon identifier |
| Age_Ma | Midpoint age of reconstruction |
| Area_km2 | Polygon area in km² |
| Lithology | Generalized lithological type (if present) |
| Preservation_potential | Qualitative estimate of fossil preservation potential |
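The file pattern above can be used to enumerate the shapefiles and recover the time-bin token from each name. The following is a minimal base-R sketch, not taken from script.R: the helper name `list_sediment_shapefiles` is hypothetical, and the `[Ma]` token is kept as text because its exact format is not specified here.

```r
# Hypothetical helper (not part of script.R): list the dissolved sedimentary
# shapefiles in `dir` and extract the [Ma] token from each file name.
list_sediment_shapefiles <- function(dir) {
  pattern <- "^possible_fossil_recons_(.+)_dissolved\\.shp$"
  files <- list.files(dir, pattern = pattern)
  data.frame(
    time_bin = sub(pattern, "\\1", files),  # token between the fixed name parts
    path = file.path(dir, files),
    stringsAsFactors = FALSE
  )
}

# Example call, assuming the working directory is Fossil-Potential-Mapping:
# list_sediment_shapefiles("data/possible_fossil_reconstructed_dissolved")
```

Each returned path could then be read with a spatial package such as sf, which is listed among the packages used.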
c) scotese/
Contains palaeoclimate model outputs and metadata from Scotese & Getech Cenozoic reconstructions.
| File | Description |
|---|---|
| bathymetry_Scotese_getech_Ceno.tar.nc | NetCDF file with Cenozoic palaeobathymetry reconstructions. |
| Ceno_temp_precip_Ceno.tar.nc | NetCDF file with gridded temperature and precipitation data. |
| Scotese_temp_precip_Ceno.tar.nc | Equivalent climatic archive (used for raster extraction in the script). |
| Expts_Scotese_Getech.xlsx | Metadata on simulation experiments, including time slice IDs, boundary conditions, and CO₂ levels. |
| format_script_atmos_vars_Ceno_Getech_Scotese_runs.txt | Log file describing the formatting of atmospheric variables. |
| fossil_climate_data.csv | Table linking fossil occurrences to their extracted climatic values (mean annual temperature, annual precipitation, etc.). |
| formatted_data/ | Folder containing climate raster outputs (temperature, precipitation, masks, orography) for each time bin (tfkea–tfkeo). Generated after executing untar("scotese_temp_prec_ceno.tar.nc") in script.R. |
The .tar.nc files (bathymetry_Scotese_getech_Ceno.tar.nc, Ceno_temp_precip_Ceno.tar.nc, and Scotese_temp_precip_Ceno.tar.nc) must be untarred prior to use. Each archive, when extracted, produces a folder structure containing individual NetCDF files for each Cenozoic time slice. These unpacked files are required inputs for the climate data extraction steps in script.R.
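A minimal sketch of the extraction step is shown below, using base R's `utils::untar()`. The helper name `extract_tar_nc` is hypothetical, and extracting each archive into the folder that contains it is an assumption; script.R may use a different target directory.

```r
# Hypothetical helper (not part of script.R): extract every *.tar.nc archive
# found in `dir` into that same directory (assumed target location).
extract_tar_nc <- function(dir = file.path("data", "scotese")) {
  archives <- list.files(dir, pattern = "\\.tar\\.nc$", full.names = TRUE)
  for (archive in archives) {
    utils::untar(archive, exdir = dir)
  }
  invisible(archives)  # return the archive paths that were processed
}

# Example call, assuming the working directory is Fossil-Potential-Mapping:
# extract_tar_nc("data/scotese")
```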
Main variables (in fossil_climate_data.csv):
| Variable | Description | Units |
|---|---|---|
| id | Internal fossil record identifier | - |
| LIDNUM | Original Paleobiology Database collection number | - |
| LAT, LONG | Modern latitude and longitude of fossil occurrence | degrees |
| MAX_AGE, MIN_AGE | Maximum and minimum estimated geological ages of the fossil occurrence | Ma |
| mid_point | Midpoint between MAX_AGE and MIN_AGE, used as representative age | Ma |
| LONG_rotated, LAT_rotated | Paleocoordinates (rotated to paleogeographic position using GPlates) | degrees |
| temp | Extracted mean annual temperature at fossil location | °C |
| prec | Extracted mean annual precipitation at fossil location | mm/year |
| Period | Geological period or time bin corresponding to the reconstruction | categorical |
Missing climatic or biome data are denoted as NA.
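As an illustration of how mid_point relates to MAX_AGE and MIN_AGE, the sketch below recomputes it on a toy table. The column names follow the variable list above; the age values are invented for the example, and the arithmetic mean is assumed from the description "midpoint between MAX_AGE and MIN_AGE".

```r
# Toy records using the documented column names; the values are invented.
fossils <- data.frame(
  id = 1:3,
  MAX_AGE = c(66.0, 33.9, NA),
  MIN_AGE = c(61.6, 23.03, 5.3)
)

# Midpoint age in Ma; a missing bound propagates to NA.
fossils$mid_point <- (fossils$MAX_AGE + fossils$MIN_AGE) / 2
```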
3. Missing values
Across all files, missing data are represented consistently as:
- NA (standard R convention), or
- blank cells (in CSVs exported from R when data were unavailable).
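When reading the CSV files, both conventions can be mapped to NA in one step. The sketch below uses base R's `read.csv()` with `na.strings`; the helper name `read_with_na` and the demo rows are illustrative and not taken from the dataset.

```r
# Hypothetical helper: read a CSV treating both blank cells and "NA" as missing.
read_with_na <- function(path) {
  read.csv(path, na.strings = c("", "NA"), stringsAsFactors = FALSE)
}

# Small demo file with invented rows, written to a temporary location.
demo <- tempfile(fileext = ".csv")
writeLines(c("collection_no,environment,paleolat",
             "1,marine,12.5",
             "2,,NA",
             "3,NA,-4"), demo)

occ <- read_with_na(demo)
na_counts <- colSums(is.na(occ))  # number of missing values per column
```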
Code/software
All analyses were performed in the R programming language (version 4.3.3), using freely available packages and open-source data formats.
1. Required software
- R (≥4.3) — main environment for running the scripts.
- RStudio (optional, for interactive use).
- GPlates (version 2.3 or later) — used externally to rotate fossil coordinates to paleocoordinates.
All scripts and data are compatible with standard open formats (.csv, .shp, .nc, .xlsx).
2. R packages
The following R packages were used to execute the analysis pipeline:
| Package | Purpose |
|---|---|
| tidyverse | Data manipulation and plotting (readr, dplyr, ggplot2, tidyr, etc.) |
| raster / terra | Reading and processing gridded climate and elevation data |
| sf | Handling and processing shapefiles and spatial geometries |
| sp | Legacy spatial support (for backward compatibility) |
| ncdf4 | Reading NetCDF paleoclimate files |
| dplyr | Data wrangling and summarization |
| ggplot2 | Visualization of climate–fossil relationships |
| RColorBrewer / viridis | Color palettes for plots |
| data.table | Efficient large data handling |
| tidyr | Data reshaping and cleaning |
| stringr | String manipulation for file naming |
| lubridate | Handling of time and date metadata |
| rnaturalearth | Adding base maps for modern reference outlines |
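Before installing anything, it can help to see which of the packages in the table are still missing from the local library. This is a minimal sketch; `missing_packages` is a hypothetical helper, and the package vector is copied from the table above.

```r
# Packages named in the table above.
required <- c("tidyverse", "raster", "terra", "sf", "sp", "ncdf4",
              "RColorBrewer", "viridis", "data.table", "stringr",
              "lubridate", "rnaturalearth")

# Hypothetical helper: return the packages that are not installed.
# system.file() returns "" when a package cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                      logical(1))
  pkgs[!installed]
}

to_install <- missing_packages(required)
# install.packages(to_install)  # uncomment to install only what is missing
```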
All required packages can be installed using:
install.packages(c("tidyverse", "sf", "sp", "raster", "terra", "ncdf4", "RColorBrewer",
"viridis", "data.table", "tidyr", "stringr", "lubridate", "rnaturalearth"))
3. Included code files
| File | Description |
|---|---|
| script.R | Main workflow script that loads data, extracts paleoclimate values for fossil and sedimentary polygons, classifies biomes, and generates summary statistics and plots. |
| support_functions.R | Supplementary script containing all custom functions called within script.R (e.g., coordinate rotation helpers, raster extraction, climate–biome classification). |
4. Workflow overview
- Input loading – The script reads fossil occurrence data (from data/fossils/) and climate/sediment data (from data/scotese/ and data/possible_fossil_reconstructed_dissolved/).
- Coordinate rotation – Paleocoordinates already rotated externally in GPlates are imported from fossil_data_rotated.csv.
- Climate extraction – Using raster/terra, temperature and precipitation values from Scotese paleoclimate rasters are extracted at each fossil and sediment location.
- Biome classification – The script applies Köppen–Geiger rules to classify paleobiomes based on MAT and MAP.
- Output generation – Results are exported as CSV files (e.g., fossil_climate_data.csv) and figures showing climate–fossil relationships and sampling bias analyses.
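To illustrate the biome classification step, the sketch below assigns coarse Köppen-style groups from MAT (°C) and MAP (mm/year) alone. It is not the classification implemented in support_functions.R. The function name `koppen_coarse` is hypothetical. The aridity rule uses the standard Köppen dryness threshold for evenly distributed rainfall (MAP < 20 × MAT + 140). The temperature cut-offs apply to MAT what the full scheme applies to monthly temperatures, which is a deliberate simplification.

```r
# Hypothetical, simplified sketch: coarse Köppen-style groups from annual means.
# mat: mean annual temperature (°C); map: mean annual precipitation (mm/year).
koppen_coarse <- function(mat, map) {
  # Köppen dryness threshold for evenly distributed precipitation (mm/year).
  dry_threshold <- 20 * mat + 140

  # Assumed temperature cut-offs on MAT (the full scheme uses monthly values).
  out <- ifelse(mat >= 18, "A", ifelse(mat >= 0, "C", "D/E"))

  # Arid climates override the temperature-based groups.
  out[which(map < dry_threshold)] <- "B"

  # Any missing input yields a missing class.
  out[is.na(mat) | is.na(map)] <- NA_character_
  out
}

classes <- koppen_coarse(mat = c(25, 25, 10, -5), map = c(2000, 300, 800, 400))
```

Applied to the temp and prec columns of fossil_climate_data.csv, a function of this shape would return one class per occurrence, with NA wherever either climate value is missing.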
5. Reproducibility
All analyses can be reproduced by:
- Opening script.R in R or RStudio.
- Setting the working directory to the folder Fossil-Potential-Mapping.
- Running the script from start to finish.
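Before running the script, a quick check that the working directory matches the unzipped folder layout can save a failed run. The sketch below is an assumption-laden convenience, not part of the distributed code: `check_layout` is a hypothetical helper, and the expected paths are taken from the file listing above.

```r
# Hypothetical helper: report which expected files or folders are missing
# under `root` (paths follow the layout described in this README).
check_layout <- function(root = ".") {
  expected <- c(
    "script.R",
    "support_functions.R",
    file.path("data", "fossils"),
    file.path("data", "possible_fossil_reconstructed_dissolved"),
    file.path("data", "scotese")
  )
  expected[!file.exists(file.path(root, expected))]
}

# Example use from the Fossil-Potential-Mapping folder:
# if (length(check_layout()) == 0) source("script.R")
```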
