Data and R code from: Global Phanerozoic biodiversity, can variation be explained by spatial sampling intensity

Phillipi, Daniel

Research facility: Syracuse University

Published Jul 27, 2024 on Dryad. https://doi.org/10.5061/dryad.2280gb621

Data files

Jul 27, 2024 version files (1.17 GB total):

obis_subset_no_decapoda.csv (575.01 MB)
PBDB_Dataset1_edited_v2_forR.csv (597.48 MB)
phanerozoic_ts.csv (8.28 KB)
README.md (9.39 KB)

Abstract

Variation in observed global generic richness over the Phanerozoic must be partly explained by changes in the numbers of fossils and their geographic spread over time. The influence of sampling intensity (i.e., the number of samples) has been well addressed, but the extent to which the geographic distribution of samples might influence recovered biodiversity is comparatively unknown. To investigate this question, we create models of genus richness through time by resampling the same occurrence dataset of modern global biodiversity using spatially explicit sampling intensities defined by the paleo-coordinates of fossil occurrences from successive time intervals. Our steady-state null model explains about half of observed change in uncorrected fossil diversity and a quarter of variation in sampling-standardized diversity estimates. The inclusion in linear models of two additional explanatory variables associated with the spatial array of fossil data (absolute latitudinal range of occurrences, percent of occurrences from shallow environments) and a Cenozoic step increases the accuracy of steady-state models, accounting for 67% of variation in sampling-standardized estimates and more than one third of the variation in first differences. Our results make clear that the spatial distribution of samples is at least as important as numerical sampling intensity in determining the trajectory of recovered fossil biodiversity through time, and caution against overinterpreting both the variation and the trend that emerge from analyses of global Phanerozoic diversity.


These data records accompany our manuscript, which asks: "What if we sampled biodiversity from the modern world's oceans with the same spatial completeness as the fossil record?" To answer this, we needed fossil data (from the PBDB, the Paleobiology Database), records of the spatial distribution of modern marine organisms (from OBIS, the Ocean Biodiversity Information System), and spatial statistics to determine how to sample those records.

These files contain the original data files (altered in some cases) and the R code used to perform the statistical analyses and create the charts.

Methodology

First, we manually download and clean fossil data from the Paleobiology Database. In our case, these are marine invertebrates from the entire Phanerozoic, but other groups or time periods could be selected.

Second, we identify taxonomic groups that exist in both the fossil dataset and the database of modern marine organisms (OBIS). In our case, we identify the taxonomic orders shared by the two. A file containing the records of modern marine animals is downloaded and parsed automatically using the OBIS R API.

Using these two datasets, the goal of the program is to identify modern geographic grid cells (equal-area hexagonal cells, roughly 100 km across at the equator) that contain organismal records and whose spatial distribution resembles that of the fossil data in each time bin spanning the temporal range of the fossil dataset. That is, for each of ~50 fossil time bins, a set of grid cells containing modern marine organism records is selected based on the distribution of fossil data within that time bin.

When matching cells, we consider the most important element to be the range in latitude represented in the dataset, because it is generally understood that biodiversity correlates with latitude in both modern and fossil settings. We consider both the absolute range and the actual range in latitude to be important (i.e., there is a difference between sampling only the northern hemisphere and sampling both north and south). Second, we consider the longitudinal range but not the actual values of longitude: if the fossil data span 160 degrees of longitude, we try to sample the same span in the modern grid cells, but we don't care whether the grid cells are mainly from the western or eastern hemisphere.
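The spatial summaries described above can be sketched in base R as follows. This is only an illustration, not the logic of cell_selector_v4.R: the function name, the column names `lat` and `lng`, and the reading of "absolute range" as the range of absolute latitude are all assumptions, and the longitudinal span here ignores wrap-around at the antimeridian.

```r
# Hypothetical helper: summarise the spatial footprint of a set of grid cells.
# `cells` is assumed to be a data frame of cell centroids with columns
# `lat` and `lng` (illustrative names only).
spatial_footprint <- function(cells) {
  list(
    lat_min       = min(cells$lat),               # actual (signed) latitudinal range
    lat_max       = max(cells$lat),
    abs_lat_range = diff(range(abs(cells$lat))),  # range of absolute latitude
    lng_span      = diff(range(cells$lng))        # span only; position is ignored
  )
}

# Toy inputs: same absolute latitudinal range and longitudinal span,
# but one set sits in a single hemisphere and the other crosses the equator.
northern_only <- data.frame(lat = c(10, 25, 40),  lng = c(-100, -60, -20))
both_hemis    <- data.frame(lat = c(-40, 10, 25), lng = c(20, 60, 100))

fp_n <- spatial_footprint(northern_only)
fp_b <- spatial_footprint(both_hemis)
```

The two toy sets have identical `abs_lat_range` and `lng_span` but different `lat_min` and `lat_max`, which is why both kinds of latitudinal range are tracked.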

Once these cells have been selected, a resampling protocol is run, based on the actual distribution of fossil occurrences among grid cells. For example, if one fossil grid cell has many samples and the others have few, then one modern cell will be sampled many times and the other modern cells few times. The number of taxa discovered through sampling is then counted. Since the true number of taxa is always the same (because the same underlying modern marine dataset is used each time), differences in the number of taxa discovered per replicate are due only to changes in the spatial structure of sampling.
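A minimal sketch of this kind of spatially structured resampling is given below. It is not the code in resampler_v4_sqs.R: the function name, the `cell` and `genus` columns, the one-to-one pairing of fossil and modern cells, and the choice to draw with replacement are all assumptions, and the data are synthetic.

```r
# Illustrative only: count genera recovered when each matched modern cell is
# sampled as intensively as its paired fossil cell.
# `modern` is assumed to hold one row per modern occurrence (columns `cell`, `genus`);
# `fossil_counts[i]` is the number of fossil occurrences paired with `matched_cells[i]`.
resample_richness <- function(modern, matched_cells, fossil_counts) {
  stopifnot(length(matched_cells) == length(fossil_counts))
  genera <- character(0)
  for (i in seq_along(matched_cells)) {
    pool <- as.character(modern$genus[modern$cell == matched_cells[i]])
    if (length(pool) == 0) next
    # Assumption: occurrences are drawn with replacement within a cell.
    draws  <- pool[sample.int(length(pool), fossil_counts[i], replace = TRUE)]
    genera <- union(genera, draws)
  }
  length(genera)  # genera recovered in this replicate
}

# Synthetic modern dataset with three cells and partly overlapping genus pools.
set.seed(1)
modern <- data.frame(
  cell  = rep(c("A", "B", "C"), each = 50),
  genus = c(sample(paste0("g", 1:20),  50, replace = TRUE),
            sample(paste0("g", 15:40), 50, replace = TRUE),
            sample(paste0("g", 35:60), 50, replace = TRUE))
)

# One heavily sampled fossil cell and two sparsely sampled ones.
richness <- replicate(
  100,
  resample_richness(modern, c("A", "B", "C"), fossil_counts = c(40, 3, 3))
)
```

Because the `modern` table never changes between calls, any spread in `richness` comes only from how the draws are distributed among cells.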

Finally, we make statistical comparisons between the diversity of taxa within the fossil record and the number of taxa identified through our resampling protocol, to attempt to understand how much spatial sampling bias is influencing our perception of fossil biodiversity.
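As a rough illustration of this comparison, the sketch below fits linear models to raw values and to first differences using base R. The two series are synthetic placeholders generated on the spot, not results from the paper, and the variable names are hypothetical; the models in the manuscript also include additional explanatory variables that are omitted here.

```r
# Synthetic placeholder series for ~50 time bins (NOT data from the paper).
set.seed(42)
n_bins <- 50
null_richness   <- cumsum(rnorm(n_bins, mean = 2, sd = 10)) + 300
fossil_richness <- 0.8 * null_richness + rnorm(n_bins, sd = 15)

# Raw values: share of variance in fossil richness explained by the null model.
fit_raw <- lm(fossil_richness ~ null_richness)
r2_raw  <- summary(fit_raw)$r.squared

# First differences: compares bin-to-bin changes, which removes any shared
# long-term trend from the comparison.
fit_diff <- lm(diff(fossil_richness) ~ diff(null_richness))
r2_diff  <- summary(fit_diff)$r.squared
```

Comparing `r2_raw` with `r2_diff` separates agreement in the long-term trend from agreement in short-term fluctuations.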

Description of the data and file structure

There are two important elements within the dataset:

1) Data: these are the original data files used in the analysis, including a fossil occurrence dataset (which has been manually cleaned) and a modern marine organism dataset. You will also need paleogeographic reconstructions - specifically the Scotese reconstructions, which are available on this site: https://www.earthbyte.org. They are used in this analysis in a completely unaltered state, but I can't provide them as part of the dataset.

2) Scripts: this contains the original R code that was used to transform and analyze the datasets.

There are also saved R workspaces with the final results of my analyses, which can be loaded directly.

NULL and Missing Values

There are many instances of "empty" cells within the PBDB data file - these should be treated as NULL values, and represent data fields that were not entered when the original data was uploaded to the database. Typically, PBDB columns with many NULL values were not used in our analysis, and in cases where NULL values were present in an important column, all NULL entries were removed.
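One way to apply this treatment when loading the file is sketched below with base R. The inline CSV text stands in for the real PBDB file, and the column names and the choice of "important" columns are illustrative assumptions.

```r
# Inline stand-in for the PBDB file; values are made up for illustration.
csv_text <- "occurrence_no,genus,paleolat,paleolng
1,GenusA,12.5,-30.1
2,,8.0,15.2
3,GenusB,,
4,GenusC,-5.5,100.0"

# Read empty fields as NA so they behave as missing values in R.
pbdb <- read.csv(text = csv_text, na.strings = c("", "NA"),
                 stringsAsFactors = FALSE)

# Keep only rows where every column needed for the analysis is present.
required   <- c("genus", "paleolat", "paleolng")
pbdb_clean <- pbdb[complete.cases(pbdb[, required]), ]
```

For the real dataset, the same `na.strings` argument can be passed when reading the CSV from disk.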

Code/Software

There are many custom R scripts included in this dataset. They do everything from transforming data and performing statistical analyses of many kinds to creating custom plots. The analysis is started from the file "chapter3_main.R", which runs the other scripts.

If you wish to look into the details of the other scripts, they are available in the "scripts" folder. Each script has code comments explaining the general strategy of what I was trying to do, but the scripts themselves are complex and have gone through many iterations. In some cases, I have included previous iterations of the scripts, which may help further contextualize how they evolved as the project developed. In those cases, the naming scheme is like "script_v3.R", where the highest version number is the one that was actually used in the analysis.

The most important script elements are:

1) OBIS_interface.R. Accesses the OBIS API and downloads the modern marine organism data used in the analyses.

2) OBIS_combine.R. Performs some additional transformations of the downloaded files, and combines them into one readable dataset.

3) cell_selector_v4.R. Identifies which geographic grid cells should be sampled from the modern marine record, based on the true sampling of the fossil record. The core of the analysis happens here.

4) resampler_v4_sqs.R. Performs the repeated resampling to estimate spatially-constrained biodiversity within each temporal time slice.

5) diversity_plots.R. Creates many of the initial plots and outputs a lot of the most important statistics comparing the spatially-resampled fossil and modern biodiversity metrics.

The other scripts perform various other functions, some of which are supplementary, some of which perform a specific analysis that was requested by a curious reviewer. Not all of these made it into the final manuscript, but I have included them for completeness.

Settings and Variables

These settings exist in the chapter3_main.R file. The goal is to have all the important settings exposed in this file, and then the other scripts are run from this main file. The default settings as they are now will produce the main analysis presented in the paper, and the alternate analyses (such as randomizing the geographic positions of fossil records) can be done by changing these settings.

taxon_remover: (list - string) used to ignore certain taxonomic orders in the OBIS data as described in the manuscript. Specifically, we ignored Decapoda, but other groups could be added to this list. The strings should be the name of the file which is downloaded from the OBIS (e.g., Decapoda_records.csv).

obisOutputFilename: (string) the output file of the combined OBIS records - the program will download many files separately, then combine them into a single dataset with this name.

pbdbDateasetFilename: (string) the filename of the PBDB data that will be loaded into the program.

pbdbTaxaFilename: (string) a filename for a csv containing a list of taxonomic orders to search for in the OBIS. In our case, all the marine invertebrate orders present in the PBDB.

clip2Shallow: (bool) restricts the program to looking for PBDB-OBIS matches in the shallow parts of the modern ocean, rather than taking modern OBIS occurrences from, e.g., the middle of the Pacific. Default is True.

test_modernWithPBDB: (bool) when enabled, the analysis does not use OBIS data; instead, it compares the PBDB record to a time-independent copy of the PBDB, where all the occurrences for a given geographical grid cell are lumped together regardless of their geological time.

l_breaks: (list - decimal) how should the longitudinal portion of the data be broken up?

cell_replicates: (int) how many alternative cell groups should be selected when looking for geographically similar cells in the modern dataset. I.e., the program tries to select modern geographical cells with a latitudinal and longitudinal distribution similar to that of each time slice in the paleo record, and this setting controls how many times that process is performed. Then, one of the results is chosen at random per overall resample.

save_altered_geometry2file: (bool) should the program try to save the modified shapefiles produced as part of the analysis? (Can produce crashes, so this setting should not be turned on).

resample_replicates: (int) how many sampling replicates should be run - i.e., a cell selection is chosen, then sampling is performed, and the steps are repeated.

randomize: (bool) randomize the geographic cell (via sampling without replacement - i.e., scrambling the cells) that each PBDB record belongs to.
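The scrambling behind the randomize setting can be pictured with the base R sketch below. This is one interpretation, not the code used in the analysis: it permutes cell labels record by record, and the `cell` column name and toy data are hypothetical.

```r
# Toy PBDB-like table; `cell` is a hypothetical column holding grid cell IDs.
set.seed(7)
pbdb <- data.frame(
  genus = paste0("genus_", 1:8),
  cell  = c("c1", "c1", "c1", "c2", "c2", "c3", "c4", "c4"),
  stringsAsFactors = FALSE
)

randomized <- pbdb
# sample.int(n) returns a permutation of 1..n, i.e. sampling without replacement,
# so every cell keeps its number of records while the records are reassigned.
randomized$cell <- pbdb$cell[sample.int(nrow(pbdb))]
```

Under this interpretation the per-cell sampling intensity is preserved, and only the association between records and locations is broken.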