Museum ‘dark data’ show variable impacts on deep-time biogeographic and evolutionary history
Abstract
The age of digitally accessible datasets has transformed palaeontology, enabling previously impossible macroevolutionary insights. However, a substantial reservoir of generally inaccessible ‘dark data’ resides within museum collections, which may alter our understanding of ancient groups and their ecological and evolutionary history. We demonstrate how the addition of data held exclusively in museums impacts our macroevolutionary understanding of an entire taxonomic group, using a dataset of Palaeozoic echinoids containing the majority of museum occurrences for the clade. We find that museum ‘dark data’ shows clear differences in composition compared to data available in the published literature and strongly impacts biogeographic patterns, increasing the geographic range size of taxa by 35% on average. Global model results assessing drivers of diversity are also significantly affected by the addition of museum-only data. Conversely, ‘dark data’ has a more limited impact on the temporal ranges of taxa and on estimates of overall diversity, and is affected by similar socio-geographic biases to the published record. These findings show that unpublished museum data are necessary for obtaining a complete understanding of macroevolutionary patterns in deep time, illustrating the importance of the collection, curation, digitisation, and continued care of ‘dark data’ in the age of ‘Big Data’ in palaeobiology.
README: Museum ‘dark data’ show variable impacts on deep-time biogeographic and evolutionary history
https://doi.org/10.5061/dryad.tmpg4f57t
Description of the data and file structure
This dataset contains the following information:
- Three R scripts which generate all analyses used in the paper:
  - Dark_Functions.R: All functions used to run analyses.
  - Dark_Setup.R: Script to set up the data ready for analyses.
  - Dark_Main_Script.R: Script to run all analyses, which calls the first two scripts.
- A folder of additional data, containing time bins used for analysis.
- A folder of occurrence data, containing the main datasets used as well as additional datasets of Palaeozoic echinoids, echinoderms and invertebrates used for analysis.
- Two files of Electronic Supplementary Materials, additionally submitted with the main manuscript.
A general note: all appearances of "N/A" in datasets indicate missing/not available data, rather than non-applicable data.
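Because missing values are written as "N/A" rather than R's default "NA", columns containing them may be read as text unless the marker is declared on import. The snippet below is a minimal sketch of one way to handle this; the toy table and its column names are hypothetical and are not taken from the real files.

```r
# Toy CSV standing in for one of the dataset files.
# Assumption: the column names below are invented for illustration only.
toy_csv <- "taxon,max_ma,min_ma
Taxon_A,358.9,N/A
Taxon_B,N/A,298.9
Taxon_C,323.2,315.2"

# Declare "N/A" as a missing-value marker so it becomes NA on import.
occ <- read.csv(text = toy_csv,
                na.strings = c("N/A", "NA", ""),
                stringsAsFactors = FALSE)

# The age columns are parsed as numeric, with NA where "N/A" appeared.
str(occ)
colSums(is.na(occ))
```

The same `na.strings` argument can be passed when reading the real .csv files with `read.csv()`.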
Files and variables
File: Data.zip
Description: A .zip file containing the following folders and files:
- Additional_data: A folder of timescales and covariates used for analyses. Files include:
  - series.csv: A .csv file containing information on geological series through time for time binning.
- Specimen_data: A folder containing the complete dataset, as well as additional Paleobiology Database (PBDB) downloads that were combined with the specimen dataset and used for correlations. Files include:
  - all_palaeozoic_binned.csv: A .csv file containing the PBDB download for all marine Palaeozoic collections, binned by stages.
  - Echinodermata.csv: A .csv file containing the PBDB download for all Palaeozoic Echinodermata occurrences.
  - Final_Database_for_analysis.csv: A .csv file containing the finalised dataset used for analysis.
  - pbdb_data_21092023.csv: A .csv file containing the PBDB download for all Palaeozoic echinoid occurrences.
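To confirm that a download is complete, the archive contents can be listed and compared against the files described above. This is a rough sketch only: it assumes Data.zip sits in the current working directory, and `check_contents` is a hypothetical helper written for this example, not part of the dataset's scripts.

```r
# Assumption: Data.zip is in the working directory.
zip_path <- "Data.zip"

# Files named in this README, relative to their folders.
expected <- c(
  "Additional_data/series.csv",
  "Specimen_data/Echinodermata.csv",
  "Specimen_data/Final_Database_for_analysis.csv",
  "Specimen_data/pbdb_data_21092023.csv"
)

# Hypothetical helper: TRUE where an expected file matches the end of
# any path listed in the archive.
check_contents <- function(paths, expected) {
  vapply(expected, function(f) any(endsWith(paths, f)), logical(1))
}

if (file.exists(zip_path)) {
  listed <- unzip(zip_path, list = TRUE)$Name
  print(check_contents(listed, expected))
  # The binned PBDB collections file is matched by pattern.
  print(grep("all_palaeozoic.*binned\\.csv$", listed, value = TRUE))
} else {
  message("Data.zip not found in ", getwd())
}
```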
Code/software
There are three R scripts included with this dataset:
- Dark_Functions.R: An R file containing all the custom functions necessary to run analyses.
- Dark_Setup.R: An R file containing the processing and cleaning necessary prior to running analyses.
- Dark_Main_Script.R: An R file containing the main script for running all analyses carried out in the paper.
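Since the main script is described as calling the other two, sourcing it alone should be enough to reproduce the analyses. The sketch below assumes that all three scripts and the unzipped data folders sit in one project directory; the paths used inside the scripts themselves are not documented here and may need adjusting.

```r
# Assumption: scripts and unzipped data live in the working directory;
# change `project_dir` if they are stored elsewhere.
project_dir <- getwd()
scripts <- c("Dark_Functions.R", "Dark_Setup.R", "Dark_Main_Script.R")

# Check that every script is present before running anything.
missing_scripts <- scripts[!file.exists(file.path(project_dir, scripts))]

if (length(missing_scripts) > 0) {
  message("Missing scripts: ", paste(missing_scripts, collapse = ", "))
} else {
  # chdir = TRUE runs the script from its own directory, so relative
  # paths inside it resolve against project_dir.
  source(file.path(project_dir, "Dark_Main_Script.R"), chdir = TRUE)
}
```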
ESM
There are two electronic supplementary materials files included with this dataset:
- ESM 1: A Word document providing additional supplementary figures, as well as information regarding the use of additional covariates.
- ESM 2: An Excel file containing supplementary tables and datasets.
Access information
Other publicly accessible locations of the data:
Additional data was derived from the following sources: