Unlocking natural history collections to improve eDNA reference databases and biodiversity monitoring

Data files

Oct 16, 2025 version files (61.43 MB)

data.zip - 61.42 MB
README.md - 9.91 KB

Abstract

Biodiversity changes due to human activities highlight the need for efficient biodiversity monitoring approaches. Environmental DNA (eDNA) metabarcoding offers a non-invasive method used for biodiversity monitoring and ecosystem assessment, but its accuracy depends on comprehensive DNA reference databases. Natural history collections often contain rare or difficult-to-obtain samples that can serve as a valuable resource to fill gaps in eDNA reference databases. Here, we discuss the utility of specimens from natural history collections in supporting future eDNA applications. Museomics—the application of -omics techniques to museum specimens—offers a promising avenue for improving eDNA reference databases by increasing species coverage. Furthermore, museomics can provide transferable methodological advancements for extracting genetic material from samples with low and degraded DNA. The integration of natural history collections, museomics, and eDNA approaches has the potential to significantly improve our understanding of global biodiversity, highlighting the continued importance of natural history collections.

Description of the data and file structure

The dataset consists of a single archive, data.zip.

Various

kit_custom_prices.xlsx - price estimate for DNA extraction and ssDNA library prep using a commercial kit or the custom protocol from Nicolas Straube.

barcodes_data

Output from the cumul_barcodes_plot.R script.

species_with_barcodes.csv - list of all fishes (marine + freshwater) with a given barcode available, according to NCBI. (1) species name, (2) NCBI taxon ID, (3) date when the species sequence was first uploaded on NCBI, (4) marker of interest, (5) year the species sequence was first uploaded on NCBI.

occurence_data

Contains different types of species lists (museum, 12S availability, etc.).

combined_gbif_species.csv - output from the script museum_potential/1_process_gbif_datasets.R. Contains all the fish species found in the main natural history collections in Europe with data available on GBIF. Column names: (1) order; (2) family; (3) genus; (4) species; (5) countryCode: country where the specimen was initially collected; (6) year: year when the specimen was initially collected; (7) museum_collection: natural history collection where the specimen is stored.
fishes_barcode_museum.csv - output from the script museum_potential/2_create_museum_barecode_dataset.R. Contains all the fish species downloaded from FishBase (marine + freshwater), with information on which species can be found in which museum collection and which species have a sequenced barcode. Column names: (1) class; (2) order; (3) family; (4) species (the first five columns are taxonomic information about the species); (5) geo_area_fishbase: the broad geographic distribution of the species (one species can have multiple); (6) geo_code_fishbase: the associated area code from FishBase; (7) latitude_fishbase: the corresponding latitude; (8) longitude_fishbase: the corresponding longitude; (9) marine: whether the species is found in marine environments (TRUE if yes); (10) freshwater: whether the species is found in freshwater environments (TRUE if yes); (11) iucn: IUCN Red List category; (12-17) one column per barcode, giving the year when the first sequence of a given species was added to NCBI (mito stands for the full mitogenome); (18) museum_collection: the museum collection(s) in which the species can be found; (19) available_in_museum: whether the species is available in any museum collection.

sra_data

Contains all entries corresponding to Actinopterygii in the NCBI SRA database. It was not possible to retrieve this data locally, so the retrieval should be done on a server. The data was generated with the following command line, using the EDirect program:

esearch -db sra -query "Actinopterygii[ORGN]" | efetch -format runinfo > actino.csv

actino.csv.gz - raw output of the command line, gzipped.
actino_selected_columns.csv - modified output containing only the following columns of interest: ReleaseDate, TaxID, ScientificName. Generated with the following command lines, using csvkit to handle empty fields properly (awk, sed, and cut did not handle them correctly):

csvformat -D "," actino.csv > actino_normalised.csv
csvcut -c 2,28,29 actino_normalised.csv > actino_selected_columns.csv
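Once the three columns have been selected, the earliest release per taxon can be inspected directly from the command line. The following is only a minimal sketch, not part of the original pipeline: the helper name first_release_per_taxon is hypothetical, and it assumes a header row, the column order ReleaseDate, TaxID, ScientificName, ISO-style dates (YYYY-MM-DD ...), and no quoted commas inside fields. If any of these assumptions fails, a CSV-aware tool such as csvkit remains the safer choice.

```shell
# Hypothetical helper: print "TaxID,first_release_date,ScientificName",
# one line per taxon, keeping the chronologically first release.
# Assumes: header row, columns ReleaseDate,TaxID,ScientificName,
# ISO-style dates, and no quoted commas inside fields.
first_release_per_taxon() {
  tail -n +2 "$1" |                        # drop the header row
    LC_ALL=C sort -t, -k2,2 -k1,1 |        # group by TaxID, oldest date first
    awk -F, '!seen[$2]++ { print $2 "," $1 "," $3 }'   # keep first row per TaxID
}

# Example call (file name as produced by the csvcut command above):
# first_release_per_taxon actino_selected_columns.csv
```

Sorting on the date string works here only because ISO-style dates sort chronologically when compared as text.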

raw_data directory

All data from GBIF and IUCN were removed to comply with Dryad's CC0 waiver policy. Only the header of each file was retained, so that users know which file format the scripts expect.
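Because only the headers are kept, a quick way to see the expected format is to print the column names of a file, one per line, and compare them with a fresh download. This is a small sketch under assumptions: the helper name list_columns is hypothetical, the delimiter defaults to a comma, and the actual delimiter of each example file should be checked (GBIF downloads are often tab-delimited).

```shell
# Hypothetical helper: print the column names of FILE, one per line.
# The second argument is the single-character delimiter (default: comma);
# pass '\t' for tab-delimited files.
list_columns() {
  head -n 1 "$1" |          # header row only
    tr -d '\r' |            # tolerate Windows line endings
    tr "${2:-,}" '\n'       # one column name per line
}

# Example calls (paths are assumptions about where data.zip was unpacked):
# list_columns raw_data/iucn_red_list_data/example_format_iucn.csv
# list_columns raw_data/museum_data_gbif/example_format_gbif.csv '\t'
```

The output of two files can then be compared with diff to confirm that a new download matches the example format.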

museum_data_db

Should contain a list of specimens found in a given natural history museum collection that are not found in GBIF. The curator provided the list, which is available on request. The following example inputs, containing only the column headers, are included to confirm the format and enable script testing.

db_firenze_chondro_subset.csv - subset of chondrichthyan species in the NHM Firenze. Species: species of the specimen; family: family of the specimen; catalog: catalog number of the specimen; number_ind: number of individuals for the given species.
db_vienna_subset.csv - subset of species at the NHM Vienna. Specimen_ID: ID of the specimen; Catalog_number: catalog number of the specimen; Order: order of the specimen; Family: family of the specimen; Genus: genus of the specimen; Species: species of the specimen; MS_name: scientific name of the specimen; No_spms: number of individuals for the given ID; Type_status: whether the specimen is a type; Standort: location in the collection; Field_Number: ID of the specimen when collected; Country: country of collection of the specimen; Location: location of collection of the specimen; Collector: person who collected the specimen; Date_text: collection date of the specimen.

museum_data_gbif

This directory should contain lists of specimens from a given natural history museum collection, downloaded from GBIF.

citations.txt - doi: DOI of each dataset used in the study and date of download; collection: name of the collection that created the dataset; city: name of the city where the collection is found; Taxons: which taxa are included in the dataset; citation: how to cite the dataset in a publication.
example_format_gbif.csv - an example input containing only the column headers to confirm the format and enable script testing.

iucn_red_list_data

This folder should contain the lists of endangered species downloaded from the IUCN webpage.

example_format_iucn.csv - an example input containing only the column headers to confirm the format and enable script testing.

Scripts

cluster scripts directory

Scripts used for the analyses that were run on the Euler cluster.

retrieve_sra_actino_csv.sh

This script was used to retrieve all the fish species that are found in the NCBI SRA database. The output was saved in data/sra_data and subsequently used with the script cumul_barcodes_plot.R.

cumul_barcodes directory

cumul_barcodes_plot.R

This script retrieves all NCBI entries for a given group of species and a gene/genome of interest. It then processes the data and keeps only the first (chronologically) occurrences per species. The processed data is then plotted to represent the cumulative number of new sequences for a given barcode each year. This script requires the function_retrieve_ncbi.R R function.
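The cumulative count described above can be illustrated outside R with standard command-line tools. This is a rough sketch of the counting step only, not the R script itself: the helper name cumulative_by_year is hypothetical, and it assumes the input has already been reduced to one first-occurrence year per species, one year per line.

```shell
# Hypothetical helper: read one first-occurrence year per line on stdin and
# print "year,new_species,cumulative_species" in chronological order.
cumulative_by_year() {
  LC_ALL=C sort -n |        # chronological order
    uniq -c |               # count species first sequenced in each year
    awk '{ total += $1; print $2 "," $1 "," total }'   # running total
}

# Example call, assuming the year is the 5th column of the file, the file has
# a header row, and no field contains a quoted comma:
# tail -n +2 species_with_barcodes.csv | cut -d, -f5 | cumulative_by_year
```

The third output column is the value that would be plotted against the year for a cumulative curve.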

function_retrieve_ncbi.R

Function to retrieve gene (barcode) data from NCBI for a given group of species and format the data for plotting. Function definition: retrieve_ncbi takes the following input values:

gene (the gene name, e.g., "12S rRNA" or "16S rRNA"; optional for SRA searches)
term (search term for NCBI, e.g., "fish", optional)
db (which NCBI database to explore, e.g., "nucleotide", "sra", mandatory)
batch_size (number of IDs to request at once, default = 200)
retmax (maximum number of records to retrieve, default = 52000)
one_occ_per_sp (if TRUE, will modify the dataframe and keep only the earliest occurrence of a species, default = TRUE).
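To get a feel for how batch_size and retmax interact, the number of requests can be estimated with a ceiling division. This is a back-of-the-envelope sketch that assumes one request is issued per batch of IDs; the helper name n_batches is hypothetical and is not part of function_retrieve_ncbi.R.

```shell
# Hypothetical helper: number of requests needed to fetch RETMAX records
# in batches of BATCH_SIZE (integer ceiling division).
n_batches() {
  retmax=$1
  batch_size=$2
  echo $(( (retmax + batch_size - 1) / batch_size ))
}

# With the default values listed above:
n_batches 52000 200   # prints 260
```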

museum_potential directory

1_process_gbif_datasets.R

A script that takes as input unmodified csv files downloaded from GBIF, renamed according to the following model <collection_taxons.csv>, and outputs a single csv file keeping only the fish species, their taxonomic information, country of origin, collection year, and natural history collection name.
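The naming model can be checked before running the script. The sketch below splits a file name into its collection and taxon parts; the helper name split_gbif_name and the example file name are hypothetical, and the split assumes that the collection name itself contains no underscore.

```shell
# Hypothetical helper: split "<collection>_<taxons>.csv" into its two parts
# and print them separated by a tab. Splits at the first underscore.
split_gbif_name() {
  base=$(basename "$1" .csv)   # strip directory and .csv extension
  collection=${base%%_*}       # text before the first underscore
  taxons=${base#*_}            # text after the first underscore
  printf '%s\t%s\n' "$collection" "$taxons"
}

# Example call with an illustrative file name:
# split_gbif_name raw_data/museum_data_gbif/paris_actinopterygii.csv
```

A file name without any underscore yields the same text in both fields, which is an easy sign that the file has not been renamed according to the model.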

The output file is redirected to the data/occurence_data directory and is called combined_gbif_species.csv.

2_create_museum_barecode_dataset.R

A script that takes as input the GBIF files generated by the museum_potential/1_process_gbif_datasets.R script, custom museum datasets, the file generated with the cumul_barcodes/cumul_barcodes_plot.R script containing information on the available barcode sequences, and the raw data downloaded from the IUCN website with all the species and their Red List category (two files: marine + freshwater species). The script generates one output file, data/occurence_data/fishes_barcode_museum.csv, which lists all fish species (freshwater + marine) together with the availability of a barcode sequence and their presence in museum collections.

3_plot_museum_potential.R

A script that calls the different functions found in the same directory to plot data related to barcode sequence availability and how museum specimens found in European collections can improve the number of available sequences. Requires as input the file occurence_data/fishes_barcode_museum.csv.

function_plot_geo_region.R

This is a function that plots the proportion of available sequences for a given barcode and the proportion that could be further sequenced if we took into account the museum specimens available in Europe. Proportions are illustrated by geographic region. The function is used with the script 3_plot_museum_potential.R and requires as input the file occurence_data/fishes_barcode_museum.csv generated with the script 2_create_museum_barecode_dataset.R.

# example usage
geo_plot_12S <- plot_sequence_availability(fish_barcode_mus, "rRNA_12S")
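The quantities behind this plot can also be tabulated without R, as a quick cross-check. The sketch below is not the plotting function itself: the helper name barcode_gap_by_region is hypothetical, and it assumes that the file has a header row, that a missing barcode is encoded as an empty field or NA, that available_in_museum holds TRUE or FALSE, and that no field contains a quoted comma. Columns are located by header name rather than by position.

```shell
# Hypothetical helper: for each geographic region, print
# "region,total_species,with_barcode,museum_only", where museum_only counts
# species without a sequence for MARKER that are available in a museum.
# Usage: barcode_gap_by_region FILE MARKER   (e.g. MARKER = rRNA_12S)
barcode_gap_by_region() {
  awk -F, -v marker="$2" '
    NR == 1 {                                  # map header names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      region = $(col["geo_area_fishbase"])
      value  = $(col[marker])
      total[region]++
      if (value != "" && value != "NA")
        sequenced[region]++                    # barcode already on NCBI
      else if ($(col["available_in_museum"]) == "TRUE")
        museum_only[region]++                  # could be sequenced from a museum
    }
    END {
      for (r in total)
        printf "%s,%d,%d,%d\n", r, total[r], sequenced[r], museum_only[r]
    }
  ' "$1" | LC_ALL=C sort
}

# Example call:
# barcode_gap_by_region occurence_data/fishes_barcode_museum.csv rRNA_12S
```

Dividing the last two columns by the total gives the two proportions per region. Grouping on the iucn column instead of geo_area_fishbase would give the corresponding breakdown by Red List category.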

function_plot_iucn.R

This is a function that plots the proportion of available sequences for a given barcode and the proportion that could be further sequenced if we took into account the museum specimens available in Europe. The proportions are illustrated by IUCN Red List category. The function is used with the script 3_plot_museum_potential.R and requires as input the file occurence_data/fishes_barcode_museum.csv generated with the script 2_create_museum_barecode_dataset.R.

# example usage
iucn_plot_12S <- plot_iucn_availability(fish_barcode_mus, "rRNA_12S")
