Unlocking natural history collections to improve eDNA reference databases and biodiversity monitoring
Abstract
Biodiversity changes due to human activities highlight the need for efficient biodiversity monitoring approaches. Environmental DNA (eDNA) metabarcoding offers a non-invasive method used for biodiversity monitoring and ecosystem assessment, but its accuracy depends on comprehensive DNA reference databases. Natural history collections often contain rare or difficult-to-obtain samples that can serve as a valuable resource to fill gaps in eDNA reference databases. Here, we discuss the utility of specimens from natural history collections in supporting future eDNA applications. Museomics—the application of -omics techniques to museum specimens—offers a promising avenue for improving eDNA reference databases by increasing species coverage. Furthermore, museomics can provide transferable methodological advancements for extracting genetic material from samples with low and degraded DNA. The integration of natural history collections, museomics, and eDNA approaches has the potential to significantly improve our understanding of global biodiversity, highlighting the continued importance of natural history collections.
Description of the data and file structure
The dataset consists of a main folder, data.zip.
Various
- kit_custom_prices.xlsx - price estimate for DNA extraction and ssDNA library prep using a commercial kit or the custom protocol from Nicolas Straube.
barcodes_data
output from the cumul_barcodes_plot.R script.
- species_with_barcodes.csv - list of all fishes (marine + freshwater) with a given barcode available, according to NCBI. (1) species name, (2) NCBI taxon ID, (3) date when the species sequence was first uploaded on NCBI, (4) marker of interest, (5) year the species sequence was first uploaded on NCBI.
occurence_data
Contains several types of species lists (museum holdings, 12S availability, etc.).
- combined_gbif_species.csv - output from the script museum_potential/1_process_gbif_datasets.R. Contains all the fish species found in the main natural history collections in Europe with data available on GBIF. Column names: (1) order; (2) family; (3) genus; (4) species; (5) countryCode: country where the specimen was initially collected; (6) year: year when the specimen was initially collected; (7) museum_collection: natural history collection where the specimen is stored.
- fishes_barcode_museum.csv - output from the script museum_potential/2_create_museum_barecode_dataset.R. Contains all the fish species downloaded from FishBase (marine + freshwater), with information on which species can be found in which museum collection and which species have a sequenced barcode. Column names: (1) class; (2) order; (3) family; (4) species (these first columns hold the taxonomic information about the species); (5) geo_area_fishbase: the broad geographic distribution of the species (one species can have several); (6) geo_code_fishbase: the associated area code from FishBase; (7) latitude_fishbase: the corresponding latitude; (8) longitude_fishbase: the corresponding longitude; (9) marine: whether the species is found in marine environments (TRUE if yes); (10) freshwater: whether the species is found in freshwater environments (TRUE if yes); (11) iucn: IUCN Red List category; (12-17) one column per barcode, giving the year when the first sequence of a given species was added to NCBI (mito stands for the full mitogenome); (18) museum_collection: the museum collection(s) in which the species can be found; (19) available_in_museum: whether the species is available in any museum collection.
sra_data
Contains all entries corresponding to Actinopterygii in the NCBI SRA database. The data could not be retrieved on a local machine, so the retrieval should be run on a server. Generated with the following command line, using the edirect program:
esearch -db sra -query "Actinopterygii[ORGN]" | efetch -format runinfo > actino.csv
- actino.csv.gz - raw output of the command line, gzipped.
- actino_selected_columns.csv - modified output containing only the following columns of interest: ReleaseDate, TaxID, ScientificName. Generated using the following command lines with the help of csvkit, which properly handles empty fields (this does not work with awk, sed, or cut).
csvformat -D "," actino.csv > actino_normalised.csv
csvcut -c 2,28,29 actino_normalised.csv > actino_selected_columns.csv
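For readers who prefer to stay in R, the same column selection can be sketched by column name instead of by position. This is only an illustration and not part of the original workflow: the toy table below is invented, contains only a few of the runinfo columns, and includes an empty field and a quoted field to show that a CSV-aware reader keeps the columns aligned.

```r
# Invented toy excerpt; only the three column names of interest come from the README.
csv_text <- 'Run,ReleaseDate,LibraryName,TaxID,ScientificName
SRR0000001,2015-04-02 10:00:00,,7955,Danio rerio
SRR0000002,2019-09-17 08:30:00,"lib, replicate 2",8030,Salmo salar'

# read.csv parses empty and quoted fields without shifting columns.
runinfo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Select the columns of interest by name rather than by position.
selected <- runinfo[, c("ReleaseDate", "TaxID", "ScientificName")]
print(selected)
```

Selecting by name avoids depending on the column order of the runinfo output.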
raw_data directory
All data from GBIF and IUCN were removed to comply with Dryad's CC0 waiver policy. Only the header of each file was retained, so that users know which file format the scripts expect.
museum_data_db
Should contain a list of specimens found in a given natural history museum collection that are not found in GBIF. The curator provided the list, which is available on request. The following example inputs, containing only the column headers, are included to confirm the format and enable script testing.
- db_firenze_chondro_subset.csv - subset of chondrichthyan species in the NHM Firenze. Species: species of the specimen; family: family of the specimen; catalog: catalog number of the specimen; number_ind: number of individuals for the given species.
- db_vienna_subset.csv - subset of species at the NHM Vienna. Specimen_ID: ID of the specimen; Catalog_number: catalog number of the specimen; Order: order of the specimen; Family: family of the specimen; Genus: genus of the specimen; Species: species of the specimen; MS_name: scientific name of the specimen; No_spms: number of individuals for the given ID; Type_status: whether the specimen is a type; Standort: location in the collection; Field_Number: ID of the specimen when collected; Country: country of collection of the specimen; Location: location of collection of the specimen; Collector: person who collected the specimen; Date_text: collection date of the specimen.
museum_data_gbif
This directory should contain lists of specimens from a given natural history museum collection, downloaded from GBIF.
- citations.txt - doi: DOI for all datasets used in the study and date of download; collection: name of the collection that created the dataset; city: name of the city where the collection is found; Taxons: which taxa are included in the dataset; citation: how to cite the dataset in a publication.
- example_format_gbif.csv - an example input containing only the column headers to confirm the format and enable script testing.
iucn_red_list_data
This folder should contain the lists of endangered species downloaded from the IUCN webpage.
- example_format_iucn.csv - an example input containing only the column headers to confirm the format and enable script testing.
Scripts
cluster scripts directory
Scripts used for analyses that were run on the Euler cluster.
retrieve_sra_actino_csv.sh
This script was used to retrieve all the fish species found in the NCBI SRA database. The output was saved in data/sra_data and subsequently used with the script cumul_barcodes_plot.R.
cumul_barcodes directory
cumul_barcodes_plot.R
This script retrieves all NCBI entries for a given group of species and a gene/genome of interest. It then processes the data and keeps only the first (chronologically) occurrences per species. The processed data is then plotted to represent the cumulative number of new sequences for a given barcode each year. This script requires the function_retrieve_ncbi.R R function.
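The processing step described above (keep the earliest record per species, then accumulate the number of newly sequenced species per year) can be sketched in base R as follows. This is a minimal illustration on invented toy records, not the code of cumul_barcodes_plot.R; the data frame and column names are assumptions.

```r
# Toy records standing in for NCBI hits; species and dates are invented.
records <- data.frame(
  species = c("Salmo trutta", "Salmo trutta", "Esox lucius",
              "Perca fluviatilis", "Esox lucius"),
  upload_date = as.Date(c("2005-03-01", "2001-07-15", "2001-11-30",
                          "2010-01-20", "2012-06-05")),
  stringsAsFactors = FALSE
)

# Extract the upload year, then keep the earliest year per species.
records$year <- as.integer(format(records$upload_date, "%Y"))
first_year <- aggregate(year ~ species, data = records, FUN = min)

# Count species first sequenced in each year and accumulate over time.
new_per_year <- table(first_year$year)
cumulative <- data.frame(
  year = as.integer(names(new_per_year)),
  n_species = cumsum(as.vector(new_per_year))
)
print(cumulative)
```

Years in which no new species was sequenced do not appear in this table; a plot over a continuous time axis would need them filled in with the previous cumulative value.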
function_retrieve_ncbi.R
Function to retrieve gene (barcode) data from NCBI for a given group of species and format the data for plotting. Function definition: retrieve_ncbi takes the following input values:
- gene (the gene name, e.g., "12S rRNA" or "16S rRNA", optional (for SRA search))
- term (search term for NCBI, e.g., "fish", optional)
- db (which NCBI database to explore, e.g., "nucleotide", "sra", mandatory)
- batch_size (number of IDs to request at once, default = 200)
- retmax (maximum number of records to retrieve, default = 52000)
- one_occ_per_sp (if TRUE, will modify the dataframe and keep only the earliest occurrence of a species, default = TRUE).
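How retrieve_ncbi uses batch_size and retmax internally is not described here. The sketch below only illustrates one plausible reading of the two arguments: cap the number of record IDs at retmax and split them into consecutive batches of batch_size. The ID vector is a hypothetical stand-in for the result of an NCBI search, and no request is sent.

```r
# Hypothetical ID vector standing in for the result of an NCBI search.
ids <- as.character(seq_len(450))

batch_size <- 200  # documented default
retmax <- 52000    # documented default

# Cap the number of records, then split the IDs into consecutive batches.
ids <- head(ids, retmax)
batches <- split(ids, ceiling(seq_along(ids) / batch_size))

# Each batch would correspond to one request; here we only report batch sizes.
print(lengths(batches))
```

Requesting IDs in batches keeps each query small, which is a common way to avoid overly long requests to a remote service.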
museum_potential directory
1_process_gbif_datasets.R
A script that takes as input unmodified csv files downloaded from GBIF, renamed according to the following model <collection_taxons.csv>, and outputs a single csv file keeping only the fish species, their taxonomic information, country of origin, collection year, and natural history collection name.
The output file is redirected to the data/occurence_data directory and is called combined_gbif_species.csv.
2_create_museum_barecode_dataset.R
A script that takes as input the GBIF files generated by the museum_potential/1_process_gbif_datasets.R script, custom museum datasets, the file generated with the cumul_barcodes/cumul_barcodes_plot.R script containing information on the available barcode sequences, and the raw data downloaded from the IUCN website with all the species and their Red List category (two files: marine + freshwater species). The script generates one output file, data/occurence_data/fishes_barcode_museum.csv, which lists all fish species (freshwater + marine) together with the availability of barcode sequences and their presence in museum collections.
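The combination step can be pictured as a series of left joins onto the master species list. The following base R sketch uses invented toy tables and is not the code of 2_create_museum_barecode_dataset.R; the ";" separator for multiple collections and the use of NA for a missing barcode are assumptions.

```r
# Toy inputs; all values are invented for illustration.
species <- data.frame(
  species = c("Salmo trutta", "Esox lucius", "Perca fluviatilis"),
  stringsAsFactors = FALSE
)
barcodes <- data.frame(species = "Salmo trutta", rRNA_12S = 2001L,
                       stringsAsFactors = FALSE)
museum <- data.frame(
  species = c("Esox lucius", "Salmo trutta", "Esox lucius"),
  museum_collection = c("Vienna", "Vienna", "Firenze"),
  stringsAsFactors = FALSE
)

# Collapse museum records to one row per species, listing all collections.
museum_by_sp <- aggregate(
  museum_collection ~ species, data = museum,
  FUN = function(x) paste(sort(unique(x)), collapse = ";")
)

# Left joins keep every species from the master list.
combined <- merge(species, barcodes, by = "species", all.x = TRUE)
combined <- merge(combined, museum_by_sp, by = "species", all.x = TRUE)
combined$available_in_museum <- !is.na(combined$museum_collection)
print(combined)
```

Using all.x = TRUE ensures that species with neither a barcode nor a museum specimen remain in the output, which matters when computing proportions later.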
3_plot_museum_potential.R
A script that calls the different functions found in the same directory to plot data related to barcode sequence availability and how museum specimens found in European collections can improve the number of available sequences. Requires as input the file occurence_data/fishes_barcode_museum.csv.
function_plot_geo_region.R
This is a function that plots the proportion of available sequences for a given barcode and the proportion that could be further sequenced if we took into account the museum specimens available in Europe. Proportions are illustrated by geographic regions. The function is used with the script 3_plot_museum_potential.R and requires as input the file occurence_data/fishes_barcode_museum.csv generated with the script 2_create_museum_barecode_dataset.R
# example usage
geo_plot_12S <- plot_sequence_availability(fish_barcode_mus, "rRNA_12S")
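The quantities behind such a plot can be sketched without any plotting code. The example below is an illustration on an invented toy table, not the implementation of plot_sequence_availability; it assumes that a barcode column holds the year of the first sequence and is NA when no sequence exists.

```r
# Toy table; regions and values are invented.
fish <- data.frame(
  geo_area_fishbase = c("Europe", "Europe", "Africa", "Africa", "Africa"),
  rRNA_12S = c(2005L, NA, NA, NA, 2012L),
  available_in_museum = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# A species counts as sequenced if it has a first-upload year.
fish$sequenced <- !is.na(fish$rRNA_12S)
# Species without a sequence that a museum specimen could supply.
fish$museum_gain <- !fish$sequenced & fish$available_in_museum

# The mean of a logical vector is the proportion of TRUE values.
props <- aggregate(cbind(sequenced, museum_gain) ~ geo_area_fishbase,
                   data = fish, FUN = mean)
print(props)
```

Replacing geo_area_fishbase with the iucn column would give the same breakdown by Red List category.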
function_plot_iucn.R
This is a function that plots the proportion of available sequences for a given barcode and the proportion that could be further sequenced if we took into account the museum specimens available in Europe. The proportions are illustrated by the IUCN Red List category. The function is used with the script 3_plot_museum_potential.R and requires as input the file occurence_data/fishes_barcode_museum.csv generated with the script 2_create_museum_barecode_dataset.R
# example usage
iucn_plot_12S <- plot_iucn_availability(fish_barcode_mus, "rRNA_12S")
Dataset for analyzing the potential of museum specimens to improve the DNA reference database
To examine the cumulative number of species sequenced for a given DNA barcode or mitochondrial genome (also referred to as mitogenome) over the years, we retrieved all data available from NCBI using the R package rentrez v1.2.3 (Winter 2017). We searched the nucleotide database for the rRNA 12S, rRNA 16S, rRNA 18S, cytochrome B (cytB), and cytochrome oxidase I (COI) barcodes, as well as for the complete mitogenomes, for all fish orders. In addition, we retrieved all the fish species with available data in the Sequence Read Archive (SRA) using Entrez Direct (Kans 2024), which provides access to the NCBI databases from a Unix terminal window.
To highlight the potential of museum specimens for increasing the number of species with an available barcode/mitogenome sequence, we first downloaded all available datasets on the Global Biodiversity Information Facility (GBIF) listing fish specimens stored in European natural history museum collections (see table S1). Subsequently, we downloaded a list of all existing fish species using the R package rfishbase v5.0.0 (Boettiger et al. 2012) and extracted their geographic range (field AreaCode). In addition, we retrieved information about the Red List status of all fish species from the International Union for Conservation of Nature and Natural Resources (IUCN) website. All the datasets (barcodes, museum specimens, IUCN status and fish species list, and geographic range) were combined in R v.4.3.0 and subsequently plotted.
