Data from: An accessible metagenomic strategy allows for better characterization of invertebrate bulk samples
Data files
Apr 24, 2025 version files 27.30 GB
- Add_Taxid.py (1.60 KB)
- Biomass_correction.txt (311 B)
- Data_analysis_macrobenthos.Rmd (15.25 KB)
- Macrobenthos_BPNS_v0.1.tar.gz (27.30 GB)
- macrobenthos_classification.biom (12.42 KB)
- metadata_macrobenthos_metagenomics.txt (23.56 KB)
- metadata_macrobenthos_stalen.txt (445 B)
- README.md (4.54 KB)
Abstract
DNA-based techniques are a popular approach for assessing biodiversity in ecological research, especially for organisms that are difficult to detect or identify morphologically. Metabarcoding, the most established method for determining species composition and relative abundance in bulk samples, can be more sensitive and more time- and cost-effective than traditional morphological approaches. However, one drawback of this method is PCR bias caused by between-species variation in the amplification efficiency of a marker gene. Metagenomics, which bypasses PCR amplification, has been proposed as an alternative to overcome this bias. Several studies have already shown the promising potential of metagenomics, but they all identify the unavailability of reference genomes for most species in any ecosystem as one of the primary bottlenecks preventing its wider implementation. This dataset is from a study that presents a strategy that combines unassembled reads from low-coverage whole genome sequencing with publicly available reference genomes to construct a genomic reference database, thus circumventing high sequencing costs and intensive bioinformatic processing. The dataset was used to show that this approach is superior to metabarcoding for approximating the relative biomass of macrobenthos species from bulk samples. Furthermore, these results can be obtained with a sequencing effort comparable to that of metabarcoding. This strategy can accelerate the implementation of metagenomics in biodiversity assessments, as it should be relatively easy to adopt by laboratories familiar with metabarcoding and can be used as an accessible alternative.
Dataset DOI: 10.5061/dryad.gqnk98szx
Description of the data and file structure
The Kraken2 custom database was created using unassembled low-coverage genome sequencing data of 24 macrobenthos species from the Belgian part of the North Sea and one publicly available assembled reference genome.
The results from the metagenomic analysis (BIOM file) were obtained by classifying shotgun sequencing reads from environmental macrobenthos bulk samples using our custom database.
Files and variables
File: macrobenthos_classification.biom
Description: BIOM format v1.0 (see https://biom-format.org/ for details) containing the number of metagenomic reads assigned to a particular taxonomic group for each environmental sample.
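As a rough illustration of how such a table can be read, the sketch below parses a BIOM v1.0 document (which is JSON) using only the Python standard library. The embedded table is a hand-made toy example, not data from this dataset, and the taxon and sample IDs are hypothetical.

```python
import json

# Tiny hand-made BIOM v1.0 document (illustrative values, NOT from the dataset).
EXAMPLE_BIOM = """
{
  "id": null,
  "format": "Biological Observation Matrix 1.0.0",
  "format_url": "http://biom-format.org",
  "type": "OTU table",
  "generated_by": "example",
  "date": "2025-01-01T00:00:00",
  "matrix_type": "sparse",
  "matrix_element_type": "int",
  "shape": [2, 2],
  "rows": [{"id": "taxon_A", "metadata": null},
           {"id": "taxon_B", "metadata": null}],
  "columns": [{"id": "sample_1", "metadata": null},
              {"id": "sample_2", "metadata": null}],
  "data": [[0, 0, 10], [1, 0, 5], [1, 1, 7]]
}
"""


def biom_to_counts(biom):
    """Return {sample_id: {taxon_id: count}} from a parsed BIOM v1.0 dict."""
    taxa = [row["id"] for row in biom["rows"]]
    samples = [col["id"] for col in biom["columns"]]
    counts = {sample: {} for sample in samples}
    if biom["matrix_type"] == "sparse":
        # Sparse tables store [row_index, column_index, value] triplets.
        for r, c, value in biom["data"]:
            counts[samples[c]][taxa[r]] = value
    else:
        # Dense tables store one list of values per row (taxon).
        for r, row in enumerate(biom["data"]):
            for c, value in enumerate(row):
                if value:
                    counts[samples[c]][taxa[r]] = value
    return counts


counts = biom_to_counts(json.loads(EXAMPLE_BIOM))
```

To apply this to the real file, the JSON would be loaded from macrobenthos_classification.biom instead of the inline string, assuming the file is stored as plain v1.0 JSON.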
File: Data_analysis_macrobenthos.Rmd
Description: R markdown file to perform the statistical analysis. Annotations are provided within the code.
File: metadata_macrobenthos_stalen.txt
Description: metadata for the environmental macrobenthos bulk samples that were analysed by shotgun metagenomics
Variables
- sample: ID for each sample
- zone: zone in the Belgian part of the North Sea where the sample was taken (used to differentiate the TB samples from the other two samples)
- impact: For TB samples, indication if a sample is from a sand extraction area (IMP) or a reference area (REF). Not applicable (NA) for stations not in the TB zone
- year: Year of sampling
- total_reads: number of shotgun PE reads in metagenomic analysis
File: metadata_macrobenthos_metagenomics.txt
Description: biomass and metabarcoding data (for each species and sample) with associated metadata for bulk macrobenthos samples.
Variables
- sample: sample ID
- species: Binomial name of the species
- ncbi_txid: NCBI tax id of the species
- dataset: indicates whether the sample comes from the long-term (LT) or sand extraction (SE) dataset
- station: ID of the station where the sample was taken
- sand_extraction: indication if sand extraction was taking place for a given station (only applicable for SE dataset samples: reference = no extraction; impact = extraction; NA = Not Applicable).
- year: year the sample was taken (NA = Not Applicable)
- morph_indiv: number of individuals per m2 of a given species in a sample
- morph_biomass: biomass (in gram) of a species in a sample (NA = Not Available)
- metabarcoding_reads: number of metabarcoding reads assigned to a particular species in a sample
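The variables above lend themselves to a per-sample comparison of relative biomass and relative read abundance. The following is a minimal sketch of that calculation, assuming the file is tab-delimited with a header row (the delimiter is an assumption, not stated here). The rows are made up for illustration and use only a subset of the columns; the function name `relative_share` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using a subset of the documented columns (illustrative values only).
EXAMPLE_TABLE = (
    "sample\tspecies\tmorph_biomass\tmetabarcoding_reads\n"
    "S1\tGlycera alba\t3.0\t100\n"
    "S1\tProcessa modica\t1.0\t300\n"
    "S2\tGlycera alba\tNA\t50\n"
)


def relative_share(rows, column):
    """Return {(sample, species): share of the sample total} for one numeric column.

    Rows whose value is "NA" are skipped, so they contribute neither to the
    numerator nor to the per-sample total.
    """
    values = {}
    totals = defaultdict(float)
    for row in rows:
        if row[column] == "NA":
            continue
        value = float(row[column])
        values[(row["sample"], row["species"])] = value
        totals[row["sample"]] += value
    return {
        key: value / totals[key[0]]
        for key, value in values.items()
        if totals[key[0]] > 0
    }


rows = list(csv.DictReader(io.StringIO(EXAMPLE_TABLE), delimiter="\t"))
biomass_share = relative_share(rows, "morph_biomass")
reads_share = relative_share(rows, "metabarcoding_reads")
```

The statistical analysis itself is in Data_analysis_macrobenthos.Rmd; this sketch only shows the shape of the data.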
File: Biomass_correction.txt
Description: This file is used in the statistical analysis to ‘promote’ the biomass that was assigned to Processa sp. and Glycera sp. to respectively Processa modica and Glycera alba.
Variables
- station: station ID
- Processa: biomass (in gram) assigned to Processa sp. by morphological identification for a particular station
- Glycera: biomass (in gram) assigned to Glycera sp. by morphological identification for a particular station
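One plausible reading of "promoting" biomass is that the genus-level biomass for a station is added to the biomass of the corresponding species at that station. The sketch below illustrates that interpretation only; the actual procedure is the one in Data_analysis_macrobenthos.Rmd. The function name, station IDs and values are hypothetical.

```python
# Mapping taken from the file description: genus-level records are
# promoted to these species.
GENUS_TO_SPECIES = {
    "Processa": "Processa modica",
    "Glycera": "Glycera alba",
}


def promote_genus_biomass(species_biomass, corrections):
    """Add genus-level biomass to the matching species, per station.

    species_biomass: {(station, species): grams}
    corrections:     {station: {"Processa": grams, "Glycera": grams}}
    Returns a new dict; the inputs are left unchanged.
    ASSUMPTION: promotion means simple addition within the same station.
    """
    corrected = dict(species_biomass)
    for station, genus_values in corrections.items():
        for genus, grams in genus_values.items():
            key = (station, GENUS_TO_SPECIES[genus])
            corrected[key] = corrected.get(key, 0.0) + grams
    return corrected


# Hypothetical station and values, for illustration only.
before = {("ST1", "Glycera alba"): 1.0}
fix = {"ST1": {"Processa": 0.25, "Glycera": 0.5}}
after = promote_genus_biomass(before, fix)
```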
File: Add_Taxid.py
Description: Script that adds NCBI taxids to FASTA sequence headers so that Kraken2 recognises them when building a custom database.
File: Macrobenthos_BPNS_v0.1.tar.gz
Description: Kraken2 custom database used in this study to classify metagenomic reads. The archive contains the three files required by the database (described in docs/MANUAL.markdown of the DerrickWood/kraken2 repository on GitHub):
- hash.k2d: Contains the minimizer to taxon mappings
- opts.k2d: Contains information about the options used to build the database
- taxo.k2d: Contains taxonomy information used to build the database
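After extracting the archive, it can be useful to confirm that the three database files are present before pointing Kraken2 at the directory. The sketch below is one way to do that; the helper name `missing_k2d_files` is hypothetical, and the demonstration uses a temporary directory rather than the real database.

```python
import tempfile
from pathlib import Path

# The three files listed in the database description.
REQUIRED_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_k2d_files(database_dir):
    """Return the required Kraken2 database files not found in database_dir."""
    directory = Path(database_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


# Demonstration on a temporary directory that contains only hash.k2d.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hash.k2d").write_bytes(b"")
    demo_missing = missing_k2d_files(tmp)
```

An empty list means all three files were found. The location of these files inside the extracted archive is not specified here, so the directory to check has to be determined after extraction.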
Code/software
The Data_analysis_macrobenthos.Rmd file can be opened in RStudio. The analysis was performed using R 4.3.0. The following packages were loaded when running the analysis:
- phyloseq v1.41.1
- ggplot2 v3.5.0
- ggpubr v0.6.0
The Add_Taxid.py script adds NCBI taxids to FASTA sequence headers so they can be used to build a custom Kraken2 database. It requires Python 3.2 or higher.
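For readers unfamiliar with the header convention, Kraken2 accepts sequence IDs of the form `seqid|kraken:taxid|XXXX` when adding sequences to a custom library. The sketch below shows that transformation in isolation. It is not the Add_Taxid.py script itself; the function names are hypothetical, the taxid values are placeholders, and it uses f-strings, so it needs a newer Python than the script does.

```python
def add_taxid(header, taxid):
    """Rewrite '>seq1 description' as '>seq1|kraken:taxid|<taxid> description'."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line: " + header)
    # The sequence ID is everything up to the first space; the rest is free text.
    seq_id, _, description = header[1:].partition(" ")
    tagged = f">{seq_id}|kraken:taxid|{taxid}"
    return f"{tagged} {description}" if description else tagged


def tag_fasta(lines, taxid):
    """Yield FASTA lines with every header tagged; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield add_taxid(line, taxid) if line.startswith(">") else line


# Placeholder taxid (12345) used for illustration only.
example = list(tag_fasta([">contig1 len=100\n", "ACGTACGT\n"], 12345))
```

This sketch applies one taxid to a whole file, which assumes each input FASTA file holds reads from a single species.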
Access information
Other publicly accessible locations of the data:
- n/a
Data was derived from the following sources:
- Sequencing data for construction of the database and environmental bulk samples can be found on ENA project PRJEB83993.