Data from: An accessible metagenomic strategy allows for better characterization of invertebrate bulk samples

Callens, Martijn; Le Berre, Guillaume; Van den Bulcke, Laure; Lolivier, Marianne; Derycke, Sofie

Published Apr 24, 2025 on Dryad. https://doi.org/10.5061/dryad.gqnk98szx

Data files

Apr 24, 2025 version files (27.30 GB in total):

Add_Taxid.py: 1.60 KB
Biomass_correction.txt: 311 B
Data_analysis_macrobenthos.Rmd: 15.25 KB
Macrobenthos_BPNS_v0.1.tar.gz: 27.30 GB
macrobenthos_classification.biom: 12.42 KB
metadata_macrobenthos_metagenomics.txt: 23.56 KB
metadata_macrobenthos_stalen.txt: 445 B
README.md: 4.54 KB

Abstract

DNA-based techniques are a popular approach for assessing biodiversity in ecological research, especially for organisms that are difficult to detect or identify morphologically. Metabarcoding, the most established method for determining species composition and relative abundance in bulk samples, can be more sensitive and more time- and cost-effective than traditional morphological approaches. However, one drawback of this method is PCR bias caused by between-species variation in the amplification efficiency of a marker gene. Metagenomics, which bypasses PCR amplification, has been proposed as an alternative to overcome this bias. Several studies have already shown the promising potential of metagenomics, but they all indicate the unavailability of reference genomes for most species in any ecosystem as one of the primary bottlenecks preventing its wider implementation. This dataset is from a study that presents a strategy combining unassembled reads from low-coverage whole genome sequencing with publicly available reference genomes to construct a genomic reference database, thus circumventing high sequencing costs and intensive bioinformatic processing. This dataset was used to show that this approach is superior to metabarcoding for approximating the relative biomass of macrobenthos species from bulk samples. Furthermore, these results can be obtained with a sequencing effort comparable to metabarcoding. This strategy can accelerate the implementation of metagenomics in biodiversity assessments, as it should be relatively easy to adopt by laboratories familiar with metabarcoding and can be used as an accessible alternative.

Dataset DOI: 10.5061/dryad.gqnk98szx

Description of the data and file structure

The Kraken2 custom database was created using unassembled low-coverage genome sequencing data of 24 macrobenthos species from the Belgian part of the North Sea and one publicly available assembled reference genome.

The results from the metagenomic analysis (BIOM file) were obtained by classifying shotgun sequencing reads from environmental macrobenthos bulk samples using our custom database.

Files and variables

File: macrobenthos_classification.biom

Description: BIOM format v1.0 (see https://biom-format.org/ for details) containing the number of metagenomic reads assigned to a particular taxonomic group for each environmental sample.
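As an illustration of how the read counts in a BIOM v1.0 table can be inspected without dedicated tooling, the sketch below sums the counts per sample using only the Python standard library. BIOM v1.0 is a JSON format, so `json.load` is sufficient to read it. The function names and the tiny example table are hypothetical and are not taken from the dataset; the sketch only assumes the `rows`, `columns`, `matrix_type` and `data` keys defined by the BIOM v1.0 specification.

```python
import json


def load_biom(path):
    """Load a BIOM v1.0 (JSON) file into a plain dictionary."""
    with open(path) as handle:
        return json.load(handle)


def reads_per_sample(biom):
    """Sum the counts in each column (sample) of a BIOM v1.0 table."""
    sample_ids = [column["id"] for column in biom["columns"]]
    totals = {sample_id: 0 for sample_id in sample_ids}
    if biom["matrix_type"] == "sparse":
        # Sparse tables store [row_index, column_index, value] triples.
        for _row, col, value in biom["data"]:
            totals[sample_ids[col]] += value
    else:
        # Dense tables store one list of values per row (taxon).
        for row_values in biom["data"]:
            for col, value in enumerate(row_values):
                totals[sample_ids[col]] += value
    return totals


# Tiny made-up table (not real data) with two taxa and two samples.
example = {
    "rows": [{"id": "taxon_A", "metadata": None},
             {"id": "taxon_B", "metadata": None}],
    "columns": [{"id": "S1", "metadata": None},
                {"id": "S2", "metadata": None}],
    "matrix_type": "sparse",
    "shape": [2, 2],
    "data": [[0, 0, 5], [0, 1, 2], [1, 1, 7]],
}
print(reads_per_sample(example))  # {'S1': 5, 'S2': 9}
```

For the real file, `reads_per_sample(load_biom("macrobenthos_classification.biom"))` should work in the same way, assuming the file is stored as plain JSON as the v1.0 format prescribes.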

File: Data_analysis_macrobenthos.Rmd

Description: R Markdown file used to perform the statistical analysis. Annotations are provided within the code.

File: metadata_macrobenthos_stalen.txt

Description: metadata for the environmental macrobenthos bulk samples that were analysed by shotgun metagenomics

Variables

sample: ID for each sample
zone: zone in the Belgian part of the North Sea where the sample was taken (used to differentiate the TB samples from the other two samples)
impact: For TB samples, indication if a sample is from a sand extraction area (IMP) or a reference area (REF). Not applicable (NA) for stations not in the TB zone
year: Year of sampling
total_reads: number of shotgun paired-end (PE) reads in the metagenomic analysis

File: metadata_macrobenthos_metagenomics.txt

Description: biomass and metabarcoding data (for each species and sample) with associated metadata for bulk macrobenthos samples.

Variables

sample: sample ID
species: Binomial name of the species
ncbi_txid: NCBI tax id of the species
dataset: indication of whether the sample comes from the long-term (LT) or sand extraction (SE) dataset
station: ID of the station where the sample was taken
sand_extraction: indication if sand extraction was taking place for a given station (only applicable for SE dataset samples: reference = no extraction; impact = extraction; NA = Not Applicable).
year: year the sample was taken (NA = Not Applicable)
morph_indiv: number of individuals per m2 of a given species in a sample
morph_biomass: biomass (in gram) of a species in a sample (NA = Not Available)
metabarcoding_reads: number of metabarcoding reads assigned to a particular species in a sample
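The variables above can be turned into per-sample relative proportions, which is the scale on which biomass and read counts are usually compared. The following is a minimal sketch, not the authors' analysis (which is in the R Markdown file). It assumes the file is tab-delimited with a header row, which the description does not state, and the function name `relative_shares` and the example rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def relative_shares(rows, value_column):
    """Return {(sample, species): share of the sample total} for one column.

    Rows whose value is empty or "NA" are skipped.
    """
    totals = defaultdict(float)
    parsed = []
    for row in rows:
        raw = row[value_column]
        if raw in ("", "NA"):
            continue
        value = float(raw)
        totals[row["sample"]] += value
        parsed.append((row["sample"], row["species"], value))
    return {
        (sample, species): value / totals[sample]
        for sample, species, value in parsed
        if totals[sample] > 0
    }


# Made-up rows using the column names documented above (not real data).
example_text = (
    "sample\tspecies\tmorph_biomass\tmetabarcoding_reads\n"
    "S1\tSpecies a\t3.0\t60\n"
    "S1\tSpecies b\t1.0\t40\n"
    "S2\tSpecies a\tNA\t10\n"
)
rows = list(csv.DictReader(io.StringIO(example_text), delimiter="\t"))
biomass_share = relative_shares(rows, "morph_biomass")
read_share = relative_shares(rows, "metabarcoding_reads")
print(biomass_share[("S1", "Species a")])  # 0.75
print(read_share[("S1", "Species a")])     # 0.6
```

To apply this to the real file, replace the `io.StringIO` object with an opened handle to `metadata_macrobenthos_metagenomics.txt`, adjusting the delimiter if the file turns out not to be tab-separated.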

File: Biomass_correction.txt

Description: This file is used in the statistical analysis to 'promote' the biomass that was assigned to Processa sp. and Glycera sp. to Processa modica and Glycera alba, respectively.

Variables

station: station ID
Processa: biomass (in gram) assigned to Processa sp. by morphological identification for a particular station
Glycera: biomass (in gram) assigned to Glycera sp. by morphological identification for a particular station
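One way to picture the 'promotion' step is as adding the genus-level biomass of a station to the corresponding species-level record. The sketch below does this in Python under that assumption; the actual procedure is implemented in Data_analysis_macrobenthos.Rmd and may differ. The function name, data layout and all biomass values are hypothetical.

```python
# Genus column in Biomass_correction.txt -> species that receives the biomass.
GENUS_TO_SPECIES = {
    "Processa": "Processa modica",
    "Glycera": "Glycera alba",
}


def promote_genus_biomass(species_biomass, genus_biomass):
    """Add genus-level biomass to the matching species for each station.

    species_biomass: {(station, species): grams}
    genus_biomass:   {station: {genus: grams}}
    Assumption: promotion means simple addition; this is not confirmed
    by the dataset description.
    """
    corrected = dict(species_biomass)
    for station, per_genus in genus_biomass.items():
        for genus, grams in per_genus.items():
            key = (station, GENUS_TO_SPECIES[genus])
            corrected[key] = corrected.get(key, 0.0) + grams
    return corrected


# Made-up values for a hypothetical station "ST1".
species_biomass = {("ST1", "Processa modica"): 0.5}
genus_biomass = {"ST1": {"Processa": 0.25, "Glycera": 1.0}}
corrected = promote_genus_biomass(species_biomass, genus_biomass)
print(corrected[("ST1", "Processa modica")])  # 0.75
print(corrected[("ST1", "Glycera alba")])     # 1.0
```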

File: Add_Taxid.py

Description: Script for adding NCBI taxids to FASTA sequence headers so that they are recognised by Kraken2 when building a custom database.
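For readers unfamiliar with the header convention, the sketch below shows the kind of transformation involved. It is not the contents of Add_Taxid.py. It relies on the Kraken2 manual's convention that a sequence ID containing `kraken:taxid|<taxid>` is assigned to that taxon when a custom database is built. The function names, the example header and the taxid 12345 are made up for illustration.

```python
def add_taxid_to_header(header, taxid):
    """Append '|kraken:taxid|<taxid>' to the sequence ID of a FASTA header."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line: " + header)
    # Split the sequence ID from the optional free-text description.
    seq_id, _, description = header[1:].strip().partition(" ")
    tagged = f">{seq_id}|kraken:taxid|{taxid}"
    return f"{tagged} {description}" if description else tagged


def tag_fasta_lines(lines, taxid):
    """Yield FASTA lines with every header tagged; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield add_taxid_to_header(line, taxid)
        else:
            yield line


print(add_taxid_to_header(">read_1 example description", 12345))
# >read_1|kraken:taxid|12345 example description
```

In this sketch one taxid is applied to a whole file, which matches the situation where each FASTA file holds the reads of a single species; whether the original script works per file or per sequence is not stated in the description.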

File: Macrobenthos_BPNS_v0.1.tar.gz

Description: Kraken2 custom database used in this study to classify metagenomic reads. It contains the three files required for a Kraken2 database (described in docs/MANUAL.markdown of the DerrickWood/kraken2 repository on GitHub):

hash.k2d: Contains the minimizer to taxon mappings
opts.k2d: Contains information about the options used to build the database
taxo.k2d: Contains taxonomy information used to build the database
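After extracting the archive, it can be useful to confirm that the three files listed above are present before starting a classification run. The sketch below does that check and assembles a Kraken2 command line for paired-end reads. The `--db`, `--paired`, `--report` and `--output` options are standard Kraken2 options, but the exact options used in the study are not given here, and all function names and file paths in the example are hypothetical.

```python
from pathlib import Path

REQUIRED_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_database_files(db_dir):
    """Return the required Kraken2 database files that are absent from db_dir."""
    db_path = Path(db_dir)
    return [name for name in REQUIRED_FILES if not (db_path / name).is_file()]


def kraken2_command(db_dir, reads_1, reads_2, report, output):
    """Build (but do not run) a Kraken2 command for one paired-end sample."""
    return [
        "kraken2",
        "--db", str(db_dir),
        "--paired",
        "--report", str(report),
        "--output", str(output),
        str(reads_1), str(reads_2),
    ]


# Hypothetical paths, for illustration only.
command = kraken2_command(
    "Macrobenthos_BPNS_v0.1", "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample.report.txt", "sample.kraken.txt",
)
print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once `missing_database_files` returns an empty list for the extracted directory.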

Code/software

The Data_analysis_macrobenthos.Rmd file can be opened in RStudio. The analysis was performed using R 4.3.0. The following packages were loaded when running the analysis:

phyloseq v1.41.1
ggplot2 v3.5.0
ggpubr v0.6.0

The Add_Taxid.py script adds NCBI taxids to FASTA sequence headers so they can be used to build a custom Kraken2 database. It requires Python 3.2 or higher.

Access information

Other publicly accessible locations of the data:

Data was derived from the following sources:

Sequencing data for construction of the database and environmental bulk samples can be found on ENA project PRJEB83993.
