Data and code from: Establishing DNA-based strategies for soil biodiversity assessment: Insights from carabid beetles

Fraga Dornellas, Luísa1; Mata, Vanessa2,3; Mendes, Sara1; Leitão, Ricardo1; Bratz, Marie1,4; Nascimento, Eduardo1; Costa, Joana1; Sousa, José Paulo1; Cunha, Luís1

Research facility: University of Coimbra

Published Jan 12, 2026 on Dryad. https://doi.org/10.5061/dryad.g1jwstr28

Data files (Jan 12, 2026 version, 5.64 GB)

Data.zip: 5.64 GB
README.md: 35.60 KB

Abstract

Molecular-based methods offer valuable opportunities for assessing soil biodiversity in different ecosystems. However, their reliability and large-scale applicability depend on developing and optimizing protocols and on establishing high-quality, curated local reference databases. This study aimed to evaluate key steps in the soil macroinvertebrate metabarcoding workflow, including the sample decontamination process and the efficiency of taxa recovery. Specifically, we sought to: (1) determine the impact of sample decontamination, (2) validate the species-level recovery efficiency of the metabarcoding pipeline using a curated mock community of morphologically identified and barcoded carabid beetles, and (3) compare traditional morphological identification and metabarcoding for the taxonomic assignment and recovery of specimens. Our results showed that the commonly used decontamination process did not significantly impact OTU richness, suggesting it is not essential for this fauna. Compared to morphology, metabarcoding provided a more comprehensive taxonomic overview at higher-level taxa. However, validation with the mock community revealed discrepancies in species-level recovery, underscoring that its accuracy is highly contingent on the quality of the reference database. DNA metabarcoding is a currently used and promising technique for macroinvertebrate assessment in terms of time, efficiency, and cost, and it reaches greater depth in taxonomic resolution. Yet its species-level accuracy remains dependent on comprehensive and well-curated barcode reference databases. We therefore recommend an integrative approach, combining molecular data with targeted validation, for robust and rapid biodiversity assessments. Because the common decontamination step is not crucial for the accuracy of soil macrofauna metabarcoding, removing it streamlines sample processing.
DNA metabarcoding revolutionizes soil biodiversity assessments by offering unparalleled taxonomic resolution and efficiency compared to traditional methods. Our study shows that the common decontamination step is unnecessary, streamlining workflows without compromising accuracy. However, the reliability of these molecular methods hinges on the development of curated local reference databases, underscoring the importance of integrative approaches for robust biodiversity monitoring.


Description of the data and file structure

This repository contains the data and the scripts from a study conducted as part of the CULTIVAR project in Idanha-a-Nova, Portugal. The research focused on monitoring soil macrofauna diversity through a combination of traditional morphological identification and advanced molecular techniques, such as barcoding and metabarcoding.

Study Overview

Sample Collection:
Soil macrofauna were collected from 24 plots using 216 pitfall traps set up from late November to early December 2022. Specimens were preserved in 96 % ethanol and sorted under low-power microscopy.
COI Barcode Reference Database:
Carabidae specimens were identified morphologically and DNA barcoded using Folmer primers to establish a local COI reference database. Sequences were matched to BOLD and NCBI for taxonomic assignments, and a mock community of 31 Carabid species was assembled for validation.
Metabarcoding Workflow:
Bulk community samples were decontaminated, homogenized, and DNA-extracted. A 418 bp COI fragment was amplified using a two-step PCR process and sequenced via Illumina NovaSeq. The bioinformatics pipeline used is provided in this repository as "BioinformaticPipeline.qmd".
Comparative Analysis:
The analysis of decontamination efficiency and the comparison of morphological identification, barcoding, and metabarcoding results were carried out with R-based statistical analyses, available here as "DataAnalysis.qmd". A 7 % barcode gap was employed for species delimitation, integrating project sequences with NCBI data using Assemble Species by Automatic Partitioning (ASAP). The ASAP Web results are also available.

Files and variables

BioinformaticPipeline.qmd

Quarto document used to process the raw metabarcoding sequences. It requires OBITools v4.3.0, ROBIFastread v0.1.0, VSEARCH v1.2.13, LULU v0.1.0, BOLDigger v2.2.2, and the ngs_filter.py Python script. OBITools 4 was used for sequence processing, coupled with VSEARCH and LULU (Frøslev et al., 2017) for denoising. Paired-end reads were aligned and merged; read pairs that failed to overlap were removed. With ngs_filter.py, the reads were assigned to samples and primer sequences were removed. Merged reads were collapsed into haplotypes, and singletons were then removed. The remaining haplotypes were combined into a single file and dereplicated once more. PCR errors, sequencing errors, and chimeric sequences were removed during the pipeline. The remaining sequences were then grouped by a 99 % similarity criterion to define operational taxonomic units (OTUs). Putative NUMTs (nuclear copies of the mitochondrial COI) were then identified and removed using LULU. All remaining OTUs were identified by comparing the representative sequence of each cluster against online databases through BOLDigger. Only sequences of targeted taxa were carried into downstream analysis.
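The haplotype-collapsing and singleton-removal steps described above can be illustrated with a minimal sketch. This is plain Python written for illustration only, not the OBITools or VSEARCH commands used in `BioinformaticPipeline.qmd`; the function names and the toy reads are hypothetical.

```python
from collections import Counter

def dereplicate(reads):
    """Collapse identical reads into haplotypes, keeping a read count for each."""
    return Counter(seq.upper() for seq in reads)

def drop_singletons(haplotypes, min_count=2):
    """Keep only haplotypes observed at least `min_count` times."""
    return {seq: n for seq, n in haplotypes.items() if n >= min_count}

# Toy merged reads for one sample (hypothetical sequences).
reads = ["ACGTAC", "ACGTAC", "acgtac", "ACGTTC", "GGGTAC", "GGGTAC"]

haplotypes = dereplicate(reads)      # three distinct haplotypes
kept = drop_singletons(haplotypes)   # "ACGTTC" occurs once and is dropped
print(kept)
```

In the actual workflow, the same idea is applied per sample first and then again after all samples are pooled, so a haplotype is only discarded when it is rare at the stage where it is counted.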

DataAnalysis_Revised_10_10_25.qmd

This script contains all statistical analyses and data visualizations used for the associated publication. These were performed in R v2023.06.0+421 using the packages iNEXT v3.0.0 for diversity estimates, ggplot2 v3.4.2 and ggpubr v0.6.0 for data visualization, and ggVennDiagram v1.12 for the creation of a Venn diagram.

File: Data.zip

Description: ZIP file containing all the data, separated by type of analysis.

Missing data conventions: NA = not applicable.

This file contains three folders:

1.Morpho

Contains one Excel file:

`MorphoAnalysis.xlsx`

Description: file with the abundance of morphologically sorted organisms. Samples were sorted under a low-power microscope, where organisms were counted and identified to the lowest possible taxonomic level based on morphological traits through dichotomous keys.

Variable name	Description	Units / Categories / Notes
sample	Unique pitfall sample label.	Text
Agroecosystem	Agroecosystem classification of the plot.	Categorical
TotalAbundance	Total number of individuals in that trap	Count
Araneae	Number of individuals from that group per trap	Count
Archaeognatha	Number of individuals from that group per trap	Count
Blattodea	Number of individuals from that group per trap	Count
Dermaptera	Number of individuals from that group per trap	Count
Gastropoda	Number of individuals from that group per trap	Count
Hemiptera	Number of individuals from that group per trap	Count
Isopoda	Number of individuals from that group per trap	Count
Opiliones	Number of individuals from that group per trap	Count
Orthoptera	Number of individuals from that group per trap	Count
Pseudoscorpiones	Number of individuals from that group per trap	Count
Coleoptera	Number of individuals from that group per trap	Count
Hymenoptera	Number of individuals from that group per trap	Count
Myriapoda	Number of individuals from that group per trap	Count
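A quick consistency check on this table could look like the sketch below. It assumes that `TotalAbundance` equals the sum of the group columns listed above, which the README does not state explicitly, and it works on rows already loaded as dictionaries (for example from a CSV export of the sheet). The sample labels and counts are hypothetical.

```python
GROUPS = [
    "Araneae", "Archaeognatha", "Blattodea", "Dermaptera", "Gastropoda",
    "Hemiptera", "Isopoda", "Opiliones", "Orthoptera", "Pseudoscorpiones",
    "Coleoptera", "Hymenoptera", "Myriapoda",
]

def to_int(value):
    """Convert a cell to an integer, treating empty cells and 'NA' as zero."""
    if value is None or str(value).strip() in ("", "NA"):
        return 0
    return int(float(value))

def abundance_mismatches(rows):
    """Return the sample labels whose group counts do not add up to TotalAbundance."""
    bad = []
    for row in rows:
        group_sum = sum(to_int(row.get(group)) for group in GROUPS)
        if group_sum != to_int(row.get("TotalAbundance")):
            bad.append(row["sample"])
    return bad

# Hypothetical rows, keyed by the variable names documented above.
rows = [
    {"sample": "P01", "TotalAbundance": 7, "Araneae": 3, "Coleoptera": 4},
    {"sample": "P02", "TotalAbundance": 5, "Araneae": 1, "Isopoda": "NA"},
]
print(abundance_mismatches(rows))
```

If the total in the real file also includes taxa that are not listed as separate columns, mismatches reported by this check would be expected rather than errors.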

2.OTU

Contains 3 files with the results and resources for the metabarcoding/ OTU-based molecular analyses.

`BOLDResults_curated.xlsx`

Description: Excel file containing 3 sheets:

- "BOLDigger hit"

First BOLDigger hit for each OTU, with the final "Keep?" decision. NA was added where a value is not applicable.

Variable name	Description
ID	OTU identifier
Phylum / Class / Order / Family / Genus / Species	Taxonomic ranks returned by BOLD
Similarity	% identity to BOLD record
Status	BOLDigger assignment status
Flags	BOLDigger flags
Keep?	YES / NO flag: whether OTU was retained for downstream analyses

- "RAW_BOLDigger hit"

Raw first-hit assignments exported directly from BOLDigger, before manual curation. NAs were not added and the "Keep?" column does not exist.

Variable name	Description
ID	OTU identifier
Phylum / Class / Order / Family / Genus / Species	Taxonomic ranks returned by BOLD
Similarity	% identity to BOLD record
Status	BOLDigger assignment status
Flags	BOLDigger flags

- "Annotation_07-12-2023"

Full raw annotation table exported from BOLD, with the first 20 hits for each OTU.

Variable name	Description
You searched for	Query sequence submitted
Phylum / Class / Order / Family / Genus / Species / Subspecies	Taxonomic ranks
Similarity	% similarity
Status	BOLD status
Process ID	Unique identifiers for the DNA sequencing process.
Record ID	Unique identifiers for the resulting specimen record.
BOLD BIN	The Barcode Index Number associated with a specific DNA barcode sequence within the BOLD database
Sex	Sex of voucher (if known)
Life stage	Life stage of voucher (if known)
Country	The geographical location where the specimen was collected
Identifier	The name of the researcher who submitted or identified the specimen.
Identification Method	Method used for ID
Specimen page url	Direct web link to the individual specimen's record page

`otutable_curated.tsv`

Raw file with the number of OTU reads per sample.

`otutable_keptOTUs.xlsx`

Data analysis output. File with the number of reads per sample for the retained OTUs only, obtained by crossing otutable_curated with the first BOLDigger hit for each OTU and keeping those with a final "Keep?" decision of "YES".
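The filtering that produces this file can be sketched as follows. This is an illustrative reconstruction, not the code in the analysis script: the OTU identifier column of `otutable_curated.tsv` is assumed here to be called `id`, and the toy table and hit rows are hypothetical.

```python
import csv
import io

def kept_otu_ids(hit_rows):
    """Collect the IDs of OTUs whose 'Keep?' decision is YES (case-insensitive)."""
    return {row["ID"] for row in hit_rows if row["Keep?"].strip().upper() == "YES"}

def filter_otu_table(tsv_text, keep_ids, id_column="id"):
    """Return the rows of a tab-separated OTU table whose ID is in `keep_ids`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[id_column] in keep_ids]

# Hypothetical first-hit sheet and OTU table.
hits = [
    {"ID": "OTU_1", "Keep?": "YES"},
    {"ID": "OTU_2", "Keep?": "NO"},
    {"ID": "OTU_3", "Keep?": "Yes"},
]
otu_table = "id\tS1\tS2\nOTU_1\t120\t0\nOTU_2\t15\t3\nOTU_3\t0\t48\n"

retained = filter_otu_table(otu_table, kept_otu_ids(hits))
print([row["id"] for row in retained])
```

Matching the decision case-insensitively guards against the mixed spelling ("YES", "Yes") that can appear in manually curated sheets.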

`sample.info.tsv` / `sample.info.xlsx`

These files describe each sample included in the metabarcoding dataset.
They contain the following variables:

Variable	Description	Units / Categories / Notes
`Sample`	Metabarcoding sample identifier corresponding to each pitfall trap.	Text
`id`	Internal identifier.	Alphanumeric
`i7 (revcom) - SIF`	Reverse complement of the i7 index used for library preparation.	Nucleotide sequence
`i5 (revcom) - SIF`	Reverse complement of the i5 index used for library preparation.	Nucleotide sequence
`i7`	i7 index tag used for demultiplexing.	Nucleotide sequence
`i7 oligo seq (5′–3′)`	Full oligonucleotide sequence for the i7 primer (forward orientation).	Nucleotide sequence
`i5`	i5 index tag used for demultiplexing.	Nucleotide sequence
`i5 oligo seq (5′–3′)`	Full oligonucleotide sequence for the i5 primer (forward orientation).	Nucleotide sequence
`Combo`	Combination of i7 and i5 indices used for multiplexing.	Text
`index`	Identifier for the specific index combination used in sequencing.	Text
`sample`	Unique pitfall sample label.	Text
`code`	Alphanumeric field code indicating the sampling season of the pitfall sample (T2 = Autumn).	Text
`Transect`	Transect designation within the study plot.	Text
`decontamination`	Indicates whether the sample underwent decontamination prior to extraction.	Categorical ("decontaminated", "non-decontaminated")
`SampleType`	Type of sample.	Categorical ("decontaminated", "non-decontaminated", "BPcr", "MOCK", "NegExt")
`ext.weight`	Weight of material used for DNA extraction.	grams (g)
`specimens`	Number of individual specimens in the bulk sample.	Count
`orders`	Number of taxonomic orders represented in the sample according to morphological identification.	Count
`batch`	Extraction batch identifier, batches of 24.	Integer
`extneg`	Extraction negative control identifier according to the batch.	Text / `n/a` if not applicable
`pcrneg`	PCR negative control identifier.	Text / `n/a` if not applicable
`dna-plate`	Identifier of the plate.	Text
`dna-well`	Well position within the sequencing plate.	Alphanumeric (e.g., "A01")
`mock.test`	Indicates whether the sample is part of the mock community test.	Categorical (TRUE / FALSE)
`diversity`	Classification of the diversity detected per sample by morphological analysis: high diversity > 7 orders, low diversity < 7 orders.	Categorical ("high", "low")
`concentration`	Proportion of DNA added to the Mock Test.	Numeric
`Agroecosystem`	Agroecosystem classification of the plot.	Categorical
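As an illustration of how the `decontamination` column can be joined to an OTU table, the sketch below computes observed OTU richness per sample and averages it per treatment. It is a simplified stand-in for the iNEXT-based analysis in the R script; the data structures, sample names, and read counts are hypothetical.

```python
from statistics import mean

def otu_richness(read_counts):
    """Observed richness: number of OTUs with at least one read in the sample."""
    return sum(1 for reads in read_counts.values() if reads > 0)

def mean_richness_by_treatment(otu_table, sample_info):
    """Average observed richness for each value of the `decontamination` column."""
    groups = {}
    for sample, counts in otu_table.items():
        treatment = sample_info[sample]["decontamination"]
        groups.setdefault(treatment, []).append(otu_richness(counts))
    return {treatment: mean(values) for treatment, values in groups.items()}

# Hypothetical reads per OTU for four samples.
otu_table = {
    "S1": {"OTU_1": 10, "OTU_2": 0, "OTU_3": 5},
    "S2": {"OTU_1": 3, "OTU_2": 1, "OTU_3": 0},
    "S3": {"OTU_1": 8, "OTU_2": 2, "OTU_3": 6},
    "S4": {"OTU_1": 0, "OTU_2": 0, "OTU_3": 9},
}
sample_info = {
    "S1": {"decontamination": "decontaminated"},
    "S2": {"decontamination": "decontaminated"},
    "S3": {"decontamination": "non-decontaminated"},
    "S4": {"decontamination": "non-decontaminated"},
}
print(mean_richness_by_treatment(otu_table, sample_info))
```

Observed richness ignores differences in sequencing depth, so a real comparison would normally rely on rarefaction or coverage-based estimates rather than raw means.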

BioinformaticPipeline

Folder containing:

- the raw data folder (01.RawData), with one subfolder per sample (S1 - S235), each containing the MD5.txt file and four .fq.gz files with the metabarcoding sequences to be processed by the bioinformatic pipeline
- the samplesheet.txt file
- the ngs_filter.py file
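After downloading, the integrity of the raw files can be checked against the MD5.txt file in each sample folder. The sketch below assumes that each line of MD5.txt has the form `<hex digest>  <file name>`, as written by `md5sum`; this layout is an assumption, not something documented here, and the example path in the comment is hypothetical.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_folder(folder):
    """Return {file name: True/False} for every entry listed in MD5.txt."""
    folder = Path(folder)
    results = {}
    for line in (folder / "MD5.txt").read_text().splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        expected, name = parts[0].lower(), Path(parts[-1]).name
        target = folder / name
        # A listed file that is missing counts as a failed check.
        results[name] = target.is_file() and md5_of(target) == expected
    return results

# Example call (hypothetical path): verify_folder("01.RawData/S1")
```

Reading in chunks keeps memory use low, which matters because the compressed read files can be large.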

3.Barcode

Folder has the resources and results related to the Carabid taxonomic identification.

ASAP_Result_Idanha.pdf

Description: A report summarizing the results of the ASAP (Assemble Species by Automatic Partitioning) analysis, used for species delimitation.

Appendix S2.xlsx

Description: Excel file with the cytochrome oxidase subunit I (COI) gene sequences for all Carabid beetle specimens included in this study. Each entry provides the GenBank accession number, species identification, and corresponding nucleotide sequence. These data form the basis for the barcode gap and species delimitation analyses presented in the main text.

Variable	Description
`Sequence_ID`	Unique identifier for each sequence
`Identification`	Species identification
`GenBank Description`	Metadata associated with GenBank entry
`Sequence`	Nucleotide sequence (COI gene)

TaxBarMet.xlsx

A detailed table summarizing the taxonomic, barcoding, and metabarcoding identification of carabid species. It contains four sheets:

- "All"

Description: Sheet with all the species identified in this study.

Variable	Description	Units / Categories / Notes
`Phylum`	Taxonomic phylum	Text
`Class`	Taxonomic class	Text
`Order`	Taxonomic order	Text
`Family`	Taxonomic family	Text
`Genus`	Taxonomic genus	Text
`Species`	Species name	Text
`Taxonomy`	Species presence according to taxonomy	Binomial ( 0 = absence; 1 = presence)
`Barcoding`	Species presence according to barcoding	Binomial ( 0 = absence; 1 = presence)
`Metabarcoding`	Species presence according to metabarcoding	Binomial ( 0 = absence; 1 = presence)

- "VennGenus"

Description: Sheet with the genera identified by each method.

Variable	Description	Units / Categories / Notes
`Phylum`	Taxonomic phylum	Text
`Class`	Taxonomic class	Text
`Order`	Taxonomic order	Text
`Family`	Taxonomic family	Text
`Genus`	Taxonomic genus	Text
`Taxonomy`	Genus presence according to taxonomy	Binomial ( 0 = absence; 1 = presence)
`Barcoding`	Genus presence according to barcoding	Binomial ( 0 = absence; 1 = presence)
`Metabarcoding`	Genus presence according to metabarcoding	Binomial ( 0 = absence; 1 = presence)

- "Venn>=93%"

Description: This sheet presents the species-level identifications based on the established barcode gap threshold, including only matches with ≥ 93 % sequence identity.

Variable	Description	Units / Categories / Notes
`Phylum`	Taxonomic phylum	Text
`Class`	Taxonomic class	Text
`Order`	Taxonomic order	Text
`Family`	Taxonomic family	Text
`Genus`	Taxonomic genus	Text
`Species`	Species name	Text
`Taxonomy`	Species presence according to taxonomy	Binomial ( 0 = absence; 1 = presence)
`Barcoding`	Species presence according to barcoding	Binomial ( 0 = absence; 1 = presence)
`Metabarcoding`	Species presence according to metabarcoding	Binomial ( 0 = absence; 1 = presence)
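The presence/absence columns of this sheet map directly onto the regions of a three-set Venn diagram. The sketch below shows one way to derive those regions; it is not the ggVennDiagram code used in the analysis script, and the species labels in the toy rows are hypothetical placeholders.

```python
METHODS = ("Taxonomy", "Barcoding", "Metabarcoding")

def method_sets(rows):
    """Build one set of species per method from the 0/1 presence columns."""
    return {m: {r["Species"] for r in rows if int(r[m]) == 1} for m in METHODS}

def venn_regions(rows):
    """Group species by the exact combination of methods that detected them."""
    regions = {}
    for row in rows:
        detected_by = tuple(m for m in METHODS if int(row[m]) == 1)
        if detected_by:  # species detected by no method are left out
            regions.setdefault(detected_by, set()).add(row["Species"])
    return regions

# Hypothetical rows using the column names documented above.
rows = [
    {"Species": "Species A", "Taxonomy": 1, "Barcoding": 1, "Metabarcoding": 1},
    {"Species": "Species B", "Taxonomy": 1, "Barcoding": 0, "Metabarcoding": 0},
    {"Species": "Species C", "Taxonomy": 0, "Barcoding": 1, "Metabarcoding": 1},
]

sets = method_sets(rows)
regions = venn_regions(rows)
print({key: sorted(value) for key, value in regions.items()})
```

Each key in `regions` is one area of the diagram, so the region sizes sum to the number of species detected by at least one method.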

- "Final_Classification(TableS2)"

Information used to build Table S2 in Appendix S1. According to the agreement between the methods used (DNA barcoding or DNA metabarcoding), species were classified into "Morphospecies", "IOTUS", "MOTUs", and "Not considered", following the established barcode gap (ID < 93 %). "0" represents absence. Accordingly:

Morphospecies: species only identified morphologically.

IOTUS (Integrated Operational Taxonomic Units): species for which both morphological and molecular data were available.

MOTUs (Molecular Operational Taxonomic Units): species identification that relied only on molecular data.

Sample Collection and Processing

This study, conducted under the CULTIVAR project in Idanha-a-Nova, Portugal, monitored soil macrofauna across 24 plots, each containing nine pitfall traps. A total of 216 traps were set up from late November to early December 2022 and remained in the field for 13 to 17 days. Traps were filled with ethylene glycol as a preservative and covered with lids to prevent dilution by rain. Collected specimens were preserved in 96 % ethanol until further analysis. Under low-power microscopy, organisms were counted and identified, per sample, to the lowest possible taxonomic level using morphology-based keys and literature. These community samples were stored in 96 % ethanol.

Establishing a COI Barcode Reference Database

Carabidae specimens were identified to the species level using morphological traits. To create a local barcode reference database, one to two identified specimens were bleached and washed, and DNA was extracted using a Qiagen DNeasy® kit. For smaller specimens, a non-destructive method was applied, while larger specimens required maceration of appendages. A 710 bp COI sequence was amplified by PCR using Folmer primers. Amplified products were verified via agarose gel electrophoresis, purified, and sequenced by the Sanger method. Sequences were compared against the BOLD and NCBI databases to obtain taxonomic assignments. A mock community of 31 Carabid species was created to validate the metabarcoding methods, including their recovery efficiency.

Bulk Sample Preparation and Metabarcoding

Half of the community samples went through a decontamination process using sodium hypochlorite. All community samples were homogenized into fine powder using a Bullet Blender. DNA was extracted with an E.Z.N.A.® Tissue DNA Kit. Carabid beetles were excluded as they were part of a mock community. PCR amplification targeted a 418 bp COI fragment using primers BF3 and BR2. A two-step PCR process incorporated Illumina adapters and identification indices. Final libraries were quantified, pooled, and sequenced using the Illumina NovaSeq platform. Controls ensured no contamination during amplification.

Bioinformatics Pipeline

The pipeline employed OBITools 4 for sequence processing, VSEARCH for denoising, and LULU for filtering pseudogenes. Paired-end reads were merged, primer sequences removed, and OTUs grouped by 99 % similarity. Taxonomic assignments were made using BOLD and NCBI databases. The script is available as a .qmd document.

Comparative Analysis and Statistical Validation

Morphological and molecular identification methods were compared at the order and species levels. Statistical analyses were performed using R (script available as a .qmd document). A barcode gap of 7 % was established for Carabid species delimitation. This was accomplished by combining our Carabid sequences with more than 600 sequences retrieved from NCBI (sequence accession numbers are available in the dataset) in MEGA, with genetic distances calculated via the K80 model. This allowed the creation of a Venn diagram comparing the species and genus recovery efficiency of barcoding and metabarcoding against morphology.
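For readers unfamiliar with the K80 (Kimura two-parameter) model, the sketch below computes the distance between two aligned sequences as d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. It is an illustration only; the distances in this study were calculated in MEGA. The helper `within_barcode_gap` and the reading of the 7 % gap as a simple distance cut-off are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k80_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError when the sequences are too divergent
    # for the model (argument of the logarithm <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def within_barcode_gap(distance, threshold=0.07):
    """True if the distance falls below the assumed 7 % species threshold."""
    return distance < threshold

# Toy aligned fragments (hypothetical): one transition in 20 sites.
d = k80_distance("ACGTACGTACGTACGTACGT", "GCGTACGTACGTACGTACGT")
print(round(d, 4), within_barcode_gap(d))
```

The K80 distance is always at least as large as the uncorrected proportion of differing sites, so a 7 % K80 threshold and a 93 % identity threshold are close but not strictly identical.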