Data from: Integrating megabarcoding and metabarcoding to unlock diversity and distribution data shortfalls in dark taxa
Data files
Jun 24, 2026 version files 731.90 MB
- Figure_S1.tif (1.92 MB)
- README.md (23.46 KB)
- Supplementary_Data_1.zip (354 MB)
- Supplementary_Data_2.fasta (141.33 KB)
- Supplementary_Data_3.zip (2.17 MB)
- Supplementary_Data_4.zip (12.33 MB)
- Supplementary_Data_5.zip (359.37 MB)
- Supplementary_Data_6.zip (947.98 KB)
- Table_S1.xlsx (23.15 KB)
- Table_S2.xlsx (185.05 KB)
- Table_S3.xlsx (570.04 KB)
- Table_S4.xlsx (21.43 KB)
- Table_S5.xlsx (196.35 KB)
Abstract
Persistent biodiversity data shortfalls undermine our capacity to detect species, map their distributions, and characterise their spatial genetic structure, limiting robust biogeographic analyses and the development of effective conservation strategies, particularly in hyperdiverse invertebrate groups where hidden diversity remains largely undocumented. This study develops and demonstrates the potential of an integrated high-throughput sequencing (HTS) framework to improve the representation of hidden diversity in regional species inventories and to help close critical gaps in our understanding of species distributions and genetic diversity from a conservation biogeography perspective. Focusing on the Canary Islands (Spain), the workflow combines megabarcoding of more than 4000 mesofauna specimens to generate a curated species-level molecular reference library with community DNA metabarcoding of 168 soil samples, enabling consistent taxonomic assignment across insular landscapes and increasing the spatial and genetic resolution of occurrence data. We identified 145 species of mites and springtails, including 49 species newly recorded for the archipelago and numerous genetically distinct lineages likely representing undescribed taxa. Integration of the barcode library with metabarcoding data produced 1,440 species occurrences, revealing extensive distributional gaps, multiple range expansions, and strong within-island phylogeographic structuring, indicating prevalent diversification at fine spatial scales. These results highlight a deep, taxonomically broad underestimation of soil biodiversity and demonstrate that this integrative approach provides a transferable model for advancing the biogeography, evolutionary understanding, and conservation of dark and cryptic taxa across broad taxonomic and conservation-relevant contexts.
Dataset DOI: 10.5061/dryad.j9kd51ct5
Description of the data and file structure
This README file was generated on 2026-01-28 by Irene Santos-Perdomo and is associated with Santos-Perdomo et al. (2026) Integrating megabarcoding and metabarcoding to unlock diversity and distribution data shortfalls in dark taxa.
Five tables and six data files have been uploaded. Throughout, MGBC stands for megabarcoding and MTBC for metabarcoding. The tables provide complementary information on the localities of the MGBC and MTBC samples; the primers associated with each voucher; the results of the three BLAST searches performed on the initial dataset; a summary of the statistics generated in both scripts for the species found in the MTBC libraries; and the taxonomy assigned to each voucher/sequence of the final dataset. Supplementary Data 1 and 2 are FASTA files from GenBank and BOLD Systems used to run two of the BLAST analyses. Supplementary Data 3 and 4 are folders that include the inputs needed to run the general and specific scripts, respectively. Supplementary Data 5 and 6 are folders that include the outputs generated by each script.
Files and variables
File: Figure_S1.tif
Description: Figure S1. Schematic representation of the overlap between mitochondrial COI fragments used in metabarcoding and megabarcoding approaches. The standard megabarcoding fragment (658 bp), amplified with the degenerate primers Fol-degen-for and Fol-degen-rev, is shown in relation to the shorter metabarcoding fragment (418 bp), generated using the primers Ill-B-F and Fol-degen-rev. The figure illustrates the shared region between both amplicons, highlighting the extent of overlap and their correspondence within the full-length fragment.
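The overlap shown in Figure S1 can be illustrated with a short Python sketch (the deposited scripts are in R; this is not taken from them). Because both amplicons share the reverse primer Fol-degen-rev, the sketch assumes that the 418 bp metabarcoding fragment corresponds to the last 418 bases of the 658 bp megabarcoding fragment, and that sequences are primer-free, gap-free, and in forward orientation. The function name `trim_to_metabarcoding_region` is hypothetical.

```python
FULL_LEN = 658   # megabarcoding fragment length in bp (Figure S1)
SHORT_LEN = 418  # metabarcoding fragment length in bp (Figure S1)


def trim_to_metabarcoding_region(barcode: str) -> str:
    """Return the 3'-terminal SHORT_LEN bases of a full-length barcode.

    Assumption (not stated in the README): the input is primer-free,
    unaligned, in forward orientation, and exactly FULL_LEN bases long.
    """
    if len(barcode) != FULL_LEN:
        raise ValueError(f"expected {FULL_LEN} bp, got {len(barcode)} bp")
    # The two fragments end at the same position, so drop the first
    # FULL_LEN - SHORT_LEN = 240 bases.
    return barcode[FULL_LEN - SHORT_LEN:]


# Toy sequence: 240 bases outside the shared region, 418 inside it.
toy_barcode = "A" * 240 + "C" * 418
trimmed = trim_to_metabarcoding_region(toy_barcode)
print(len(trimmed))
```

Real barcodes with indels or missing ends would need an alignment-based trim instead of a fixed offset.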
File: Table_S1.xlsx
Description: 'MGBC localities'
Variables
| Column | Entry | Value | Unit | Explanation |
|---|---|---|---|---|
| A | Sample_name | Name | | Name assigned to each sample, corresponding to the different pooled samples from which vouchers were selected |
| B | Sample_code | Name | | Code assigned to each sample |
| C | Locality | Name | | |
| D | Municipality | Name | | |
| E | Province | Name | | |
| F | Island | Name | | FV: Fuerteventura; TF: Tenerife; LG: La Gomera |
| G | Sample_date | Date number | Day/month/year | |
| H | Habitat | Name | ||
| I | X | Number | Decimal degrees | Geographic coordinates, decimal degrees |
| J | Y | Number | Decimal degrees | Geographic coordinates, decimal degrees |
Description: 'MTBC localities'
Variables
| Column | Entry | Value | Unit | Explanation |
|---|---|---|---|---|
| A | Sample_code | Name | | Code assigned to each sample corresponding to the different MTBC libraries |
| B | Y | Number | Decimal degrees | Geographic coordinates, decimal degrees |
| C | X | Number | Decimal degrees | Geographic coordinates, decimal degrees |
| D | Island | Name | | FV: Fuerteventura; GC: Gran Canaria; TF: Tenerife; LG: La Gomera; LP: La Palma; HI: El Hierro |
File: Table_S2.xlsx
Description: 'MGBC_primers'
Variables
| Column | Entry | Value | Unit | Explanation |
|---|---|---|---|---|
| A | Voucher_code | Alphanumeric Code | | Code assigned to each voucher, consisting of 'sci' (soil Canary Islands) plus consecutive numbers |
| B | TagFsequence | DNA sequence | | DNA sequence of each tag of the Forward primer, 13 bp length |
| C | TagRsequence | DNA sequence | | DNA sequence of each tag of the Reverse primer, 13 bp length |
| D | PrimerF | DNA sequence | | DNA sequence of each Forward primer, 23 bp length |
| E | PrimerR | DNA sequence | | DNA sequence of each Reverse primer, 23 bp length |
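Table_S2 lists one forward and one reverse 13 bp tag per voucher. A minimal Python sketch of how such a table could be used as a lookup from tag pair to voucher is given below. It assumes that each (TagFsequence, TagRsequence) pair is unique to one voucher, which the README does not state; the function names, voucher codes, and tag sequences are hypothetical placeholders, not values from the dataset.

```python
from typing import Dict, Iterable, Optional, Tuple

TAG_LEN = 13  # tag length in bp, as given for columns B and C of Table_S2


def build_tag_index(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], str]:
    """Map (forward tag, reverse tag) -> voucher code.

    Each row is (Voucher_code, TagFsequence, TagRsequence).
    Assumption: a tag pair identifies exactly one voucher.
    """
    index: Dict[Tuple[str, str], str] = {}
    for voucher, tag_f, tag_r in rows:
        if len(tag_f) != TAG_LEN or len(tag_r) != TAG_LEN:
            raise ValueError(f"{voucher}: tags must be {TAG_LEN} bp long")
        key = (tag_f.upper(), tag_r.upper())
        if key in index:
            raise ValueError(f"tag pair reused by {index[key]} and {voucher}")
        index[key] = voucher
    return index


def assign_voucher(
    index: Dict[Tuple[str, str], str], tag_f: str, tag_r: str
) -> Optional[str]:
    """Return the voucher for a tag pair, or None if the pair is unknown."""
    return index.get((tag_f.upper(), tag_r.upper()))


# Hypothetical rows in the layout of Table_S2 (columns A, B, C).
example_rows = [
    ("sci0001", "ACGTACGTACGTA", "TGCATGCATGCAT"),
    ("sci0002", "ACGTACGTACGTA", "GGGTTTCCCAAAG"),
]
tag_index = build_tag_index(example_rows)
print(assign_voucher(tag_index, "acgtacgtacgta", "GGGTTTCCCAAAG"))
```

Exact matching is used for simplicity; a real demultiplexing step would also have to handle sequencing errors in the tags and the degenerate bases in the primers.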
File: Table_S3.xlsx
Description: 'MGBC_BLASTs'
Variables
| Column | Entry | Value | Unit | Explanation |
|---|---|---|---|---|
| A | Voucher_code | Alphanumeric Code | | Code assigned to each voucher, consisting of 'sci' (soil Canary Islands) plus consecutive numbers |
| B-J | GenBank BLAST | Name and percentage | | BLAST results from the GenBank database, including taxonomic information from "Domain" to "Species" level, followed by the percentage of similarity; if empty, voucher without sequence |
| K-S | COInr BLAST | Name and percentage | | BLAST results from the COInr database, including taxonomic information from "Domain" to "Species" level, followed by the percentage of similarity; if empty, voucher without sequence |
| T-AB | BOLD BLAST | Name and percentage | | BLAST results from the BOLD database, including taxonomic information from "Domain" to "Species" level, followed by the percentage of similarity; if empty, voucher without sequence |
| AC-AD | BLAST_Final_ID | Name and percentage | | Final BLAST assignment after comparing the three BLAST results; if empty, voucher without sequence |
Note: Empty cells in Table_S3 indicate absent data
File: Table_S4.xlsx
Description: 'Summary_statistics_species_MTBC'
Variables
| Column | Entry | Value | Unit | Explanation |
|---|---|---|---|---|
| A | Species_Reference | Name | | Reference species from the MGBC dataset that was found in the MTBC libraries |
| B | TAXA | Name | Acari/Collembola | Taxon, 'Acari' or 'Collembola', to which the species belongs |
| C | BIOTA_new_record | Name | yes/no | New species records for the Canary Islands, indicated as 'yes', and species already recorded for the archipelago, indicated as 'no' |
| D | N.individuals | Numeric | | Number of records, calculated as the sum of the number of locations where each haplotype is found |
| E | N.localities | Numeric | | Number of localities in which each species is found |
| F | N.haplotypes | Numeric | | Number of haplotypes that each species includes |
| G | Hap.diversity | Numeric | | Hd; hap.div function of the R package pegas for the estimator described in Nei and Tajima (1981) |
| H | Nuc.diversity | Numeric | | π; nuc.div function of the R package pegas for the estimator described in Nei (1987) |
| I | Dist.min | Numeric | | Minimum pairwise genetic distance between haplotypes per species |
| J | Dist.max | Numeric | | Maximum pairwise genetic distance between haplotypes per species |
| K | Dist.mean | Numeric | | Mean pairwise genetic distance between haplotypes per species |
| L | Mean.hap.loc | Numeric | | Mean number of haplotypes per locality |
| M | Species.biota | Name | | Species name registered in BIOTA for that species (for new species records for the Canary Islands, the same name as the MGBC reference was assigned) |
| N | N.biota | Numeric | | Number of distributional records registered in BIOTA (500x500 m grid cells) for each species |
| O | N.metabarcoding | Numeric | | Number of distributional records generated from MTBC libraries (500x500 m grid cells) for each species |
| P | N.increment | Numeric | | Number of new distributional records for the Canary Islands associated with each species |
| Q | Perc.increment | Numeric | Percentage | Percentage calculated as N.increment / N.biota; 'Inf' for N.biota = 0 |
Note: Empty cells in Table_S4 indicate absent data
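To make columns G, H, and Q of Table_S4 concrete, the Python sketch below restates the estimators named there. It is an illustration only, not the R code in the deposited scripts, which call `hap.div` and `nuc.div` from pegas. Assumptions: haplotype diversity uses the sample-size-corrected form Hd = n/(n-1) · (1 − Σp²); nucleotide diversity is the mean proportion of differing sites over all sequence pairs, computed on aligned sequences of equal length; and Perc.increment is scaled by 100, which the README does not specify.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Sequence


def haplotype_diversity(haplotypes: Sequence[str]) -> float:
    """Hd = n/(n-1) * (1 - sum(p_i^2)), one label per sampled record."""
    n = len(haplotypes)
    if n < 2:
        return float("nan")  # undefined for fewer than two records
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(p * p for p in freqs))


def nucleotide_diversity(seqs: Sequence[str]) -> float:
    """Mean per-site proportion of differences over all sequence pairs."""
    if len(seqs) < 2:
        return float("nan")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    pairs = list(combinations(seqs, 2))
    total = sum(
        sum(a != b for a, b in zip(s1, s2)) / length for s1, s2 in pairs
    )
    return total / len(pairs)


def perc_increment(n_increment: int, n_biota: int) -> float:
    """N.increment / N.biota as a percentage; Inf when N.biota is 0."""
    if n_biota == 0:
        return math.inf
    return 100.0 * n_increment / n_biota  # the x100 scaling is an assumption


hd = haplotype_diversity(["h1", "h1", "h2", "h2"])
pi = nucleotide_diversity(["AAAA", "AAAT", "AATT"])
print(round(hd, 4), round(pi, 4), perc_increment(3, 12), perc_increment(3, 0))
```

The toy inputs are invented; with real data, the records passed to `haplotype_diversity` would follow the N.individuals definition in column D (one record per locality in which a haplotype occurs).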
File: Table_S5.xlsx
Description: 'MGBC_Final_taxonomy'
Variables
| Column | Entry | Value | Unit | Explanation |
|---|---|---|---|---|
| A | Voucher_code | Alphanumeric Code | | Code assigned to each voucher, consisting of 'sci' (soil Canary Islands) plus consecutive numbers |
| B | Sample_code | Alphanumeric Code | | Code assigned to each sample |
| C | Order | Name | | Taxonomic order assigned to the voucher in the 'Identified barcodes database' |
| D | Family | Name | | Taxonomic family assigned to the voucher in the 'Identified barcodes database' |
| E | Genus | Name | | Taxonomic genus assigned to the voucher in the 'Identified barcodes database' |
| F | Species | Name | | Taxonomic species assigned to the voucher in the 'Identified barcodes database' |
| G | BIOTA_new_record | Name | yes/no | New species records for the Canary Islands, indicated as 'yes', and species already recorded for the archipelago, indicated as 'no' |
| H | BIOTA_biogeographic_status | Name | NP, NS, END, IP | NP: probable native; NS: secure native; END: endemic; IP: probable introduced |
Note: Empty cells in Table_S5 indicate absent data
File: Supplementary_Data_1.zip
Description: NCBI dataset
| Part of name, separated by " " (*) | Entry | Value | Explanation |
|---|---|---|---|
| 1 | Sequence_code | Alphanumeric code | Sample code associated with each sequence |
| 2 | Taxa | Name | Taxonomic identification associated with each sequence |
| 3 | Sequenced gene | 'cytochrome oxidase subunit I (COI) gene' | Gene sequenced |
| 4 | Coding DNA Sequence | 'partial cds' | Sequence without either the start (5' end) or stop (3' end) codon |
| 5 | Gene origin | 'mitochondrial' | Sequence corresponds to a mitochondrial gene |
(*) The sequence name is structured into five main blocks, with " " separating the blocks.
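The five-block sequence name described above can be split with a regular expression, sketched here in Python. The exact punctuation between blocks (a comma after "gene", a semicolon after "partial cds") and the example header are assumptions modelled on typical GenBank definition lines, not copied from the dataset; names that use another gene wording will not match and return `None`.

```python
import re
from typing import Dict, Optional

# Assumed layout: <code> <taxa> <gene>, <cds>; <origin>
HEADER_RE = re.compile(
    r"^(?P<sequence_code>\S+)\s+"
    r"(?P<taxa>.+?)\s+"
    r"(?P<gene>cytochrome oxidase subunit I \(COI\) gene),?\s+"
    r"(?P<cds>partial cds);?\s+"
    r"(?P<origin>mitochondrial)$"
)


def parse_ncbi_header(header: str) -> Optional[Dict[str, str]]:
    """Split a FASTA header into the five blocks, or return None."""
    match = HEADER_RE.match(header.lstrip(">").strip())
    return match.groupdict() if match else None


# Hypothetical header; the accession and taxon are placeholders.
example_header = (
    ">XX000000.1 Genus species cytochrome oxidase subunit I (COI) gene, "
    "partial cds; mitochondrial"
)
parsed = parse_ncbi_header(example_header)
print(parsed["sequence_code"], "|", parsed["taxa"])
```

Anchoring on the fixed gene phrase, rather than splitting on every space, keeps multi-word taxon names in a single field.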
File: Supplementary_Data_2.fasta
Description: BOLD sequences added to Dataset CBG.R1.21-Mar-2024
| Part of name, separated by "\|" | Entry | Value | Explanation |
|---|---|---|---|
| 1 | BOLD_ID | Alphanumeric Code | Code assigned to the sequences when they were uploaded to BOLD Systems |
| 2 | Taxonomic ID | Name | Taxonomic identification associated with each sequence |
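Given the two-part, "|"-separated names described above, a minimal Python reader for a FASTA file of this kind could look as follows. The reader is a generic sketch, not part of the deposited scripts, and the identifiers and sequences in the example are invented placeholders.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_bold_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (BOLD_ID, taxonomic ID, sequence) for each FASTA record."""
    bold_id: Optional[str] = None
    taxon = ""
    chunks: List[str] = []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if bold_id is not None:
                yield bold_id, taxon, "".join(chunks)
            fields = line[1:].split("|")
            bold_id = fields[0]
            # Tolerate headers that lack the second field.
            taxon = fields[1] if len(fields) > 1 else ""
            chunks = []
        else:
            chunks.append(line)  # sequences may span several lines
    if bold_id is not None:
        yield bold_id, taxon, "".join(chunks)


# Invented two-record example in the header layout described above.
example_fasta = io.StringIO(
    ">ABCDE001-24|Genus species\nACGT\nTTGA\n>ABCDE002-24|Genus sp.\nGGCC\n"
)
records = list(read_bold_fasta(example_fasta))
print(records[0])
```

For the real file, the same function can be applied to an open file handle, for example `open("Supplementary_Data_2.fasta")`.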
File: Supplementary_Data_3.zip
Description: inputs to run general script
| File name | Type of file | Value | Explanation |
|---|---|---|---|
| Coord_CanaryLaurel | Text | Name and numeric | MTBC coordinates associated with each library |
| Metabarcoding_CanaryLaurel | Fasta | DNA sequences | MTBC libraries |
| MGBC_4761_418pb_20251010 | Fasta | DNA sequences | MGBC reference barcodes |
| line_RUN_4761_MTBC | Text | Name and numeric | Example of a command line to execute the script from the Windows Command Prompt (CMD) |
| MGBC_general_script | R script | Script | General R script generated in this study |
File: Supplementary_Data_4.zip
Description: inputs to run specific script
| File name | Type of file | Value | Explanation |
|---|---|---|---|
| BIOTA_ACA_CLL_20251013 | Text | Name and numeric | Occurrences of Acari and Collembola species already registered in BIOTA |
| MGBC_BIOTA_correspondences_script1_91species_20251013 | Text | Name | Correspondence of species names between BIOTA (column 'Species_BIOTA') and our reference dataset (column 'Reference') |
| grid | Folder | Geographic files | Folder containing the six files needed to draw the 500x500 m grid cells used to map the BIOTA occurrences of each species |
| Integrative_table | Text | Name and numeric | Output generated by the general script, with MGBC and MTBC occurrence information |
| MGBC_specific_script.R | R script | Script | Specific R script generated in this study |
File: Supplementary_Data_5.zip
Description: Examples of outputs generated in the general script
| File name | Type of file | Value | Explanation |
|---|---|---|---|
| MTBC_haplotypes_total | PDF | Haplotype networks | Haplotype networks (Parsimony Network (TCS); Minimum Spanning Tree (MST); Randomized Minimum Spanning Tree (RMST); Median-Joining Network (MJN)) generated for each species from the MTBC libraries, compiled into a single PDF file; species names are given as plot titles; HI: El Hierro, LP: La Palma, LG: La Gomera, TF: Tenerife, GC: Gran Canaria, FV: Fuerteventura, MV: Median vector |
| MTBC_occurrences_total | PDF | Occurrence maps | Occurrence maps generated for each species from the MTBC libraries, compiled into a single PDF file; species names are given as map titles; HI: El Hierro, LP: La Palma, LG: La Gomera, TF: Tenerife, GC: Gran Canaria, FV: Fuerteventura |
| MTBC_trees_total | PDF | Phylogenetic trees | Phylogenetic trees (Unweighted Pair Group Method with Arithmetic Mean (UPGMA), Neighbour-joining (NJ), Minimum Evolution (ME), Ordinary Least Squares (OLS), Maximum Parsimony (MP), Maximum Likelihood (ML)) generated for each species from the MTBC libraries, compiled into a single PDF file; species names are given as plot titles; HI: El Hierro, LP: La Palma, LG: La Gomera, TF: Tenerife, GC: Gran Canaria, FV: Fuerteventura |
File: Supplementary_Data_6.zip
Description: outputs generated in the specific script
| File name | Type of file | Value | Explanation |
|---|---|---|---|
| "Genus_species" maps (x 91) | PDF | Distribution maps | 91 PDF files, one per MTBC species, each a distribution map showing the occurrences in BIOTA and the new distributional records found in the MTBC libraries; HI: El Hierro, LP: La Palma, LG: La Gomera, TF: Tenerife, GC: Gran Canaria, FV: Fuerteventura; green squares correspond to BIOTA records and blue dots to MTBC records |
Code/software
The R scripts, provided in Supplementary Data 3 and 4, include descriptions of the functions and packages they use and of each step they execute.
