Multi‐barcoding‐based Gastropoda identification using hierarchical attention network with staged curriculum learning
Data files
Jan 28, 2026 version files (6.78 GB):
- data.zip (17.06 MB)
- embedding.zip (6.54 GB)
- pretrained_models.zip (225 MB)
- README.md (4.18 KB)
Abstract
This dataset contains DNA barcoding sequences of six types (COI, 16S, H3, 18S, ITS1, and ITS2) for Gastropoda species, accessed from GenBank and BOLD, together with the corresponding RNA secondary structure embeddings of all six barcoding types and the DNA barcoding embeddings of the COI sequences. The RNA secondary structure embeddings are extracted using the RNA foundation model ERNIE-RNA (https://github.com/Bruce-ywj/ERNIE-RNA) (Yin et al. 2025, Nature Communications, 16: 10076). The DNA barcoding embeddings of the COI sequences are extracted using the DNA barcoding foundation model BarcodeMAE (https://github.com/bioscan-ml/BarcodeMAE) (Safari et al. 2025, arXiv, 2502.18405). These sequences and embeddings serve as the training, validation, testing, and independent testing data for a deep learning model named SnailBaLLsp, which is developed for multi‐barcoding‐based Gastropoda identification using a hierarchical attention network with staged curriculum learning.
Dataset DOI: 10.5061/dryad.ttdz08m9c
Description of the data and file structure
The datasets for training, validation, testing, and independent testing contain columns giving the sequence source, sequence length, and hierarchical taxonomic names at several taxonomic levels. The COI data have full records at all taxonomic levels. The other five barcoding types cover only a subset of the COI samples; each of their records carries an index that corresponds to the sample index of the matching COI sample.
The embedding data are extracted with ERNIE-RNA (Yin et al. 2025, Nature Communications, 16: 10076) for all six barcoding types and with BarcodeMAE (Safari et al. 2025, arXiv, 2502.18405) for COI barcoding only. The embedding extraction protocol can be found in the documentation of ERNIE-RNA (https://github.com/Bruce-ywj/ERNIE-RNA) and BarcodeMAE (https://github.com/bioscan-ml/BarcodeMAE).
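The index correspondence between COI samples and the other barcoding types can be sketched as follows. This is only a minimal illustration: it assumes that each non-COI record stores a 0-based integer position of its matching COI sample, and the function name, array names, and toy values are hypothetical rather than taken from the dataset files.

```python
import numpy as np

def align_to_coi(n_coi, coi_index, values, fill=""):
    """Place partial-barcode values at the positions of their COI samples.

    Returns (aligned, present): `aligned` has one slot per COI sample and
    `present` marks which COI samples have a record for this barcode.
    """
    coi_index = np.asarray(coi_index, dtype=int)
    aligned = np.full(n_coi, fill, dtype=object)
    present = np.zeros(n_coi, dtype=bool)
    aligned[coi_index] = values
    present[coi_index] = True
    return aligned, present

# Toy example (hypothetical values): 5 COI samples, 2 of which also have 16S.
seq_16s = np.array(["ACGT", "GGCA"], dtype=object)
idx_16s = np.array([1, 3])  # assumed COI sample index for each 16S record
aligned_16s, has_16s = align_to_coi(5, idx_16s, seq_16s)
```

The boolean mask makes it explicit which COI samples lack a given barcode, which is useful when a model has to handle missing barcoding types per sample.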
Files and variables
File: data.zip
Description:
data_Train_Val_Test - Gastropoda sequence data for COI with full records of taxonomy levels, and 16S, H3, 18S, ITS1, and ITS2 with corresponding sample index of partial records. These sequences are accessed from GenBank by 2024-12-31. Within the folder, each data table has a sequence accession number, sequence, multi-level classification, and other corresponding information. 'trainval_idx.npy' and 'test_idx.npy' hold the indices of the samples in the training-validation set and the test set as partitioned for the model, and can be read using numpy in Python.
data_Independent - Gastropoda sequence data for COI with full records of taxonomy levels, and 16S, H3, 18S, ITS1, and ITS2 with corresponding sample index of partial records. These sequences are accessed from GenBank from 2025-01-01 to 2025-05-23 (https://www.ncbi.nlm.nih.gov/), and from BOLDistilled (Prosser et al. 2025, Molecular Ecology Resources, 25, e70043, http://doi.org/10.5281/zenodo.15442656) by March 2025. The folder 'Indep_3146_samples' represents samples from the independent testing dataset formed by merging Independent Testing 1 (GenBank) and Independent Testing 2 (BOLD), where all samples have complete labels across every taxonomy level.
Case_Study_Gastropoda - Gastropoda sequence data for COI with full records of taxonomy levels, and 16S, H3, and 18S with corresponding sample index of partial records. These sequences are accessed from GenBank from 2025-06-01 to 2025-10-31.
Case_Study_Bivalvia - Bivalvia sequence data for COI with full records of taxonomy levels, and 16S, H3, 18S, ITS1, and ITS2 with corresponding sample index of partial records. These sequences are accessed from GenBank by 2025-11-20, and from BOLDistilled (Prosser et al. 2025, Molecular Ecology Resources, 25, e70043, http://doi.org/10.5281/zenodo.15442656) by March 2025.
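Reading the 'trainval_idx.npy' and 'test_idx.npy' files from data_Train_Val_Test might look like the sketch below. It assumes the files hold integer sample positions and that the two partitions are disjoint; the helper name, the stand-in files, and the placeholder labels are hypothetical and exist only so the example runs on its own.

```python
import os
import tempfile
import numpy as np

def load_split_indices(folder):
    """Load the train/validation and test index arrays from `folder`."""
    trainval_idx = np.load(os.path.join(folder, "trainval_idx.npy"))
    test_idx = np.load(os.path.join(folder, "test_idx.npy"))
    # Sanity check (assumption): a sample should not be in both partitions.
    overlap = np.intersect1d(trainval_idx, test_idx)
    if overlap.size:
        raise ValueError(f"{overlap.size} indices occur in both splits")
    return trainval_idx, test_idx

# Demonstration on stand-in files written to a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    np.save(os.path.join(demo_dir, "trainval_idx.npy"), np.array([0, 2, 3]))
    np.save(os.path.join(demo_dir, "test_idx.npy"), np.array([1, 4]))
    trainval_idx, test_idx = load_split_indices(demo_dir)

labels = np.array(["a", "b", "c", "d", "e"])  # placeholder per-sample data
train_labels = labels[trainval_idx]
test_labels = labels[test_idx]
```

With the real data, `folder` would point to the extracted data_Train_Val_Test directory, and the index arrays would select rows from the COI table or from the embedding arrays.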
File: embedding.zip
Description: RNA secondary structure embeddings for all six barcoding types and DNA barcoding embeddings for COI only, used for model training, validation, and testing. Files can be read with numpy and torch in Python.
File: pretrained_models.zip
Description: The pretrained models from every training stage. Models can be loaded with torch and sklearn in Python.
Code/software
The embedding data can be viewed using the numpy and torch packages in Python.
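A small loader for the embedding files could be written as below. This is a sketch under stated assumptions: the file extensions ('.npy' for numpy arrays, '.pt' or '.pth' for torch objects), the helper name, and the stand-in array shape are guesses for illustration and are not confirmed by the dataset description.

```python
import os
import tempfile
import numpy as np

def load_embedding(path):
    """Load an embedding file, choosing the reader from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".npy":
        return np.load(path)
    if ext in (".pt", ".pth"):
        import torch  # imported lazily so numpy-only use needs no PyTorch
        # map_location="cpu" lets GPU-saved tensors load on CPU-only machines.
        return torch.load(path, map_location="cpu")
    raise ValueError(f"unsupported embedding file type: {ext!r}")

# Demonstration with a stand-in array (4 samples x 8 dimensions, arbitrary).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo_embedding.npy")
    np.save(demo_path, np.zeros((4, 8), dtype=np.float32))
    emb = load_embedding(demo_path)

print(emb.shape, emb.dtype)
```

Printing the shape and dtype is a quick way to check that the number of rows matches the number of samples in the corresponding data table before the embeddings are used.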
Access information
Other publicly accessible locations of the data:
Data was derived from the following sources:
