Data from: Data fusion for integrative species identification using deep learning
Abstract
DNA analyses have revolutionized species identification and taxonomic work. Yet, persistent challenges arise from limited differentiation among species and considerable variation within species, particularly among closely related groups. While images are commonly used as an alternative modality for automated identification tasks, their usability is limited by the same concerns. An integrative strategy, fusing molecular and image data through machine learning, holds significant promise for fine-grained species identification. However, a systematic overview and rigorous statistical testing concerning molecular and image preprocessing and fusion techniques, including practical advice for biologists, are missing so far. We introduce a machine learning scheme that integrates both molecular and morphological data for species identification. Initially, we systematically assess and compare three DNA arrangement methods and two encoding methods. Later, artificial neural networks are used to extract visual and molecular features, and we propose strategies for fusing this information. Specifically, we investigate three strategies: I) fusing directly after feature extraction, II) fusing features that passed through a fully connected layer after feature extraction, and III) fusing the output scores of both unimodal models. We systematically and statistically evaluate these strategies for four eukaryotic datasets, including two plant families (Asteraceae, Poaceae) and two animal families (Lycaenidae, Coccinellidae), using Leave-One-Out Cross-Validation. In addition, we developed an approach to understand molecular- and image-specific identification failure. Aligned sequences with nucleotides encoded as vectors of decimal numbers achieved the highest identification accuracy among DNA data preprocessing techniques in all four datasets. Fusing molecular and visual features directly after feature extraction yielded the best results for three out of four datasets (52-99%).
Overall, combining DNA with image data significantly increased accuracy in three out of four datasets, with plant datasets showing the most substantial improvement (Asteraceae: +19%, Poaceae: +13.6%). Even for Lycaenidae with high identification accuracy based on molecular data (>96%), a statistically significant improvement was observed (+2.1%). Detailed analysis of confused samples shows that DNA tends to identify the genus correctly, but fails to recognize the species. This shortcoming is alleviated by including morphological data into the training, hinting towards a hierarchical role of modalities. We systematically showed and explained, for the first time, that optimal preprocessing and integration of molecular and image data offers significant benefits, particularly for genetically similar and morphologically indistinguishable species, enhancing species identification by reducing modality-specific failure rates and information gaps. Our results can inform integration efforts for various organism groups, improving automated identification across a wide range of eukaryotic species.
https://doi.org/10.5061/dryad.4qrfj6qjk
Description of the data and file structure
Data
The data folder contains the records and alignment files for each of the four datasets used in this study (i.e., Asteraceae, Poaceae, Coccinellidae, Lycaenidae). Each dataset's .tsv file contains the following information about the records: 'record_id' as a uniquely assigned custom ID for the record; 'species_name' is the name of the species; 'taxonomy' is the taxonomic information linked to the species provided by NCBI; 'genbank_accession' is the GenBank accession provided by NCBI and is included for completeness; 'image_url' is the original URL of the record's image; 'image_rights_holder' is the rights holder of the image if provided by GBIF. Sequences in the respective fasta files can be linked to their records via their unique record ID (e.g., 'BOLD642' in column 'record_id' and in fasta header '>BOLD642 rbcLa Antennaria_densifolia').
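For illustration, records and sequences can be joined in a few lines of Python via the shared record ID; the file contents below are toy stand-ins for the real .tsv and .fa files:

```python
import csv
import io

def read_fasta(handle):
    """Parse a FASTA stream into {record_id: sequence}, where the record ID is
    the first token of each header (e.g. '>BOLD642 rbcLa Antennaria_densifolia')."""
    seqs, rid, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if rid is not None:
                seqs[rid] = "".join(chunks)
            rid, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if rid is not None:
        seqs[rid] = "".join(chunks)
    return seqs

# Toy stand-ins for e.g. Asteraceae.tsv and Asteraceae_rbcLa.fa
tsv_text = "record_id\tspecies_name\nBOLD642\tAntennaria densifolia\n"
fasta_text = ">BOLD642 rbcLa Antennaria_densifolia\nACGTAC\nGTACGT\n"

records = {row["record_id"]: row
           for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t")}
sequences = read_fasta(io.StringIO(fasta_text))

for rid, row in records.items():
    print(rid, row["species_name"], sequences[rid])  # BOLD642 Antennaria densifolia ACGTACGTACGT
```

The same pattern applies to all four datasets, since every fasta header starts with the 'record_id' value from the corresponding .tsv file.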
List of data files
Asteraceae/Asteraceae.tsv
Details relevant information about 970 Asteraceae records. Information on the meaning of each column can be found in the Data section.
Asteraceae/Asteraceae_rbcLa.fa
Contains 970 Asteraceae sequences of the marker rbcLa used for training/evaluation. Sequences in the respective fasta files can be linked to their records via their unique record ID (e.g., 'BOLD642' in column 'record_id' and in fasta header '>BOLD642 rbcLa Antennaria_densifolia').
Poaceae/Poaceae.tsv
Details relevant information about 1118 Poaceae records. Information on the meaning of each column can be found in the Data section.
Poaceae/Poaceae_rbcLa.fa
Contains 1118 Poaceae sequences of the marker rbcLa used for training/evaluation. Sequences in the respective fasta files can be linked to their records via their unique record ID (e.g., 'BOLD11384' in column 'record_id' and in fasta header '>BOLD11384 rbcLa Elymus_caninus').
Coccinellidae/Coccinellidae.tsv
Details relevant information about 1092 Coccinellidae records. Information on the meaning of each column can be found in the Data section.
Coccinellidae/Coccinellidae_COI.fa
Contains 1092 Coccinellidae sequences of the marker COI-5P used for training/evaluation. Sequences in the respective fasta files can be linked to their records via their unique record ID (e.g., 'BOLD1389' in column 'record_id' and in fasta header '>BOLD1389 COI-5P Coccinella_trifasciata').
Lycaenidae/Lycaenidae.tsv
Details relevant information about 5520 Lycaenidae records. Information on the meaning of each column can be found in the Data section.
Lycaenidae/Lycaenidae_COI.fa
Contains 5520 Lycaenidae sequences of the marker COI-5P used for training/evaluation. Sequences in the respective fasta files can be linked to their records via their unique record ID (e.g., 'BOLD9074' in column 'record_id' and in fasta header '>BOLD9074 COI-5P Jamides_bochus').
Practical guide
Setup environment
Python environment setup:
# 1st option
# create environment with all dependencies via conda
conda env create -f environment.yml
# 2nd option
# create and activate virtual python environment
conda create -n integrative_dl python
conda activate integrative_dl
# install required packages
python3 -m pip install -r requirements.txt
For BLAST install BLAST+ v2.15.0. You can either install it into your conda environment (https://anaconda.org/bioconda/blast) or by following the instructions at https://www.ncbi.nlm.nih.gov/books/NBK569861/.
In addition, you will need to have R installed (v4.3.1). The pipeline will install the required packages itself.
Example usage
Let us assume we want to train our models on a dataset that contains data of the Asteraceae family. We will crawl DNA data and images from GenBank, BOLD, and GBIF first by using the scripts within src/dataset_collection found in scripts.zip.
cd src/01_dataset_collection
python main.py --query 'Asteraceae' --project-dir ${YOUR_PROJECT_DIRECTORY}
We will apply manual image filtering (the script will prompt us to do so) by starting a jupyter notebook server and running the code within the second-to-last cell.
DO NOT RUN qc.add_train() the first time you are prompted to manually check the images.
Don't forget to add your job name, the marker that the pipeline chose during dataset collection, and the directory that contains your project in the respective fields before running the cell.
jupyter notebook quality_filtering.ipynb
Afterwards, we will run the main script within src/01_dataset_collection again. Keep in mind that manual filtering needs to be done twice and that, both times, the main script needs to be run afterwards.
Next, we will train models based on a traditional train/val split (where two samples per species will be used for validation) and LOOCV.
cd src/02_classification
python main.py --job-id 'Asteraceae' --root-dir ${YOUR_PROJECT_DIRECTORY}
The results will be stored in a results_${run_index}.tsv file. The run index is determined by an argument that can be given to src/02_classification/main.py to run multiple LOOCV runs in parallel on the same machine - with shared logs to allocate samples to each of the processes.
To leverage this functionality, --runs needs to be set to >1, i.e. the number of parallel LOOCV runs. The argument --run-index then determines the index of the current run. For instance, --runs 4 --run-index 2 means that there are 4 parallel runs in total and the current run is the third (due to 0-based indexing).
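As a toy illustration of these semantics (the pipeline itself allocates samples to runs via shared logs, so the real assignment differs), a simple striped allocation of LOOCV sample indices would look like this:

```python
def samples_for_run(n_samples, runs, run_index):
    """Illustrative striped allocation of LOOCV sample indices to one of
    `runs` parallel runs; mirrors the --runs / --run-index semantics only."""
    return [i for i in range(n_samples) if i % runs == run_index]

# With 10 samples and --runs 4 --run-index 2 (the third run, 0-based):
print(samples_for_run(10, 4, 2))  # [2, 6]
```

Together, the four runs cover every sample exactly once, which is what the shared logs guarantee in the actual pipeline.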
To evaluate the influence of, e.g., DNA sequence length within the training dataset, we can leverage src/03_GLM/ModelOptimizer.R. For a more flexible approach to regression modeling, consider using LazyModeler, which builds upon ModelOptimizer.R.
To generate plots and statistics based on the results, we can run each of the R scripts within src/04_results_evaluation. Again, the required packages will be automatically installed. Remember to set the directory that the results are stored in (the variable is called base_dir).
Detailed description of scripts (hosted on Zenodo)
Dataset collection
The scripts in scripts.zip are responsible for collecting a dataset based on a query (e.g., a family). Records will be crawled from BOLD and GenBank, while images are gathered from BOLD, GBIF, and (in the case of in situ images) a local Flora Capture folder. The only scripts that need to be actively started are the main script and the jupyter notebook for manual quality filtering. The main script will ask the user to check the quality at some point; the jupyter notebook then needs to be used and the main script restarted. In total, the jupyter notebook will be needed twice.
Command example:
python main.py --query 'Asteraceae'
This will crawl preserved specimens for Asteraceae, with a minimum number of 40 species in the dataset and at least 4 entries per species. The script uses 8 processes by default.
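The species and per-species thresholds can be sketched as follows (a simplified stand-in for the actual filtering logic; the function and argument names are illustrative):

```python
from collections import Counter

def apply_thresholds(records, min_per_species=4, min_species=40):
    """Keep only species with at least `min_per_species` records, and require
    at least `min_species` such species overall (mirrors the defaults
    described above; names are illustrative, not the pipeline's API)."""
    counts = Counter(species for _, species in records)
    kept = [(rid, sp) for rid, sp in records if counts[sp] >= min_per_species]
    n_species = len({sp for _, sp in kept})
    if n_species < min_species:
        raise ValueError(f"only {n_species} species pass the threshold")
    return kept
```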
R/genetic_distance.R
Calculates a distance matrix based on sequence identity for a given alignment file and writes the matrix in tsv format to a provided output path.
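The underlying computation amounts to pairwise sequence identity over an alignment; a minimal Python equivalent (note that the R script's exact gap handling may differ) is:

```python
def identity_distance(a, b):
    """Distance = 1 - fraction of identical positions between two aligned
    sequences (here, gaps simply count as mismatches)."""
    assert len(a) == len(b), "sequences must come from the same alignment"
    matches = sum(x == y for x, y in zip(a, b))
    return 1 - matches / len(a)

# Toy alignment of three sequences
aln = {"seq1": "ACGT-A", "seq2": "ACGTTA", "seq3": "TCGTTA"}
ids = sorted(aln)
matrix = [[identity_distance(aln[i], aln[j]) for j in ids] for i in ids]
```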
quality_filtering.ipynb
Jupyter notebook used to semi-manually filter images by iteratively printing images and asking for (dis)approval. This is the only script apart from main.py that needs to be actively run. When run for the second time, the script will add a column used for splitting training and validation data.
barcode_filtering.py
Contains methods for filtering DNA data. This includes filtering based on clusters and their gap and SNP content, which uses vsearch, snp-sites and vcftools as third-party tools. This script also provides the information needed to choose the best barcode based on the metrics described in the Methods section.
commons.py
Includes methods that are shared between multiple classes. Besides a log method, methods in this script are mainly related to image crawling, downloading, and checking.
data_bold.py
This script is responsible for downloading data from BOLD and formatting the data for easier access by methods further down the pipeline.
data_ncbi.py
Similarly, this script is responsible for downloading data from NCBI and formatting the data for easier access by methods further down the pipeline.
data_padding.py
We search BOLD for combined records, i.e., records with a genetic sequence and an attached image. NCBI, however, does not provide images. Therefore, we need to add images to the records that are not linked to an image already. In this study, we used Flora images alongside in situ images for the Poaceae dataset. Thus, there are methods for searching for images on disk. Other methods crawl images from GBIF, either by searching for in situ iNaturalist images or by searching for preserved specimens and attaching the URLs to the records. When BOLD records are removed based on their barcode during barcode filtering, the attached images are re-used where possible.
dataset.py
This file contains all methods handling the compilation of a dataset. It starts both BOLD and NCBI data crawling, applies the threshold multiple times, chooses the best marker based on the results of the barcode filtering, and checks for image and GenBank accession duplicates.
main.py
Main script file and the only script apart from quality_filtering.ipynb that needs to be actively run by the user. This script takes all user arguments for running the complete pipeline to create a dataset based on a) a given taxonomic group, b) a file with GenBank accession numbers, or c) a BOLD container. For detailed information on the available user arguments, please refer to the file or run main.py --help.
server_prep.py
The methods within this script prepare the genetic and image data for consumption by the methods in 02_classification. This preparation includes the encoding and aligning/SNP-reduction of the genetic sequences.
stats.py
Responsible for printing information about a given dataset. This includes genetic distances within the dataset and overall dataset information such as the number of species.
Classification
This folder contains the scripts for training the ML models for species identification. If random forest is to be applied, the random forest grid search needs to be run first. Then, the main script is the one that needs to be actively started. All other scripts are automatically included and run.
barcodejpg_dataset.py
Contains the dataset class for the image+DNA dataset that is given to the DL model. If the model is trained on either the DNA alone or on both data types, the DNA is directly loaded into memory to save time during training.
blast.py
Contains methods to run BLAST both with a traditional train/val split and with leave-one-out cross-validation.
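Conceptually, leave-one-out evaluation classifies every sample against all remaining ones; a generic sketch (independent of BLAST itself) looks like this:

```python
def leave_one_out(samples, classify):
    """Generic LOOCV: for each (features, label) sample, classify it against
    all other samples and report the overall accuracy."""
    correct = 0
    for i, (x, label) in enumerate(samples):
        reference = samples[:i] + samples[i + 1:]  # everything except sample i
        correct += classify(x, reference) == label
    return correct / len(samples)

# Example with a toy 1-D nearest-neighbour "classifier":
samples = [(0.0, "a"), (0.1, "a"), (5.0, "b"), (5.1, "b")]
nn = lambda x, ref: min(ref, key=lambda r: abs(r[0] - x))[1]
print(leave_one_out(samples, nn))  # 1.0
```

In the BLAST case, the "classifier" queries the held-out sequence against a database built from all other sequences.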
early_stopping.py
Small class that adds early stopping to training the DL model.
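The idea behind such a helper can be sketched as follows (a minimal version; the actual class may track different quantities):

```python
class EarlyStopping:
    """Minimal early-stopping helper: signal a stop once the monitored
    validation loss has not improved by at least `min_delta` for
    `patience` consecutive epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0  # improvement: reset counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training
```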
main.py
Main script file and the only script that needs to be actively run by the user. A list of available parameters can be accessed by running 'python main.py --help'. The script includes ways to parallelize runs using a single or multiple machines. Additionally, the user can choose which processing, encoding, and classifier options should be considered when running the pipeline. Another option is the application of LOOCV or k-fold CV.
model_bar_resnet_sequential_embedding.py
Same as model_bar_resnet.py with small adjustments to handle ordinal/sequential DNA encoding. The main adjustment is the addition of a layer that transforms the fractional encoding of the DNA into a DL-driven ordinal encoding.
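For intuition, a fractional encoding maps each nucleotide to a decimal number; the mapping below is purely illustrative (the values used in the pipeline may differ), and the added layer then learns a better ordinal spacing from such inputs:

```python
# Hypothetical fractional encoding of nucleotides as decimal numbers;
# gaps and ambiguous bases are mapped to 0.0 here for simplicity.
FRACTIONAL = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00, "-": 0.0, "N": 0.0}

def encode(seq):
    """Turn a DNA string into a list of decimal numbers."""
    return [FRACTIONAL.get(base, 0.0) for base in seq.upper()]

print(encode("ACGT-N"))  # [0.25, 0.5, 0.75, 1.0, 0.0, 0.0]
```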
model_bar_resnet.py
Here, the DL model that handles the DNA is defined. The content mainly originates from the PyTorch GitHub repository, with small adjustments to the channel size and the classifier.
model_barimg.py
This script defines the model used for the fusion experiments. It uses the CNNs of both the barcode/DNA and image models, defines the classifier and how the two input types are forwarded through the network.
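Strategy I from the abstract (fusing directly after feature extraction) boils down to concatenating the two feature vectors before a shared classifier; a NumPy sketch with hypothetical feature sizes and an untrained linear head:

```python
import numpy as np

rng = np.random.default_rng(0)
dna_features = rng.normal(size=(1, 512))   # hypothetical DNA-CNN feature vector
img_features = rng.normal(size=(1, 2048))  # hypothetical ResNet-50 feature vector

# Concatenate the unimodal feature vectors directly after feature extraction...
fused = np.concatenate([dna_features, img_features], axis=1)  # shape (1, 2560)

# ...then feed them to a single classifier head (here a random, untrained
# linear layer standing in for the trained fully connected classifier).
n_species = 40
W = rng.normal(size=(fused.shape[1], n_species))
scores = fused @ W
predicted = int(scores.argmax())
```

The real model instead forwards both inputs through trained CNN backbones and a learned classifier, but the data flow is the same.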
model_img.py
This script defines the model used for the images. It directly uses the PyTorch-defined ResNet-50 and adds a custom classifier.
random_forest_grid_search.py
This script is not automatically run when calling main.py. It is used to determine the parameters to be used during the main training by applying grid search and training and evaluating on the training subsets of the datasets.
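A minimal version of such a grid search, here with scikit-learn on toy data and a purely hypothetical parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Tiny stand-in for a training subset; the real data and grid differ.
X, y = make_classification(n_samples=60, n_features=8, random_state=0)
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}  # illustrative grid

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The best parameters found this way are then fixed for the main training runs.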
train_parent.py
This script contains a wrapper class that starts both traditional and leave-one-out training. For LOO, it automatically prepares the dataset while considering the subset size of 4 that is used in this study. It also checks which of the preprocessings performed best and skips already trained-on validation indices.
train.py
This is the main script for training. It first checks if the specific training was already done or is currently handled by another process. Then it sets up the models and model parameters, trains the specified models (separate and fused), saves the models and results, and creates a history plot visualizing information about the accuracy and loss during each epoch.
GLM - ModelOptimizer.R
Here, the R script for automatic GLM model simplification resides. It can remove autocorrelations and works through a list of coefficients sorted by relevance to remove predictors that are not significantly correlated.
Results evaluation
Contains the scripts that were used for plot creation.
barcode_preprocessing.R
Produces a figure for comparison of DNA preprocessing methods (arrangement & encoding).
confusion_w_genetic_distances.R
Analysis of genetic distances of samples that were misidentified by at least one model (either unimodal or multimodal). The resulting figure contains information on duplicates and identification levels.
fusion_methods.R
Produces a figure for comparison of unimodal models and multimodal models using different fusion methods.
supporting_confusion_heatmap.R
Yields a figure that shows confusion rates and bias towards either inter- or intrageneric confusion alongside information on sample sizes.
Access information
Data was derived from the following sources:
