BIMAGES: Bivalve images for morphological analysis and genetic estimation study
Data files
Jul 25, 2024 version files 35.21 MB
- edge_lengths.csv (44.67 KB)
- meta.pkl (17.66 MB)
- meta.tsv (17.49 MB)
- README.md (10.88 KB)
Abstract
Reconstructing the tree of life and understanding the relationships of taxa are core questions in evolutionary and systematic biology. The main advances in this field in the last decades were derived from molecular phylogenetics; however, for most species, molecular data are not available. Here, we explore the applicability of two deep learning methods – supervised classification approaches and unsupervised similarity learning – to infer organism relationships from specimen images. As a basis, we assembled an image dataset covering 4144 bivalve species belonging to 74 families across all orders and subclasses of the extant Bivalvia, with molecular phylogenetic data being available for all families and a complete taxonomic hierarchy for all species. The suitability of this dataset for deep learning experiments was evidenced by an ablation study resulting in almost 80% accuracy for identifications on the species level. Three sets of experiments were performed using our dataset. First, we included taxonomic hierarchy and genetic distances in a supervised learning approach to obtain predictions on several taxonomic levels simultaneously. Here, we stimulated the model to consider features shared between closely related taxa to be more critical for their classification than features shared with distantly related taxa, imprinting phylogenetic and taxonomic affinities into the architecture and training procedure. Second, we used transfer learning and similarity learning approaches for zero-shot experiments to identify the higher-level taxonomic affinities of test species that the models had not been trained on. The models assigned the unknown species to their respective genera with approximately 48% and 67% accuracies. Lastly, we used unsupervised similarity learning to infer the relatedness of the images without prior knowledge of their taxonomic or phylogenetic affinities. 
The results indicated a reasonable similarity between visual appearance and genetic relationships at the higher taxonomic levels. The correlation was 0.6 for the most species-rich subclass, the Imparidentia, and ranged from 0.5 to 0.7 for the orders with the most images. Overall, the correlation between visual similarity and genetic distances at the family level was 0.78. However, fine-grained reconstructions based on the observed correlation, such as sister-taxa relationships, require further work. Overall, our results broaden the applicability of automated taxon identification systems and provide a new avenue for estimating phylogenetic relationships from specimen images.
Inferring Taxonomic Affinities and Genetic Distances Using Morphological Features Extracted from Specimen Images: a Case Study with a Bivalve dataset
This preprocessed fine-grained labeled dataset contains 71,888 images of 4,144 species in 884 genera, 74 families, 26 orders, and six subclasses; the phylogenetic study by Bieler et al. (2014) covers all 74 families.
Description of the data and file structure
Metadata and labels are located inside the meta.tsv file. The code is located in the respective code folder.
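As a minimal sketch of how the tab-separated metadata could be read with pandas: the column names and rows below are hypothetical placeholders, not the actual schema of meta.tsv, and `load_metadata` is a helper introduced only for this illustration.

```python
import io

import pandas as pd


def load_metadata(source):
    """Read a tab-separated metadata table into a DataFrame."""
    return pd.read_csv(source, sep="\t")


# Tiny in-memory stand-in for meta.tsv; column names and values are hypothetical.
example_tsv = io.StringIO(
    "image_id\tspecies\tgenus\tfamily\n"
    "img_0001\tspecies_a\tgenus_x\tfamily_p\n"
    "img_0002\tspecies_a\tgenus_x\tfamily_p\n"
    "img_0003\tspecies_b\tgenus_y\tfamily_p\n"
)

meta = load_metadata(example_tsv)

# Count how many image records each species label has.
images_per_species = meta.groupby("species").size()
print(images_per_species)
```

With the real file, the call would presumably be `load_metadata("meta.tsv")`; inspect `meta.columns` first to see which label columns are actually present.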
Sharing/Access information
All images can be accessed through the h5 database files.
Data was derived from the following sources:
Code/Software
Dataset File Description
Please refer to the readme.txt files within the zip archives for detailed instructions.
Contents
- images.zip: contains the images.
- h5.zip: contains all experimental splits in h5 format.
- meta.pkl: Pickled pandas object containing the metadata.
- meta.tsv: Tab-separated file containing all metadata.
- edge_lengths.csv: Comma-separated file containing the distances published by Bieler et al. (2014).
- code_task_1.zip:
  - Multi-level taxonomic identification task.
  - Includes code for training and evaluating sequential and parallel multi-head networks, with and without the genetic distances as a target.
- code_task_2.zip:
  - Zero-shot categorization task.
  - Evaluates network performance in categorizing an unseen taxon within its higher taxonomic group (e.g., an unseen genus in the correct family).
  - Note: This file is large due to the extensive metadata split files.
- genus_splits.zip:
  - Additional metadata split files for the zero-shot categorization task.
  - The archive contains the metadata splits needed for training in task 2.
  - Extract the archive to ./code_task_2/genus_splits
- code_task_3.zip:
  - Similarity learning experiment code.
  - Focuses on learning visual similarity and comparing it with the genetic distances in a regression analysis.
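The extraction step for genus_splits.zip can be sketched with the Python standard library as follows. This is an illustration under assumptions, not part of the published code: `extract_archive` is a hypothetical helper, and the demonstration builds a throwaway archive with a placeholder member so that it runs without the real files.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract all members of a zip archive into target_dir and list the files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # Return the extracted files as paths relative to the target directory.
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )


# Demonstration on a temporary archive; the member name is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "genus_splits.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("split_0.tsv", "placeholder\n")
    extracted = extract_archive(demo_zip, Path(tmp) / "code_task_2" / "genus_splits")

print(extracted)
```

For the real archive, the equivalent call would be `extract_archive("genus_splits.zip", "./code_task_2/genus_splits")`, run from the directory that contains the unpacked code_task_2 folder.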
The image dataset was obtained from three main sources: data aggregation platforms such as GBIF and iDigBio, natural history museums, and websites of shell dealers and private enthusiasts (see Appendix in manuscript).
To maximize the images' information density and reduce noise and potential bias caused by objects other than bivalves, all images were subject to an automated image segmentation process to decompose them into individual items. Only images showing the inner or outer lateral side of the shells were kept. When necessary, images were rotated in steps of 90° to bring them as close as possible to the standard scientific orientation, with the hinge line up.
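Rotation in steps of 90° can be illustrated with NumPy as below. This is only a sketch of the rotation itself: the segmentation step and the way the correct orientation was chosen are not described here, and `rotate_quarter_turns` is a hypothetical helper name.

```python
import numpy as np


def rotate_quarter_turns(image, quarter_turns):
    """Rotate an (H, W, C) image array counter-clockwise by quarter_turns * 90 degrees."""
    return np.rot90(image, k=quarter_turns % 4, axes=(0, 1))


# Dummy image: height 2, width 3, 3 channels.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# The four candidate orientations reachable in 90-degree steps.
candidates = {90 * k: rotate_quarter_turns(image, k) for k in range(4)}
for angle, rotated in candidates.items():
    print(angle, rotated.shape)
```

Because only quarter turns are used, no pixel interpolation takes place; height and width simply swap for the 90° and 270° cases.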
One of the authors (SK) evaluated the identification of each image based on his taxonomic expertise and removed all images he considered incorrectly identified. To update the taxonomic assignment of each species and to re-assign synonymized names, all names were checked against the World Register of Marine Species (WoRMS), and each image was labeled according to the currently accepted name and taxonomic hierarchy (species, genus, and family) indicated by WoRMS. Images of species not found in WoRMS were removed.
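The relabeling logic described above can be sketched as a simple lookup followed by a filter. The example does not query WoRMS: the `accepted_name` table, the column names, and all taxon names are hypothetical placeholders standing in for the result of such a check.

```python
import pandas as pd

# Hypothetical lookup from a recorded name to its currently accepted name.
accepted_name = {
    "Genus_a species_old": "Genus_b species_new",  # synonym re-assigned
    "Genus_b species_new": "Genus_b species_new",  # already accepted
}

# Hypothetical image records; the third name has no entry in the lookup.
records = pd.DataFrame(
    {
        "image_id": ["img_1", "img_2", "img_3"],
        "recorded_name": [
            "Genus_a species_old",
            "Genus_b species_new",
            "Genus_c species_unknown",
        ],
    }
)

# Map each recorded name to its accepted name; unmatched names become NaN.
records["species"] = records["recorded_name"].map(accepted_name)

# Drop images whose name was not found in the lookup.
cleaned = records.dropna(subset=["species"]).reset_index(drop=True)
print(cleaned)
```

After this step both remaining images carry the same accepted species label, and the image with the unmatched name has been removed.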
The assignment to the taxonomic level of the order follows WoRMS, and if no order was available, the superfamily indicated in WoRMS was used instead. Our assignment to a subclass does not follow WoRMS; instead, we applied the more traditional classification into Protobranchia, Pteriomorphia, Palaeoheterodonta, Archiheterodonta, Anomalodesmata, and Imparidentia, used in many recent phylogenetic publications on the Bivalvia.