Data from: Integrating deep learning derived morphological traits and molecular data for total-evidence phylogenetics: lessons from digitized collections
Data files
Dec 18, 2024 version, files total 1.08 MB
- fasta_files.zip (107.82 KB)
- Figure_7_Trees.zip (277.88 KB)
- GenBank_accession_numbers_of_all_sequences.xlsx (15.16 KB)
- molecular_only_best.nex (4.05 KB)
- README.md (3.01 KB)
- reference_phylogeny.nex (1.81 KB)
- rove_stratified_split.csv (530.86 KB)
- Table_1_Trees.zip (45.05 KB)
- Table_3_Trees.zip (86.94 KB)
- total_evidence_best_tripd4.nex (5.61 KB)
Abstract
Deep learning has previously shown success in automatically generating morphological traits which carry a phylogenetic signal. In this paper, we explore combining molecular data with deep learning derived morphological traits from images of pinned insects to generate total-evidence phylogenies and we reveal challenges. Deep learning derived morphological traits, while informative, underperform when used in isolation compared to molecular analyses. However, they can improve molecular results in total evidence settings. We use a dataset of rove beetle images to compare the effect of different dataset splits and deep metric loss functions on morphological and total evidence results. We find a slight preference for the cladistic dataset split and contrastive loss function. Additionally, we explore the effect of varying the number of genes used in inference and find that different gene combinations provide the best results when used on their own vs in total evidence analysis. Despite the promising nature of integrating deep learning techniques with molecular data, challenges remain regarding the strength of the phylogenetic signal and the resource demands of data acquisition. We suggest that future work focus on improved trait extraction and the development of disentangled networks to better interpret the derived traits, thus expanding the applicability of these methods in phylogenetic studies.
README: Integrating Deep Learning Derived Morphological Traits and Molecular Data for Total-Evidence Phylogenetics: Lessons from Digitized Collections
https://doi.org/10.5061/dryad.9cnp5hqqq
This dataset includes all supplemental material for the paper 'Integrating Deep Learning Derived Morphological Traits and Molecular Data for Total-Evidence Phylogenetics: Lessons from Digitized Collections'.
Description of the data and file structure
The following gives a description of the files included in this dataset:
- GenBank_accession_numbers_of_all_sequences.xlsx: This Excel file contains the GenBank accession numbers for all molecular data used in this paper. Column A lists the species; the remaining columns correspond to the 7 genes included. Not all genes were available for all species, as discussed in the paper.
- rove_stratified_split.csv: This CSV file contains the dataset split for the 'Stratified' dataset compared in the results, so that the same results can be reproduced.
- reference_phylogeny.nex: This nexus file contains the reference tree for the Rove-Tree-11 dataset at a genus level.
- total_evidence_best_tripd4.nex: This nexus file contains the tree for the best total evidence model, trained using the triplet loss function with a seed of 4.
- molecular_only_best.nex: This nexus file contains the tree for the best molecular model, trained using only the molecular data.
- Table_S1_PCA.pdf: Contains a table of alternative CMean values obtained after PCA is applied to the traits.
- Table_1_Trees.zip: This zip file contains all trees associated with the results reported in Table 1 of the paper. The file naming convention is {DATASET}*{MODEL}*{MODEL PARAM}*{LOSS FUNCTION}*{DATE}.tre. The dataset 'ROVEGENUS' specifies the clade dataset, and 'ROVEGENUSSTRATIFIED' specifies the stratified dataset.
- Table_3_Trees.zip: This zip file contains all trees associated with the results reported in Table 3 of the paper. The file naming convention is {DATASET}*{MODEL}*{MODEL PARAM}*{LOSS FUNCTION}*{DATE}.tre. The dataset 'ROVEGENUS' specifies the clade dataset, and 'ROVEGENUSSTRATIFIED' specifies the stratified dataset.
- Figure_7_Trees.zip: This zip file contains two folders: one (molecular only ablations) for inference using only the molecular data, and one (total evidence ablations) for the total evidence results. Together they hold all the trees used in the gene ablation study in Figure 7. Each file is named after the genes included in the inference model, e.g. '28S_ArgK.tre' used genes 28S and ArgK.
- fasta_files.zip: This zip file contains the aligned sequences for each genus. The file name indicates which gene is being used, e.g. 28S.fasta is the 28S gene. MOLECULAR_ONLY.fasta contains all genes concatenated.
- Example_TNT_Script.zip: This zip file contains an example TNT script and an example data file, demonstrating how the maximum parsimony inference was run.
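The gene ablation trees in Figure_7_Trees.zip encode their gene set in the file name, which makes it straightforward to group trees by the genes used. The following is a minimal Python sketch of that lookup, not code from the paper. The function name `genes_from_tree_name` is hypothetical, and the sketch assumes that gene names are joined by underscores (as in '28S_ArgK.tre') and that no gene name itself contains an underscore.

```python
from pathlib import PurePath


def genes_from_tree_name(file_name: str) -> list[str]:
    """Return the gene names encoded in an ablation tree file name.

    Assumption: genes are separated by '_' and the extension is dropped,
    so '28S_ArgK.tre' yields ['28S', 'ArgK'].
    """
    stem = PurePath(file_name).stem  # strips any folder and the '.tre' suffix
    return stem.split("_")


# The folder name comes from the README; the file name is the README's example.
genes = genes_from_tree_name("total evidence ablations/28S_ArgK.tre")
print(genes)
```

If any gene name in the archive does contain an underscore, the split would need to be replaced by a lookup against the known list of gene names.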
Methods
Images of specimens associated with this dataset can be found in the Rove-Tree-11 dataset (https://doi.org/10.17894/ucph.39619bba-4569-4415-9f25-d6a0ff64f0e3).
Molecular data were gathered from GenBank and aligned using MAFFT 7. The original GenBank accession numbers are provided. Alignments were concatenated with FASconCAT-G. The partition scheme and model selection were obtained using PartitionFinder 2.1.1.
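To illustrate what the concatenation step produces, the sketch below joins per-gene alignments into one supermatrix and records the column range of each gene. It is a plain Python stand-in for illustration only, not FASconCAT-G and not the authors' pipeline. The function names, the taxon labels 'GenusA' and 'GenusB', the toy sequences, and the choice of '?' to pad genes missing for a taxon are all assumptions.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {taxon: sequence}; the taxon is the first header word."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {taxon: "".join(parts) for taxon, parts in records.items()}


def concatenate(alignments: dict[str, dict[str, str]], missing: str = "?"):
    """Concatenate aligned genes; pad taxa lacking a gene with `missing`."""
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    partitions = []  # (gene, first column, last column), 1-based inclusive
    start = 1
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, missing * length)
        partitions.append((gene, start, start + length - 1))
        start += length
    return supermatrix, partitions


# Toy input: GenusB has no ArgK sequence, mirroring the incomplete gene coverage.
toy = {
    "28S": read_fasta(">GenusA\nACGT\n>GenusB\nAC-T\n"),
    "ArgK": read_fasta(">GenusA\nGGA\n"),
}
matrix, parts = concatenate(toy)
print(matrix)
print(parts)
```

The recorded column ranges are the kind of per-gene boundaries that a partition analysis takes as its starting blocks.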
Trees were obtained via Maximum Parsimony using TNT. Example TNT scripts are provided.