Data and code from: Unsupervised machine learning for species discovery in Eurytoma and Phylloxeroxenus (Hymenoptera: Eurytomidae) parasitoids of oak gall wasps
Data files
May 12, 2026 version files: 119.08 MB
- README.md (4.91 KB)
- species_discovery.zip (119.08 MB)
Abstract
Species discovery (inferring species limits de novo, without a priori hypotheses) from genetic data has become more common as molecular tools have expanded and has been a helpful initial step in tackling the taxonomic impediment for small insects. Often species discovery involves a single locus (e.g., mtCOI), but the accessibility of techniques for large subgenomic sequencing projects (1000s of loci) makes it possible to approach molecular species discovery with more robust datasets. Here, we test unsupervised machine learning (UML) methods for species discovery on a set of UCE loci for a large collection of parasitic wasps reared from North American oak galls, all initially thought to be in the genus Eurytoma Illiger. UML methods produced species hypotheses that largely aligned with those that emerge from a commonly used mtCOI-based species partitioning method, and that also tended to match existing species descriptions. Results revealed a new genus-level association with oak galls (Phylloxeroxenus Ashmead) hidden among the Eurytoma, two distinct lineages of Eurytoma including a new lineage of Eurytoma more closely related to the South American genus Kavayva Zhang, Gates, & Silvestre, evidence for one or more cryptic Eurytoma species, and a mix of generalist and specialist host ranges. We make recommendations for how best to apply UML methods to similar datasets.
species_discovery.zip is a ZIP archive containing the files for UML and STRUCTURE analyses. Its contents are described below.
Overview
The pipeline is broken into several steps, each with a corresponding sub-directory.
Each sub-directory contains the following:
- A specific README that details the purpose of each script, and the expected input/output.
- A _logs directory that stdout, stderr, and logging info will generally be directed to.
- A _scripts directory which contains scripts.
  - Scripts are numbered based on the order in which they are intended to be run.
- In general, script variables, including file paths, are hard-coded near the start of each file. These must be edited to match your own environment before the scripts will run properly.
- All .job files were run on UIowa's Argon HPC which runs CentOS Linux and uses the Sun Grid Engine (SGE) queue scheduler system.
- Since intermediates can be recreated with the provided scripts and UCE alignments, I have not included all intermediate files. However, I have included one or two intermediate files per script as examples.
0_UCE_assembly
Contains the aligned assemblies for each UCE locus for the focal Eurytoma and Phylloxeroxenus samples (mafft-nexus-edge-trimmed-gblocks-clean).
This directory also contains the partition, PHYLIP, and .treefile files for the ML phylogeny (ML_phylogeny).
1_matrix_preparation
First, samples are separated by sex.
Males are processed to remove heterozygous loci.
Females and processed males are recombined and sorted into clade partitions.
Twelve datasets are created for downstream analysis based on 4 clade partitions (ABC, A, B, C) x 3 levels of completeness (75%, 95%, 100%).
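As a rough illustration of how the 12 datasets arise, the sketch below enumerates the partition and completeness combinations. The dataset labels (e.g. `ABC_75`) and the helper `passes_completeness` are hypothetical and are not taken from the provided scripts. The sketch also assumes that "completeness" means the minimum percentage of samples in a partition that must have a given locus, which this README does not state explicitly.

```python
from itertools import product

# Clade partitions and completeness levels as listed in this README.
CLADE_PARTITIONS = ["ABC", "A", "B", "C"]
COMPLETENESS_LEVELS = [75, 95, 100]  # percent


def dataset_names():
    """Return one hypothetical label per partition/completeness combination."""
    return [
        f"{clade}_{pct}"
        for clade, pct in product(CLADE_PARTITIONS, COMPLETENESS_LEVELS)
    ]


def passes_completeness(n_samples_with_locus, n_samples_total, pct):
    """Assumed rule: keep a locus if at least pct% of samples have it."""
    return n_samples_with_locus * 100 >= pct * n_samples_total


names = dataset_names()
print(len(names))  # 4 partitions x 3 levels
print(names[:3])
```

Integer arithmetic is used in the threshold check so that the 100% level is not affected by floating-point rounding.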
2_SNP_extraction
SNPs are extracted from each dataset matrix and reformatted to meet the input requirements for the various downstream analyses. Namely:
- STRUCTURE: STRUCTURE input format with males encoded as diploids with one allele missing at each locus
- UML (RF & t-SNE): STRUCTURE input format with males encoded as diploids that are homozygous at each locus.
- VAE: One-hot encoded format with males encoded as diploids that are homozygous at each locus. Each allele is given a value of 0.5.
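The three male encodings above can be sketched as follows. This is a minimal illustration, not the code in the provided scripts. It assumes a two-rows-per-individual layout, integer allele codes, the commonly used STRUCTURE missing-data value of -9, and an A/C/G/T column order for the one-hot vector. All function names are hypothetical.

```python
MISSING = -9   # assumed missing-data code; STRUCTURE lets the user configure this
BASES = "ACGT"  # assumed column order for the one-hot encoding


def male_for_structure(haploid_alleles):
    """Haploid male as a diploid with the second allele missing at each locus."""
    return [list(haploid_alleles), [MISSING] * len(haploid_alleles)]


def male_for_uml(haploid_alleles):
    """Haploid male as a diploid that is homozygous at each locus."""
    return [list(haploid_alleles), list(haploid_alleles)]


def one_hot_diploid(allele_1, allele_2):
    """Each allele adds 0.5 to its base, so a homozygote scores 1.0 on one base."""
    vec = [0.0] * len(BASES)
    for allele in (allele_1, allele_2):
        vec[BASES.index(allele)] += 0.5
    return vec


print(male_for_structure([1, 3]))  # second row is all MISSING
print(male_for_uml([1, 3]))        # second row repeats the first
print(one_hot_diploid("A", "A"))   # homozygous male at one SNP
print(one_hot_diploid("A", "G"))   # heterozygous female at one SNP
```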
3_STRUCTURE
STRUCTURE analysis is performed on each dataset at K values from 2 to 10, with 5 replicates for each K. Results are summarized using the StructureSelector web server.
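To give a sense of the number of runs involved, the sketch below enumerates the K and replicate combinations for one dataset. The dataset label `ABC_75` and the function name are hypothetical. How the provided .job files actually launch STRUCTURE is not shown here.

```python
from itertools import product

K_VALUES = range(2, 11)   # K = 2 through 10, inclusive
REPLICATES = range(1, 6)  # 5 replicates per K


def structure_runs(dataset):
    """List every (dataset, K, replicate) combination for one dataset."""
    return [(dataset, k, rep) for k, rep in product(K_VALUES, REPLICATES)]


runs = structure_runs("ABC_75")  # hypothetical dataset label
print(len(runs))                 # 9 K values x 5 replicates = 45
print(runs[0], runs[-1])
```

With 12 datasets this comes to 540 STRUCTURE runs in total, which is why the analysis is best suited to a queue scheduler.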
4_UML
To determine a perplexity value for t-SNE, a grid search is performed on the data and evaluated by visual inspection. RF and t-SNE are then performed via a single R script which performs 10 replicates on each dataset.
5_VAE
VAE is performed on the one-hot encoded files via Python script, and results are clustered using DBSCAN in a second script.
collection_info.csv
Alongside the pipeline sub-directories is a CSV file called collection_info.csv that contains collection information for the 102 focal specimens. Column descriptions are as follows:
- col_code - Numeric code for collection event used in lab records
- gall_code - Numeric code for the host gall type used in lab records
- emerge_code - Alphanumeric code for an emergence event used in lab records
- sample_ID - Individual specimen identifier used in the paper and throughout bioinformatic work flow. Format is 'Eur_[col_code][gall_code][emerge_code]'
- true_sex - The verified sex of the specimen. m = male, f = female
- clade - Alphanumeric code corresponding to the clade which the specimen belongs to on the ML phylogeny.
- UCE_loci - Total number of UCE loci recovered during processing
- het_loci - Number of UCE that were categorized as heterozygous following phasing. Heterozygous loci were removed from male samples for subsequent analysis.
- hom_loci - Number of UCE that were categorized as homozygous following phasing.
- morphospecies - The identity of the specimen based on morphological assessment of proxy specimens for the same alphanumeric clade.
- gall - Species of host gall the specimen was reared from.
- gen - Sexual or asexual generation of the gall inducer. agamic = gall produced by asexual generation, sexgen = gall produced by sexual generation.
- plant organ - Plant tissue/organ where the gall was located.
- col_date - Date the gall was collected.
- emerge_date - Date the specimen emerged from the gall.
- emerge_year - Date the specimen emerged from the gall, year only.
- tree section - Section of oak that the host tree belonged to.
- tree_species - Species name of the host tree.
- state - Two letter postal abbreviation of US state where gall was collected.
- location - Name of approximate location where gall was collected.
- lat - Approximate latitude of collection location in decimal degree format.
- long - Approximate longitude of collection location in decimal degree format.
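A quick way to check the file after download is to tally specimens by clade and sex using the column names described above. The sketch below is only an illustration: the three rows are made-up placeholders, not real records, and the function name is hypothetical. To use it on the real file, pass an open handle to collection_info.csv instead of the in-memory example.

```python
import csv
import io
from collections import Counter


def tally_by_clade_and_sex(handle):
    """Count rows per (clade, true_sex) pair from a CSV file handle."""
    reader = csv.DictReader(handle)
    return Counter((row["clade"], row["true_sex"]) for row in reader)


# Made-up placeholder rows for illustration only; not real specimens.
demo = io.StringIO(
    "sample_ID,true_sex,clade\n"
    "Eur_demo1,m,A\n"
    "Eur_demo2,f,A\n"
    "Eur_demo3,f,B\n"
)

counts = tally_by_clade_and_sex(demo)
for (clade, sex), n in sorted(counts.items()):
    print(clade, sex, n)
```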
