Describing biodiversity in the genomics era: Description of a new species of Nearctic Cynipidae gall wasp and its genome

Cite this dataset

Ferreira Pinto Brandão-Dias, Pedro et al. (2023). Describing biodiversity in the genomics era: Description of a new species of Nearctic Cynipidae gall wasp and its genome [Dataset]. Dryad.


Gall wasps (Hymenoptera: Cynipidae) specializing on live oaks in the genus Quercus (subsection Virentes) are a relatively diverse and well-studied community with 14 species described to date, albeit with incomplete information on their biology, life history, and genetic structure. Incorporating an integrative taxonomic approach, we combine morphology, phenology, behavior, genetics, and genomics to describe a new species, Neuroterus valhalla sp. nov. The alternating generations of this species induce galls on the catkins and stem nodes of Quercus virginiana and Q. geminata in the southern United States. We describe both generations in the species’ life cycle, and primarily use samples from a population in the center of Houston, Texas, thus serving as an example of the undescribed biodiversity still present in well-traveled urban centers. In parallel, we present a draft assembly of the N. valhalla genome, providing a direct link between the type specimen and reference genome. The genome of N. valhalla is the smallest reported to date within the tribe Cynipini, providing an important comparative contrast to the otherwise large genome sizes of cynipids. While relatively small, the genome was found to be composed of >64% repetitive elements, including 43% unclassified repeats and 11% retrotransposons. A preliminary ab initio and homology-based annotation revealed 32,005 genes, and a subsequent orthogroup analysis grouped 18,044 of these into 8,186 orthogroups, with some evidence for high levels of gene duplication within Cynipidae. A mitochondrial barcode phylogeny linked each generation of the new species, and a phylogenomic ultraconserved element (UCE) phylogeny indicates that the new species groups with other Nearctic Neuroterus. However, both phylogenies present the genus Neuroterus in North America as polyphyletic.

README: Supplementary data for "Describing biodiversity in the genomics era: Description of a new species of Nearctic Cynipidae gall wasp and its genome"

Files in this folder include:

Fasta file of the Neuroterus valhalla assembly version used in the manuscript.

A gff file containing the gene annotation of the N. valhalla genome. See methods for further details.

A gff file containing the annotation of the repeat sequences in the genome. See methods for further details.

A fasta file containing the repeat sequence families from the N. valhalla genome. See methods for further details.

Tree file for constructing figure 2 (the genetic barcoding COI phylogeny). See methods and supplementary materials for further details on treebuilding methods and sequences used.

A Fasta file containing the alignment of all sequences used in figure 2 (the genetic barcoding COI phylogeny)

Tree file for constructing figure 6 (the UCE barcoding phylogeny). See methods for further details.

UCE phylip file trimmed with Spruceup at the 95% lognormal cutoff. It contains the sequences used to build the UCE phylogeny (figure 6).
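
As a minimal sketch of how the gff files listed above might be inspected, the snippet below counts features by type. It assumes standard nine-column, tab-separated GFF records with `#` comment lines; the function name and the example records are hypothetical and are not taken from the deposited files.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF records per feature type (third column).

    Assumes tab-separated records with nine columns; comment lines
    starting with '#' and malformed lines are skipped.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        counts[fields[2]] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "contig_1\tEVM\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "contig_1\tEVM\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "contig_1\tEVM\texon\t100\t400\t.\t+\t.\tParent=mrna1",
    "contig_1\tEVM\texon\t600\t900\t.\t+\t.\tParent=mrna1",
]
feature_counts = count_feature_types(example)
print(dict(feature_counts))
```

To apply this to a file on disk, an open file handle can be passed in place of the example list, since the function only iterates over lines.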

For further questions or if additional files are required to replicate the results presented, feel free to contact the authors directly.




To sequence N. valhalla’s genome, we extracted whole genomic DNA from a single female from the catkin generation using the DNeasy Blood and Tissue Kit (Qiagen, Germany) as described above. A paired-end sequencing library was then constructed with the Illumina TruSeq kit (Illumina, USA) using the standard adapters, and sequencing was performed at Genewiz (New Jersey, USA) on an Illumina X-Ten sequencing platform. Raw data were submitted to the GenBank Short Read Archive (Accession number SRX7007139). The fastq files were then filtered and trimmed with Trimmomatic v.0.39 (Bolger et al., 2014), first removing sliding windows of 5 nucleotides with an average quality below 20, followed by a hard-trailing trim of nucleotides with quality below 15. Trimmed reads were then assembled de novo with SPAdes v3.14.0 (Bankevich et al., 2012) using k-mer sizes of 27, 37, 47, 57, 67, 77, 87, 97, 107, 117 and 127. Assembly quality and completeness were assessed with QUAST v5.0.2 (Gurevich et al., 2013) and BUSCO v4.0.6 (Simão et al., 2015).
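
The two trimming steps described above can be illustrated with a simplified sketch. This is not Trimmomatic's implementation and may differ from it in edge cases; it only shows the general logic of a 5-nucleotide sliding window with a mean-quality threshold of 20, followed by removal of trailing bases with quality below 15. The function names and the example read are hypothetical.

```python
def sliding_window_keep(quals, window=5, min_avg=20):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean
    quality falls below min_avg (simplified illustration).
    """
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def trailing_keep(quals, min_q=15):
    """Return how many bases remain after dropping low-quality 3' bases."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return end


def trim_read(seq, quals, window=5, min_avg=20, min_q=15):
    """Apply the sliding-window cut, then the trailing trim."""
    keep = sliding_window_keep(quals, window, min_avg)
    seq, quals = seq[:keep], quals[:keep]
    keep = trailing_keep(quals, min_q)
    return seq[:keep], quals[:keep]


# Hypothetical read: ten high-quality bases followed by five poor ones.
seq = "ACGTACGTACGTACG"
quals = [30] * 10 + [10] * 5
trimmed_seq, trimmed_quals = trim_read(seq, quals)
print(trimmed_seq, len(trimmed_quals))
```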

For annotation, repetitive sequences were inferred using RECON (Bao et al., 2002), RepeatScout (Price et al., 2005) and LTR_retriever (Ou & Jiang, 2018), as applied by RepeatModeler (Smit & Hubley, 2015). These sequences were then classified with RepeatClassifier according to RepBase v25.04 (Bao et al., 2015), and their representation in the genome was assessed with RepeatMasker v4.0 (Smit et al., 2015), which also masked the assembly for subsequent gene annotation. Augustus (Stanke et al., 2006) was applied for ab initio gene prediction, using training parameters for Nasonia vitripennis (Walker) (Werren et al., 2010), the most closely related organism with a training set available. For protein-based annotation, we ran Exonerate v.2.2 (Slater & Birney, 2005) against a database containing all known genes from the most closely related organism with an annotated genome, Belonocnema kinseyi Weld (GenBank 17056478). Both annotations were then combined using EvidenceModeler v1.1.1 (Haas et al., 2008), with equal weight given to each approach. The final .gff file was then filtered and analyzed with gFACs v1.0 (Caballero & Wegrzyn, 2019) for gene statistics.

Gene ortholog analysis was performed using OrthoFinder (Emms & Kelly, 2019). The analysis encompassed nine annotated genomes, including Belonocnema kinseyi; every proteome available in the UniProt database belonging to the Infraorder Parasitoida (three total, including the above-mentioned N. vitripennis); Apis mellifera L. and Drosophila melanogaster Meigen, as the best-characterized insect genomes; and Arabidopsis thaliana (L.) Heynh. and Homo sapiens L. as outgroups.
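
Per-species orthogroup counts of the kind reported for this analysis can be tallied with a short sketch like the one below. It assumes a tab-separated orthogroup table with one column per species and comma-separated gene identifiers in each cell, which is how we understand OrthoFinder's orthogroup table to be laid out; the species labels and gene names are hypothetical.

```python
import csv
import io


def summarize_species(table_text, species):
    """Return (genes assigned to orthogroups, orthogroups containing the species).

    Assumes a tab-separated table whose header holds species names and
    whose cells hold comma-separated gene identifiers.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    n_genes = 0
    n_groups = 0
    for row in reader:
        cell = row.get(species) or ""
        genes = [g.strip() for g in cell.split(",") if g.strip()]
        if genes:
            n_groups += 1
            n_genes += len(genes)
    return n_genes, n_groups


# Hypothetical table, for illustration only.
table = (
    "Orthogroup\tsp_A\tsp_B\n"
    "OG0000000\ta1, a2\tb1\n"
    "OG0000001\t\tb2, b3\n"
    "OG0000002\ta3\t\n"
)
genes_a, groups_a = summarize_species(table, "sp_A")
print(genes_a, groups_a)
```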


To build the UCE phylogeny, we downloaded the UCE assemblies from all Cynipini and selected outgroups from other tribes from Blaimer et al. (2020) and assembled the contigs using SPAdes. Additionally, UCE loci were extracted from all available Cynipini genomes on NCBI not already present in the Blaimer et al. (2020) dataset, including N. valhalla, following tutorial III of the PHYLUCE pipeline and using the Hym-V2P probe set developed by Branstetter et al. (2017). All assemblies were aligned using MAFFT and trimmed using Gblocks v0.91b-2 (Castresana, 2000) with the following settings: b1=0.5, b2=0.5, b3=12, b4=7. We then used Spruceup with a 0.95 lognormal distribution cutoff, or manual cutoffs for select samples, to remove potentially misaligned regions, which can produce exaggerated branch lengths (Borowiec, 2019). We selected the 50% complete matrix as the final dataset and inferred the maximum likelihood phylogeny in IQ-TREE, with the best model for each locus selected by ModelFinder. To assess nodal support, we performed 1000 ultrafast bootstrap replicates (UFBoot2, Hoang et al., 2017), along with “-bnni” to reduce the risk of overestimating branch supports, and a Shimodaira-Hasegawa approximate likelihood ratio test (SH-aLRT, Guindon et al., 2010) with 1000 replicates. Only nodes with support values of UFBoot2 ≥ 95 and SH-aLRT ≥ 80 were considered robust.
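
The robustness criterion above can be expressed as a small check on node labels. The sketch assumes each internal node label is written as "SH-aLRT/UFBoot" (for example "98.5/100"), which is how we understand IQ-TREE to report the two values when both tests are run; the function name and example labels are hypothetical.

```python
def is_robust(label, min_alrt=80.0, min_ufboot=95.0):
    """Return True if a node passes both support thresholds.

    Assumes the label has the form 'SH-aLRT/UFBoot', e.g. '98.5/100'.
    """
    alrt_text, ufboot_text = label.split("/")
    return float(alrt_text) >= min_alrt and float(ufboot_text) >= min_ufboot


# Hypothetical node labels, for illustration only.
labels = ["98.5/100", "85.2/97", "90/94", "79.9/100"]
robust_labels = [lab for lab in labels if is_robust(lab)]
print(robust_labels)
```

A node is kept only when both values pass, so a high bootstrap value cannot compensate for a low SH-aLRT value, or vice versa.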