Integrating UCE phylogenomics with traditional taxonomy reveals a trove of New World Syscia species (Formicidae, Dorylinae)

Branstetter, Michael G.; Longino, John T.

Published Dec 29, 2021 on Dryad. https://doi.org/10.5061/dryad.08kprr50s

Data files

Dec 29, 2021 version files: 713.76 MB

COI-Analysis-Files.zip (31.42 KB)
Contigs-Trinity.zip (659.41 MB)
README.txt (1.31 KB)
UCE-Alignments-Unfiltered-FASTA.zip (19.46 MB)
UCE-Analysis-Files.zip (34.87 MB)

Abstract

The ant genus Syscia is part of the cryptic ant fauna inhabiting leaf litter and rotten wood in the Asian and American tropics. It is a distinct clade within the Dorylinae, the subfamily from which army ants arose. Prior to this work the genus comprised seven species, each known from a single or very few collections. Extensive collecting in Middle America revealed an unexpected and challenging diversity of morphological forms. Locally distinct forms could be identified at many sites but assignment of specimens to species spanning multiple sites was problematic. To improve species delimitation, Ultra-Conserved Element (UCE) phylogenomic data were sequenced for all forms, both within and among sites, and a phylogeny was inferred. Informed by phylogeny, species delimitation was based on monophyly, absence of within-clade sympatry, and a subjective degree of morphological uniformity. UCE phylogenomic results for 130 specimens were complemented by analysis of mitochondrial COI (DNA barcode) data for an expanded taxon set. The resulting taxonomy augments the number of known species in the New World from 3 to 57. We describe and name 31 new species, and 23 species are assigned morphospecies codes pending improved specimen coverage. Queens may be fully alate or brachypterous, and there is a wide variety of intercaste female forms. Identification based on morphology alone is very difficult due to continuous character variation and high similarity of phylogenetically distant species. An identification aid is provided in the form of a set of distribution maps and standard views, with species ordered by size.

We have deposited data and results files that support the molecular phylogenetic analyses presented in the study. Raw Illumina reads and contigs representing UCE loci have been deposited at the NCBI Sequence Read Archive and GenBank, respectively (BioProject# PRJNA615631). All newly generated COI sequences have been deposited at GenBank (MT267540-MT267668). Here we have deposited the concatenated UCE matrix, the COI matrix, all Trinity contigs, all tree files, unfiltered alignment files, and additional data analysis files (partitioning schemes, log files). The methods used to generate these data are described below and in the accompanying paper.

Methods

DNA sequence generation: We selected 130 specimens for inclusion in molecular phylogenomic analysis (Table S1): 128 Syscia and two outgroup specimens from the genus Ooceraea. All sequence data were newly generated for this study, except for five samples, for which data were extracted from Oxley et al. (2014; Genome), Branstetter et al. (2017), and Borowiec (2019) (see Table S1). A voucher was designated for each extraction. The voucher may be the extracted specimen itself (non-destructive DNA extraction) or, with varying degrees of subjectivity, a specimen from the same nest, collection series, or, rarely, population. Full voucher specimen details are in Supplementary Material, Table S2.

To examine species boundaries and phylogenetic relationships among species and populations, we employed the UCE approach to phylogenomics (Faircloth et al. 2012, Faircloth et al. 2015, Branstetter et al. 2017), a method that combines targeted enrichment of ultraconserved elements (UCEs) with multiplexed, next-generation sequencing. All UCE molecular work was performed following the UCE methodology described in Branstetter et al. (2017). Briefly, the process involves DNA extraction, sample QC, DNA fragmentation (400-600 bp), library preparation, library pooling (equimolar pools of 10 or 11 samples), UCE enrichment, qPCR quantification, final pooling (up to 102 samples per sequencing pool), and sequencing. All sequencing was performed on an Illumina HiSeq 2500 instrument (2x125 bp v4 chemistry; Illumina Inc., San Diego, CA) by the University of Utah genomics core facility. To enrich UCE loci, we used an ant-customized bait set (“ant-specific hym-v2”) that includes 9,898 baits (120 mer) targeting 2,524 UCE loci shared across Hymenoptera and a set of legacy markers (data not used) (Branstetter et al. 2017). The ability of this bait set to successfully enrich UCE loci and resolve relationships in ants has been demonstrated in several studies (Branstetter et al. 2017, Pierce et al. 2017, Ward and Branstetter 2017, Blaimer et al. 2018, Branstetter and Longino 2019, Longino and Branstetter 2020).

UCE matrix assembly: After sequencing, the University of Utah bioinformatics core demultiplexed the data using bcl2fastq v1.8 (Illumina, 2013) and made the data available for download. Once received, the sequence data were cleaned, assembled, and aligned using PHYLUCE v1.6 (Faircloth 2016), which includes a set of wrapper scripts that facilitates batch processing of large numbers of samples. Within the PHYLUCE environment, we used the programs ILLUMIPROCESSOR v2.0 (Faircloth 2013), which incorporates TRIMMOMATIC (Bolger et al. 2014), for quality trimming raw reads, TRINITY v2013-02-25 (Grabherr et al. 2011) for de novo assembly of reads into contigs, and LASTZ v1.0 (Harris 2007) for identifying UCE contigs from all contigs. All optional PHYLUCE settings were left at default values for these steps. For the bait sequences file needed to identify and extract UCE contigs, we used the ant-specific hym-v2 bait file. To calculate assembly statistics, including sequencing coverage, we used scripts from the PHYLUCE package (phyluce_assembly_get_trinity_coverage and phyluce_assembly_get_trinity_coverage_for_uce_loci) that call the programs BWA v0.7.7 (Li and Durbin 2010) and GATK v3.8 (McKenna et al. 2010).

After extracting UCE contigs, we aligned each UCE locus using a stand-alone version of the program MAFFT v7.130b (Katoh and Standley 2013) and the L-INS-i algorithm. We then used a PHYLUCE wrapper to trim flanking regions and poorly aligned internal regions using the program GBLOCKS (Talavera and Castresana 2007). The program was run with reduced stringency parameters (b1:0.5, b2:0.5, b3:12, b4:7). We then used another PHYLUCE script to filter the initial set of alignments so that each alignment was required to include data for ≥ 90% of taxa. This resulted in a final set of 1,388 alignments and 1,035,633 bp of sequence data for analysis. To calculate summary statistics for the final data matrix, we used a script from the PHYLUCE package (phyluce_align_get_align_summary_data). Information related to UCE sequencing and assembly results can be found in Supplementary Material, Table S3. All steps, including the phylogenetic analyses described below, were performed on a multicore Linux workstation (40 CPUs and 512 GB of memory).
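
The completeness filter described above can be illustrated with a minimal Python sketch. This is not the PHYLUCE script itself: the function names, the dictionary-based alignment representation, and the toy locus names are assumptions made for illustration. The threshold is expressed as an integer percentage to avoid floating-point edge cases.

```python
# Hypothetical sketch of a taxon-completeness filter (not the PHYLUCE implementation).
# An alignment is represented as {taxon_name: aligned_sequence}.

MISSING = set("-?Nn")  # characters treated as "no data" (assumption)


def taxa_with_data(alignment):
    """Return the taxa whose aligned sequence contains at least one real base."""
    return {
        name
        for name, seq in alignment.items()
        if any(ch not in MISSING for ch in seq)
    }


def filter_by_completeness(alignments, total_taxa, min_percent=90):
    """Keep loci that include data for at least min_percent of all taxa."""
    kept = {}
    for locus, aln in alignments.items():
        n_present = len(taxa_with_data(aln))
        # Integer comparison: n_present / total_taxa >= min_percent / 100
        if n_present * 100 >= min_percent * total_taxa:
            kept[locus] = aln
    return kept


# Toy data: 10 taxa in total, two made-up loci.
taxa = [f"taxon{i}" for i in range(10)]
toy = {
    "uce-1": {t: "ACGT" for t in taxa[:9]},  # 9 of 10 taxa present
    "uce-2": {t: "ACGT" for t in taxa[:8]},  # 8 of 10 taxa present
}
kept = filter_by_completeness(toy, total_taxa=10)
print(sorted(kept))
```

With 130 taxa and a 90% threshold, a locus would need data for at least 117 taxa to be retained under this logic.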

Phylogenomic analysis: To partition the UCE data for phylogenetic analysis, we used the Sliding-Window Site Characteristics based on entropy method (SWSC-EN; Tagliacollo and Lanfear 2018), which breaks UCE loci into three regions, corresponding to the right flank, core, and left flank. The theoretical underpinning of the approach comes from the observation that UCE core regions are conserved, while the flanking regions become increasingly variable (Faircloth et al. 2012). After running the SWSC-EN algorithm, the resulting data subsets were analyzed using PARTITIONFINDER2 (Lanfear et al. 2012, Lanfear et al. 2017). For this analysis we used the rclusterf algorithm, AICc model selection criterion, and the GTR+G model of sequence evolution. The resulting best-fit partitioning scheme included 1,126 data subsets and had a significantly better log likelihood than alternative partitioning schemes (SWSC-EN: -5,608,249.502; By Locus: -5,639,169.680; Unpartitioned: -5,731,679.666).
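
To make the comparison of partitioning schemes concrete, the short Python sketch below computes how far each alternative scheme falls below the best one, using only the log-likelihood values reported above. Note that the study selected schemes by AICc, which also penalizes the number of parameters; parameter counts are not given here, so this sketch compares raw log likelihoods only.

```python
# Log-likelihood values as reported in the text above.
log_likelihoods = {
    "SWSC-EN": -5608249.502,
    "By Locus": -5639169.680,
    "Unpartitioned": -5731679.666,
}

# The best scheme has the highest (least negative) log likelihood.
best = max(log_likelihoods, key=log_likelihoods.get)

# Difference between the best scheme and each scheme, in log-likelihood units.
deltas = {
    name: round(log_likelihoods[best] - lnl, 3)
    for name, lnl in log_likelihoods.items()
}

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name}: {delta} log-likelihood units below {best}")
```

The SWSC-EN scheme is roughly 30,920 log-likelihood units better than partitioning by locus and roughly 123,430 units better than the unpartitioned analysis.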

Using the SWSC-EN partitioning scheme, we inferred phylogenetic relationships of Syscia with the likelihood-based program IQ-TREE v1.5.5 (Nguyen et al. 2015). For the analysis we selected the “-spp” option for partitioning (linked branch lengths but allowing each partition to have its own evolutionary rate) and the GTR+F+G4 model of sequence evolution. To assess branch support, we performed 1,000 replicates of the ultrafast bootstrap approximation (UFB) (Minh et al. 2013, Hoang et al. 2018) and 1,000 replicates of the branch-based, SH-like approximate likelihood ratio test (Guindon et al. 2010). For these support measures, values ≥ 95% and ≥ 80%, respectively, signal that a clade is supported.
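
The support thresholds above can be applied to node labels with a small helper. The sketch below is illustrative only: the function names are hypothetical, the label is assumed to be in the order SH-aLRT/UFBoot (the order IQ-TREE typically writes when both tests are run), and requiring both thresholds to be met at once is an assumption, since the text gives one threshold per measure.

```python
# Thresholds taken from the text: UFB >= 95% and SH-like aLRT >= 80%.
UFBOOT_MIN = 95.0
SH_ALRT_MIN = 80.0


def parse_label(label):
    """Split a node label like '98.2/100' into (sh_alrt, ufboot) floats.

    Assumes the order SH-aLRT/UFBoot.
    """
    sh_alrt, ufboot = (float(part) for part in label.split("/"))
    return sh_alrt, ufboot


def is_supported(label):
    """Return True when both support values meet their thresholds (assumption)."""
    sh_alrt, ufboot = parse_label(label)
    return ufboot >= UFBOOT_MIN and sh_alrt >= SH_ALRT_MIN


# Made-up example labels.
for label in ["98.2/100", "85/90", "75.1/99"]:
    print(label, is_supported(label))
```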

COI barcode analysis: Due to the high abundance of mitochondrial DNA in samples and the less-than-perfect efficiency of target enrichment methods, Cytochrome Oxidase I (COI) sequence data, and sometimes entire mitochondrial genomes (see Ströher et al. 2016), are often generated as a byproduct of the UCE sequencing process. To provide a separate assessment of species identities, possibly with more samples included, we extracted COI sequences from our UCE enriched samples and combined them with Syscia COI sequences downloaded from the BOLD database (Ratnasingham and Hebert 2007) (Accessed 16 May 2019). To extract COI from UCE data, we downloaded a complete 658 bp barcode sequence of a Costa Rican Syscia specimen from BOLD (Process ID ACGAE095-10, identified by us as S. benevidesae, one of the new species in this work) and used this as the bait input sequence for a PHYLUCE program (phyluce_assembly_match_contigs_to_barcodes) that extracts COI sequence from bulk sets of contigs.

After extracting COI sequence from UCE sample data, we downloaded accessible barcode sequences from BOLD following a series of steps. First, using the BOLD workbench interface, we searched for all records matching the taxonomy search term “Syscia” or “Cerapachys”. We then copied all of the resulting Barcode Index Numbers (BINs) and performed a second search using these numbers in the identifiers field. This approach recovers taxonomically mislabeled samples because BINs group sequences into units by sequence similarity, not name (Ratnasingham and Hebert 2013). All returned sequences were downloaded, examined, and subsequently filtered to remove Old World specimens and entries with no sequence data. We also removed a misidentified sample from Madagascar and a sequence mined from GenBank that had no accompanying specimen data. Because some of the remaining sequences included private, unpublished data, we contacted data owners for permission to use the private sequences in our analyses.
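
The two-step search logic (search by name, then expand by BIN) can be sketched in Python on an in-memory list of records. This does not call the BOLD workbench or any BOLD API; the record fields, process IDs, and BIN labels below are invented placeholders used only to show why the second search recovers mislabeled samples.

```python
# Invented placeholder records; field names and values are assumptions.
records = [
    {"process_id": "REC-001", "taxon": "Syscia", "bin": "BIN-A"},
    {"process_id": "REC-002", "taxon": "Cerapachys", "bin": "BIN-B"},
    {"process_id": "REC-003", "taxon": "Formicidae", "bin": "BIN-A"},  # coarse label, same BIN as REC-001
    {"process_id": "REC-004", "taxon": "Ponera", "bin": "BIN-C"},
]


def expand_by_bin(records, search_terms):
    """Return name matches plus every record sharing a BIN with a name match."""
    # Step 1: records whose taxonomy label matches a search term.
    name_hits = [r for r in records if r["taxon"] in search_terms]
    # Step 2: collect the BINs of those hits and search again by BIN.
    bins = {r["bin"] for r in name_hits if r["bin"]}
    return [
        r for r in records
        if r["taxon"] in search_terms or r["bin"] in bins
    ]


hits = expand_by_bin(records, {"Syscia", "Cerapachys"})
print([r["process_id"] for r in hits])
```

In this toy example the third record is recovered through its BIN even though its name does not match either search term.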

We combined the final set of BOLD sequences with the successfully extracted COI sequences from UCE samples and aligned the data using MAFFT. We visually inspected the resulting alignment for signs of pseudogenes/numts (e.g. presence of stop codons, indels, or highly divergent sequence) or other anomalies using MESQUITE v3.51 (Maddison and Maddison 2018). The final matrix was partitioned by codon position and analyzed with IQ-TREE using GTR+F+G4, 1,000 ultrafast bootstrap replicates, and 1,000 SH-like replicates. Following a preliminary analysis of all samples, we discovered that a set of 79 putative “Cerapachys” samples actually belonged to the phylogenetically distinct genus Neocerapachys. Consequently, we removed these samples from our data set and updated determinations in BOLD. Sample information for the final set of 86 BOLD specimens included in our analysis is available in Supplementary Material, Table S4.
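
The inspection for stop codons was done visually in MESQUITE. As a rough programmatic analogue, the Python sketch below scans an ungapped sequence in all three reading frames. It assumes the invertebrate mitochondrial genetic code, in which TAA and TAG are the stop codons and TGA encodes tryptophan. The function names and test sequences are made up for illustration.

```python
# Stop codons under the invertebrate mitochondrial code (assumed to apply here).
INVERT_MT_STOPS = frozenset({"TAA", "TAG"})


def stop_codon_positions(seq, frame=0, stops=INVERT_MT_STOPS):
    """Return 0-based start positions of stop codons in the given frame.

    Expects an ungapped nucleotide sequence; only complete codons are checked.
    """
    seq = seq.upper()
    return [
        i
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in stops
    ]


def frames_without_stops(seq):
    """Return the reading frames (0, 1, 2) that contain no stop codon."""
    return [frame for frame in range(3) if not stop_codon_positions(seq, frame)]


# Made-up examples: a sequence with an open frame, and one with stops in every frame.
print(frames_without_stops("ATGTAAGGC"))
print(frames_without_stops("TAAATAAATAAA"))
```

A barcode fragment with no stop-free reading frame would be a candidate pseudogene/numt under this check. Indels and unusually divergent sequence still need to be assessed separately.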
