
Amino acids (AA) all genes for: Beyond Drosophila: resolving the rapid radiation of schizophoran flies with phylotranscriptomics


Bayless, Keith et al. (2020), Amino acids (AA) all genes for: Beyond Drosophila: resolving the rapid radiation of schizophoran flies with phylotranscriptomics, Dryad, Dataset,



The largest radiation of animal life since the end-Cretaceous extinction event 66 million years ago is that of schizophoran flies: a third of fly diversity, including Drosophila lab fruit flies, house flies, and many other well and poorly known true flies. Rapid diversification has hindered previous attempts to elucidate the phylogenetic relationships among major schizophoran clades. A robust phylogenetic hypothesis for the major lineages containing these 55,000 described species is critical to understanding the processes that contributed to the diversity of these agriculturally, medically, and forensically important flies. We use protein-encoding sequence data from transcriptomes, including 3,145 genes from 70 species, representing all superfamilies, to improve the resolution of this previously intractable phylogenetic challenge.


Our results support a paraphyletic acalyptrate grade including a monophyletic Calyptratae and the monophyly of half of the acalyptrate superfamilies. The primary branching framework of Schizophora is well supported for the first time, revealing the primarily parasitic Pipunculidae and Sciomyzoidea s.l. as successive sister groups to the remaining Schizophora. Ephydroidea, Drosophila’s superfamily, is the sister group of Calyptratae. Sphaeroceroidea has modest support as the sister to all non-sciomyzoid Schizophora. We define two novel lineages corroborated by morphological traits: the Modified Oviscapt Clade, containing Tephritoidea, Nerioidea, and other families, and the Cleft Pedicel Clade, containing Calyptratae, Ephydroidea, and other families. Support values remain low among a challenging subset of lineages, including Diopsidae. The placement of these families remained uncertain in both concatenated maximum likelihood and multi-species coalescent approaches. Rogue taxon removal was effective in increasing support values compared with strategies that maximize gene coverage or minimize missing data.


Dividing most acalyptrate fly groups into four major lineages is supported consistently across analyses. Understanding the fundamental branching patterns of schizophoran flies provides a foundation for future comparative research on their genetics, ecology, and biocontrol.


Sample collection, preservation, and transcriptome sequencing

Novel transcriptome data for this manuscript originated from three sources: the 1,000 Insect Transcriptome Evolution Project (1KITE), North Carolina State University (NCSU), and the National University of Singapore (NUS). The laboratory and data processing workflows were similar and compatible for data from all three sources. Generally, to preserve tissue for RNA sequencing, specimens were collected live into RNAlater and stored at -20 °C, or into 95% ethanol and stored at -80 °C. The cuticle was broken to allow the preservative to penetrate the exoskeleton and enter the muscle tissue. Samples were examined in an ice bath under dissecting microscopes to verify vouchers and perform identifications based on museum comparisons and primary literature. Extractions were performed with the RNeasy kit (Qiagen, Valencia, CA) on thoracic tissue unless the flies were very small. New transcriptome samples underwent library preparation using the NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA), following the manufacturer's guidelines. RNA was bound to Agencourt AMPure XP Beads (Beckman Coulter, Inc., Brea, CA, USA) on a magnetic plate and the sample underwent a series of washes. A reverse transcription reaction was performed, followed by a PCR enrichment, yielding a size-selected non-directional cDNA library that was sequenced as paired-end reads on an Illumina system (Illumina, San Diego, CA, USA). Double indexes were used where possible to reduce sample misidentification during demultiplexing. Samples were multiplexed with up to eight taxa per lane on Illumina MiSeq and 22 taxa per lane on Illumina HiSeq. Read quality was checked with FastQC v. 0.11.5 to assess whether further trimming was necessary. Trimmomatic v. 0.32 was used to remove adapter contamination and low-quality sequences. Trinity v. 2.2 and v. 2.4 were used to assemble the reads into contigs.

Orthology search and alignment

We used an ortholog reference set comprising 3,145 single-copy protein-coding genes, termed “Mecopterida”. This set includes annotated genomes and transcriptomes from five reference species: Drosophila melanogaster, Glossina morsitans, Aedes aegypti, Bombyx mori (silkworm moth, an outgroup), and Danaus plexippus (monarch butterfly, an outgroup) from OrthoDB7. Orthograph was used to assign orthology to all target taxa using the relaxed reciprocal BLAST hit criterion and otherwise default settings.

Stop codons were masked and each gene was aligned individually with the L-INS-i algorithm implemented in MAFFT v. 2.273 [42]. Outlier sequences were identified and realigned or removed. Ambiguously aligned positions were identified with Aliscore v. 2.0 and removed with Alicut v. 2.1. Genes with no information content were identified with MARE v. 0.1.2 and removed. Pal2Nal v. 14 was used to map the nucleotide sequences onto the amino acid-based alignment.
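The final step above, mapping nucleotides onto an amino acid alignment, can be illustrated with a minimal Python sketch. This is not Pal2Nal's implementation; the function name `thread_codons` and the toy sequences are hypothetical, and the sketch assumes the coding sequence is in frame and contains exactly three nucleotides per aligned residue.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Place the codons of an unaligned coding sequence under an aligned protein.

    Hypothetical helper for illustration only: each residue consumes the next
    codon of `cds`, and each alignment gap becomes a three-character gap.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("coding sequence length must be 3 x number of residues")

    codons = []
    pos = 0  # index of the next unused nucleotide in cds
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example: three residues (M, K, L) with one alignment gap.
print(thread_codons("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

Because gaps are inserted in whole codons, the resulting nucleotide alignment keeps the reading frame of the protein alignment, which is what makes later codon-position filtering possible.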

Usage Notes

See the BeyondDrosophilaDataDryadReadme.txt file for an exact explanation of each file.

Alignments analyzed for this project: 

1: Amino acids (AA) all genes

2: AA MARE reduced 1130 genes

3: AA MARE reduced 1130 genes, AliStat reduced to 80% coverage by site

4: Nucleotides all genes, analyzed with 3rd codon positions removed (all codon positions remain in the file)

5: Nucleotides reduced 1130 genes, analyzed with 3rd codon positions removed (all codon positions remain in the file)
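
Because the nucleotide files retain all codon positions, anyone repeating an analysis without 3rd positions has to exclude them first. The Python sketch below shows one way this might be done. It is an illustration only: the function name `drop_third_positions` and the toy alignment are hypothetical, and it assumes every sequence starts in frame at the first column and has a length that is a multiple of three. Check the readme for the actual file layout and partition boundaries before applying anything like this to the deposited files.

```python
from typing import Dict


def drop_third_positions(codon_alignment: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the alignment keeping only codon positions 1 and 2.

    Assumption (not taken from the dataset): column 0 is the first position
    of a codon and every sequence length is divisible by three.
    """
    reduced = {}
    for name, seq in codon_alignment.items():
        if len(seq) % 3 != 0:
            raise ValueError(f"{name}: length {len(seq)} is not a multiple of 3")
        # Take the first two characters of every three-character codon.
        reduced[name] = "".join(seq[i:i + 2] for i in range(0, len(seq), 3))
    return reduced


# Toy alignment with hypothetical taxon labels.
toy = {
    "taxon_A": "ATGAAACTG",
    "taxon_B": "ATG---CTA",
}
print(drop_third_positions(toy))
# {'taxon_A': 'ATAACT', 'taxon_B': 'AT--CT'}
```

Many phylogenetics programs can instead exclude 3rd positions through a partition or character-set definition, which avoids rewriting the alignment file.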


National Science Foundation, Award: DEB-1257960