Data from: Phylogenomics illuminates the complex evolutionary history of Bibionomorpha (Diptera)

Data files

Mar 09, 2026 version files 595.69 MB

Abstract

Bibionomorpha is a large and diverse dipteran infraorder whose composition and family-level relationships have long been debated. In this study, we constructed a phylogenomic tree of Bibionomorpha using an extensive dataset of transcriptomic, genomic, and anchored hybrid enrichment data. To further investigate the evolutionary timeline of the group, we also generated a fossil-calibrated timetree using data-driven calibration priors. Different data types and models were utilized to produce a robust backbone phylogeny. Bibionomorpha, comprising Bibionoidea, Sciaroidea, Anisopodoidea, and Scatopsoidea, is recovered as the sister group to Brachycera. Axymyiidae and Perissommatidae, which have been considered part of Bibionomorpha by some authorities, are instead recovered as sister to Bibionomorpha + Brachycera. We discuss recalcitrant nodes within the infraorder, particularly regarding the placements of Bolitophilidae and Cecidomyiidae. Phylogenetic network analysis suggests a possible reticulation event for Bolitophilidae, while mitochondrial data support a highly sex-specific hybridization event between ancestral Cecidomyiidae and the Sciaridae + Diadocidiidae group. Timetree analyses suggest a Lower Triassic or deeper origin of Bibionomorpha, with implications of ancient explosive radiations.

Dataset DOI: 10.5061/dryad.vq83bk472

Description of the data and file structure

Sequence assemblies, alignments, tree files, and log files accompanying the paper "Phylogenomics illuminates the complex evolutionary history of Bibionomorpha (Diptera)".

Files and variables

File: transcriptome_assemblies.zip

Description: Collection of transcriptome assemblies generated in the current study.

We used transcriptomes sequenced as part of the 1K Insect Transcriptome Evolution (1KITE) Project (Misof et al., 2014) together with publicly available transcriptome and genome sequences from GenBank. 1KITE transcriptome samples were collected in RNAlater and sequenced at BGI (Beijing Genomics Institute, China) on the Illumina HiSeq 2000 (Illumina, San Diego, CA, USA) platform, following the protocols of Misof et al. (2014) and Peters et al. (2017).

Raw transcriptome reads were then assembled using the multi-assembler pipeline TransPi v 1.3.0 (Rivera-Vicéns et al., 2022) to obtain de novo consensus transcriptome assemblies. Briefly, raw reads were first checked with FastQC v 0.11.9 (Andrews, 2010), followed by filtering and adapter removal with fastp v 0.24.0 (Chen, 2023). Processed reads were then assembled with rnaSPAdes v 3.15.3 (Bushmanova et al., 2019), Trans-ABySS v 2.0.1 (Robertson et al., 2010), SOAPdenovo-Trans v 1.03 (Xie et al., 2014), and Velvet v 1.2.10/Oases v 0.2.09 (Zerbino & Birney, 2008; Schulz et al., 2012) using k-mer set C of Rivera-Vicéns et al. (2022), and also with Trinity v 2.15.2 (Grabherr et al., 2011). The resulting assemblies were reduced to consensus assemblies with EvidentialGene v 2019.05.14 (Gilbert, 2013; 2019) and then decontaminated using MCSC (Lafond-Lapalme et al., 2017) to remove transcripts that were putatively non-arthropod in origin.

File naming convention

(family name)_(genus name)_(species epithet)_T_decont.fasta

e.g., Anisopodidae_Sylvicola_dubius_T_decont.fasta.

T_decont stands for "transcriptome", "decontaminated".

File: AHE_assemblies.zip

Description: Collection of AHE assemblies generated in the current study.

We also used anchored hybrid enrichment (AHE; Lemmon et al., 2012) data in addition to transcriptome data for phylogenetic analyses. DNA was extracted from ethanol-preserved samples using the OmniPrep™ for Tissue Kit (Cat. #786–395; G-Biosciences®, USA), following the manufacturer's instructions. We followed previously published methods of Young et al. (2016) for library construction, using the same probes and procedures. Final reads were then assembled using SOAPdenovo 2 v r241 (Luo et al., 2012).

File naming convention

(family name)_(genus name)_(species epithet)_AHE.fasta

e.g., Canthyloscelidae_Synneuron_decipiens_AHE.fasta.
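The two naming conventions above can be split back into their parts programmatically. The following is a minimal sketch, assuming that family and genus names never contain underscores and that every file ends in either "_T_decont.fasta" or "_AHE.fasta"; the function and class names are hypothetical and are not part of the archived scripts.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the naming conventions described in this README.
# Assumption: family and genus contain no underscores; the species epithet may.
_NAME_RE = re.compile(
    r"^(?P<family>[^_]+)_(?P<genus>[^_]+)_(?P<species>.+)"
    r"_(?P<kind>T_decont|AHE)\.fasta$"
)


class AssemblyName(NamedTuple):
    family: str
    genus: str
    species: str
    data_type: str  # "transcriptome" or "AHE"


def parse_assembly_name(filename: str) -> Optional[AssemblyName]:
    """Return the parsed parts of an assembly file name, or None if it does not match."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    data_type = "transcriptome" if match.group("kind") == "T_decont" else "AHE"
    return AssemblyName(
        match.group("family"),
        match.group("genus"),
        match.group("species"),
        data_type,
    )
```

For example, parse_assembly_name("Anisopodidae_Sylvicola_dubius_T_decont.fasta") yields the family "Anisopodidae", the genus "Sylvicola", the species epithet "dubius", and the data type "transcriptome". Names that do not follow either convention return None rather than raising an error.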

File: Analysis_n_scripts.zip

Description: Analysis files including alignments, tree files, logs, etc.

og_filtering

|__ data

  |__ Bibionomorpha_OrthologousMatrix.txt

|__ functions

  |__ og_filter.R

  |__ threshold.R

|__ result

  |__ og_Bibionomorpha.txt

  |__ og_Bibionomorpha_noOut.txt

|__ og_filtering.R

|__ pre_filtering.R

This folder includes scripts and results for orthogroup (OG) filtering, as described in the main article. The "data" folder contains the output of OMA standalone v2.6.0 (Altenhoff et al., 2019) run on high-quality reference genomes of Bibionomorpha, which was used as input for the OG filtering pipeline. The "functions" folder includes R functions that perform OG filtering, and the "result" folder includes the output of OG filtering. "og_Bibionomorpha_noOut.txt" (filtering run excluding the outgroup, Drosophila melanogaster) was the file used for downstream analyses. The "og_filtering.R" script calls the functions in the "functions" folder and runs the OG filtering itself. "pre_filtering.R" pre-filters OGs before OG filtering by setting a length threshold and a taxonomic coverage threshold for each OG. The resulting list of OGs was further filtered according to the procedure detailed in the main article, and the corresponding OG alignments were then used for downstream analyses.
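The pre-filtering idea (keeping only OGs that pass a length threshold and a taxonomic coverage threshold) can be illustrated with a short sketch. This is not a translation of "pre_filtering.R": the input structure, the use of the median sequence length, and the parameter names are all assumptions made for illustration, and the actual thresholds are those given in the main article.

```python
from statistics import median
from typing import Dict, List


def prefilter_orthogroups(
    og_lengths: Dict[str, Dict[str, int]],
    n_taxa: int,
    min_median_length: int,
    min_coverage: float,
) -> List[str]:
    """Keep OG identifiers that pass both a length and a coverage threshold.

    og_lengths maps an OG identifier to {taxon: sequence length} (hypothetical layout).
    Coverage is the fraction of the n_taxa taxa that are present in the OG.
    """
    kept = []
    for og_id, lengths in og_lengths.items():
        if not lengths:
            continue  # an OG with no sequences cannot pass either threshold
        coverage = len(lengths) / n_taxa
        if coverage >= min_coverage and median(lengths.values()) >= min_median_length:
            kept.append(og_id)
    return kept
```

With three taxa, a minimum median length of 200, and a minimum coverage of 0.6, an OG present in all three taxa with lengths around 300 is kept, whereas an OG with short sequences or one found in a single taxon is dropped.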

phylogenetic_analysis_genomic_concat

|__ AA_Partitioned

|__ AA_Partitioned_noAHE

|__ AA_Subsample_noRF

|__ AA_Mixture_Subsample_noRF

|__ AA_Partitioned_Subsample_noRF

|__ AA_Subsample_RF

|__ AA_Mixture_Subsample_RF

|__ AA_Partitioned_Subsample_RF

|__ NT2_Partitioned

|__ NT2_Partitioned_noAHE

|__ NT2_Subsample_noRF

|__ NT2_Mixture_Subsample_noRF

|__ NT2_Partitioned_Subsample_noRF

|__ NT2_Subsample_RF

|__ NT2_Mixture_Subsample_RF

|__ NT2_Partitioned_Subsample_RF

This folder includes results of phylogenetic analyses of the concatenated genomic data, done using IQ-TREE 3 v 3.0.1 (Wong et al., 2025). "AA" stands for the amino acid alignment, and "NT2" stands for the nucleotide alignment using only the second codon position. "Partitioned" stands for partitioned analysis, and "Mixture" stands for either profile mixture (AA) or nucleotide mixture (NT2) analysis. "noAHE" stands for analyses done excluding the AHE data. "Subsample" stands for analyses done on data subsampled using genesortR (Mongiardino Koch, 2021; Mongiardino Koch & Thompson, 2021). "RF" and "noRF" stand for analyses done with or without the topological filtering option implemented in genesortR.
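The folder-name tokens described above can be decoded mechanically, which is convenient when looping over all analysis folders. The sketch below is a hypothetical helper, not part of the archived scripts; it assumes that tokens are separated by underscores and that a folder name without "Partitioned" or "Mixture" leaves the model unspecified.

```python
from typing import Dict, Optional

_KNOWN_FLAGS = {"Partitioned", "Mixture", "Subsample", "RF", "noRF", "noAHE"}


def decode_analysis_folder(name: str) -> Dict[str, Optional[object]]:
    """Translate a folder name such as 'AA_Mixture_Subsample_RF' into its settings."""
    tokens = name.split("_")
    data = tokens[0]
    if data not in ("AA", "NT2"):
        raise ValueError(f"unexpected data type in folder name: {name}")
    flags = set(tokens[1:])
    unknown = flags - _KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unexpected tokens in folder name: {sorted(unknown)}")

    if "Partitioned" in flags:
        model = "Partitioned"
    elif "Mixture" in flags:
        model = "Mixture"
    else:
        model = None  # the folder name does not state the model

    if "RF" in flags:
        topological_filter = True
    elif "noRF" in flags:
        topological_filter = False
    else:
        topological_filter = None  # not a genesortR-subsampled analysis

    return {
        "data": data,
        "model": model,
        "subsampled": "Subsample" in flags,
        "topological_filter": topological_filter,
        "ahe_included": "noAHE" not in flags,
    }
```

For instance, "NT2_Partitioned_noAHE" decodes to a partitioned NT2 analysis without subsampling and without the AHE data.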

Each folder includes the input alignment file (*.fasta), the partition information file (if applicable, *.nex), an .iqtree file listing model selection results (only for the Partitioned and Partitioned_noAHE folders), the resulting phylogenies from 3 replicate runs (.runtrees), and the final tree with the best log-likelihood (.treefile). AA Mixture folders additionally include a file listing site-specific frequencies (.sitefreq) and the profile mixture specification (udm_hogenom_0064_lclr_iqtree.nex; from Schrempf et al., 2020).

phylogenetic_analysis_genomic_MSC

|__ AA

|__ gene_tree_variance

|__ NT2

|__ NT12

This folder includes results of phylogenetic analyses done using wASTRAL v1.22.3.7 (Zhang & Mirarab, 2022) implemented in ASTER (Zhang et al., 2025). "AA" stands for the amino acid alignment, "NT2" stands for the nucleotide alignment using only the second codon position, and "NT12" stands for the nucleotide alignment using the first and second codon positions. "AA", "NT2", and "NT12" each contain an input ".trees" file listing the gene trees inferred using IQ-TREE 3 v 3.0.1, and a "tree" file for the multi-species coalescent tree inferred using wASTRAL.
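A quick way to inspect a ".trees" file is to count in how many gene trees each taxon appears. The sketch below assumes one Newick tree per line with unquoted taxon labels; this is the usual layout for gene-tree lists, but it is an assumption about these files rather than something stated above, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Set

# A leaf label follows "(" or "," and runs up to the next Newick delimiter.
# Internal node labels (e.g. support values) follow ")" and are therefore skipped.
_LEAF_RE = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_labels(newick: str) -> Set[str]:
    """Return the set of leaf labels in a single Newick string (unquoted labels only)."""
    return set(_LEAF_RE.findall(newick))


def taxon_occupancy(lines: Iterable[str]) -> Counter:
    """Count, for each taxon, the number of gene trees that contain it."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts.update(leaf_labels(line))
    return counts
```

Because taxon_occupancy accepts any iterable of lines, an open file handle can be passed directly, for example `with open(path) as handle: taxon_occupancy(handle)`, where `path` points to one of the ".trees" files.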

"gene_tree_variance" includes imputed gene trees for AA, NT2, and NT12 (.trees files) obtained using tripVote v1.2 (Mai & Mirarab, 2022), and results of gene tree topological-dispersion calculation ("dist_TV_result.txt") done using R package TreeDist v 2.9.2 (Smith, 2022), as detailed in the main article.

phylogenetic_analysis_mitochondrial

|__ AA_Mixture

|__ AA_Partitioned

|__ NT_codon12_Mixture

|__ NT_codon12_Partitioned

|__ NT_codon123_Partitioned

|__ Tree_TopoTest

  |__ AA_Mix_treetest

  |__ AA_Partitioned_treetest

  |__ NT12_Mixture_treetest

  |__ NT12_Partitioned_treetest

  |__ NT123_Partitioned_treetest

This folder includes results of phylogenetic analyses of the mitochondrial data, done using IQ-TREE 3 v 3.0.1 (Wong et al., 2025). The naming convention for files in each folder is identical to that of "phylogenetic_analysis_genomic_concat", except for the "Tree_TopoTest" folder. The "Tree_TopoTest" folder includes the results of the tree topology tests done for each alignment, which are listed in the .iqtree file.

phylogenetic_network

This folder includes results of phylogenetic network analyses done for AA gene trees using MSCquartets v 3.2 (Rhodes et al., 2021). "AA_MSCquartet_alpha.R" is the script used for the analysis. "combined_AA_bibio.trees" lists the AA gene trees used as input, Network_.tree/nexml includes the phylogenetic networks inferred using a specific alpha value threshold, and ToB_.tree includes the tree of blobs inferred using a specific alpha value threshold.

BBB

This folder includes results of fossil-based age distribution estimation using rootBBB v 0.2 (Silvestro et al., 2021; Carlisle et al., 2023). "extant_species.txt" lists the number of extant species for each clade of interest, and "fossil_counts.txt" lists the number of fossil species per time bin for each clade of interest. These two files were used as input for rootBBB, and the resulting MCMC log files are named "(clade name)_mcmc_(seed number)_f0.95_qvar.log".

MCMCtree

|__ AA_Sub_RF_Mixture

  |__ BSS_1

  |__ BSS_2

|__ NT2_Sub_RF_Mixture

  |__ BSS_1

  |__ BSS_2

This folder includes results of phylogenetic dating done using the IQ2MC (Demotte et al., 2025) and MCMCtree (Yang, 2007) pipeline implemented in IQ-TREE 3 v 3.0.1 and PAML v 4.10.8.iq2mc. We generated two timetrees, one each for the "AA_Sub_RF_Mixture" and "NT2_Sub_RF_Mixture" trees. Each folder includes the input and output of the IQ2MC pipeline. The inputs are the alignment (.fa), the tree topology (.treefile), and, for AA only, the profile mixture specification file "udm_hogenom_0064_lclr_iqtree.nex". The outputs are the .iqtree file reporting a summary of the analysis, the dummy.phy file (a dummy alignment that is fed to MCMCtree), the .ctl file specifying the MCMCtree run, the calculated Hessian matrix (.hessian), and the input tree file for MCMCtree (.nwk).

BSS_1 and BSS_2 include MCMCtree results obtained using two different fossil calibration sets. The inputs from the IQ2MC pipeline are contained in each folder, but with a different calibration prior specification in the .nwk file. MCMC logs are printed in the .log file, and the run report is given in the .out file. The resulting timetree is summarized in the file "FigTree.tre".

References

Altenhoff, A. M., Levy, J., Zarowiecki, M., Tomiczek, B., Vesztrocy, A. W., Dalquen, D. A., Müller, S., Telford, M. J., Glover, N. M., Dylus, D., & Dessimoz, C. (2019). OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Research, 29(7), 1152-1163. https://doi.org/10.1101/gr.243212.118

Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. (Accessed Nov. 17, 2024). https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Bushmanova, E., Antipov, D., Lapidus, A., & Prjibelski, A. D. (2019). rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience, 8(9), giz100. https://doi.org/10.1093/gigascience/giz100

Carlisle, E., Janis, C. M., Pisani, D., Donoghue, P. C. J., & Silvestro, D. (2023). A timescale for placental mammal diversification based on Bayesian modeling of the fossil record. Current Biology, 33(15), 3073-3082. https://doi.org/10.1016/j.cub.2023.06.016

Chen, S. (2023). Ultrafast one‐pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta, 2(2), e107. https://doi.org/10.1002/imt2.107

Demotte, P., Panchaksaram, M., Kumarasinghe, H., Ly-Trong, N., dos Reis, M., & Minh, B. Q. (2025). IQ2MC: a new framework to infer phylogenetic time trees using IQ-TREE 3 and MCMCtree with mixture models. EcoEvoRxiv https://doi.org/10.32942/X2CD2X

Gilbert, D. (2013). Gene-omes built from mRNA seq not genome DNA. 7th annual arthropod genomics symposium. Notre Dame. F1000Research 2013, 5, 1695. https://doi.org/10.7490/f1000research.1112594.1

Gilbert, D. G. (2019). Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes?. bioRxiv. https://doi.org/10.1101/829184

Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B. W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., & Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29(7), 644-652. https://doi.org/10.1038/nbt.1883

Lafond-Lapalme, J., Duceppe, M. O., Wang, S., Moffett, P., & Mimee, B. (2017). A new method for decontamination of de novo transcriptomes using a hierarchical clustering algorithm. Bioinformatics, 33(9), 1293-1300. https://doi.org/10.1093/bioinformatics/btw793

Lemmon, A. R., Emme, S. A., & Lemmon, E. M. (2012). Anchored hybrid enrichment for massively high-throughput phylogenomics. Systematic Biology, 61(5), 727-744. https://doi.org/10.1093/sysbio/sys049

Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., Tang, J., Wu, G., Zhang, H., Shi, Y., Liu, Y., Yu, C., Wang, B., Lu, Y., Han, C., Cheung, D. W., Yiu, S. M., Peng, S., Zhu, X., Liu, G., Liao, X., Li, Y., Yang, H., Wang, J., Lam, T. W., & Wang, J. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1(1), 18. https://doi.org/10.1186/2047-217X-1-18

Mai, U., & Mirarab, S. (2022). Completing gene trees without species trees in sub-quadratic time. Bioinformatics, 38(6), 1532-1541. https://doi.org/10.1093/bioinformatics/btab875

Misof, B., Liu, S., Meusemann, K., Peters, R. S., Donath, A., Mayer, C., Frandsen, P. B., Ware, J., Flouri, T., Beutel, R. G., Niehuis, O., Peterson, M., Izquierdo-Carrasco, F., Wappler, T., Rust, J., Aberer, A. J., Aspöck, U., Aspöck, H., Bartel, D., Blanke, A., Berger, S., Böhm, A., Buckley, T. R., Calcott, B., Chen, J., Friedrich, F., Fukui, M., Fujita, M., Greve, C., Grobe, P., Gu, S., Huang, Y., Jermiin, L. S., Kawahara, A. Y., Krogmann, L., Kubiak, M., Lanfear, R., Letsch, H., Li, Y., Li, Z., Li, J., Lu, H., Machida, R., Mashimo, Y., Kapli, P., McKenna, D. D., Meng, G., Nakagaki, Y., Navarrete-Heredia, J. L., Ott, M., Ou, Y., Pass, G., Podsiadlowski, L., Pohl, H., von Reumont, B. M., Schütte, K., Sekiya, K., Shimizu, S., Slipinski, A., Stamatakis, A., Song, W., Su, X., Szucsich, N. U., Tan, M., Tan, X., Tang, M., Tang, J., Timelthaler, G., Tomizuka, S., Trautwein, M., Tong, X., Uchifune, T., Walzl, M. G., Wiegmann, B. M., Wilbrandt, J., Wipfler, B., Wong, T. K. F., Wu, Q., Wu, G., Xie, Y., Yang, S., Yang, Q., Yeates, D. K., Yoshizawa, K., Zhang, Q., Zhang, R., Zhang, W., Zhang, Y., Zhao, J., Zhou, C., Zhou, L., Ziesmann, T., Zou, S., Li, Y., Xu, X., Zhang, Y., Yang, H., Wang, J., Wang, J., Kjer, K., & Zhou, X. (2014). Phylogenomics resolves the timing and pattern of insect evolution. Science, 346(6210), 763-767. https://doi.org/10.1126/science.1257570

Mongiardino Koch, N. (2021). Phylogenomic subsampling and the search for phylogenetically reliable loci. Molecular Biology and Evolution, 38(9), 4025-4038. https://doi.org/10.1093/molbev/msab151

Mongiardino Koch, N., & Thompson, J. R. (2021). A total-evidence dated phylogeny of Echinoidea combining phylogenomic and paleontological data. Systematic Biology, 70(3), 421-439. https://doi.org/10.1093/sysbio/syaa069

Peters, R. S., Krogmann, L., Mayer, C., Donath, A., Gunkel, S., Meusemann, K., Kozlov, A., Podsiadlowski, L., Petersen, M., Lanfear, R., Diez, P. A., Heraty, J., Kjer, K. M., Klopfstein, S., Meier, R., Polidori, C., Schmitt, T., Liu, S., Zhou, X., Wappler, T., Rust, J., Misof, B., & Niehuis, O. (2017). Evolutionary history of the Hymenoptera. Current Biology, 27(7), 1013-1018. https://doi.org/10.1016/j.cub.2017.01.027

Rhodes, J. A., Baños, H., Mitchell, J. D., & Allman, E. S. (2021). MSCquartets 1.0: quartet methods for species trees and networks under the multispecies coalescent model in R. Bioinformatics, 37(12), 1766-1768. https://doi.org/10.1093/bioinformatics/btaa868

Rivera‐Vicéns, R. E., Garcia‐Escudero, C. A., Conci, N., Eitel, M., & Wörheide, G. (2022). TransPi—a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly. Molecular Ecology Resources, 22(5), 2070-2086. https://doi.org/10.1111/1755-0998.13593

Robertson, G., Schein, J., Chiu, R., Corbett, R., Field, M., Jackman, S. D., Mungall, K., Lee, S., Okada, H. M., Qian, J. Q., Griffith, M., Raymond, A., Thiessen, N., Cezard, T., Butterfield, Y. S., Newsome, R., Chan, S. K., She, R., Varhol, R., Kamoh, B., Prabhu, A. L., Tam, A., Zhao, Y., Moore, R. A., Hirst, M., Marra, M. A., Jones, S. J. M., Hoodless, P. A., & Birol, I. (2010). De novo assembly and analysis of RNA-seq data. Nature Methods, 7(11), 909-912. https://doi.org/10.1038/nmeth.1517

Schrempf, D., Lartillot, N., & Szöllősi, G. (2020). Scalable empirical mixture models that account for across-site compositional heterogeneity. Molecular Biology and Evolution, 37(12), 3616-3631. https://doi.org/10.1093/molbev/msaa145

Schulz, M. H., Zerbino, D. R., Vingron, M., & Birney, E. (2012). Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 28(8), 1086-1092. https://doi.org/10.1093/bioinformatics/bts094

Silvestro, D., Bacon, C. D., Ding, W., Zhang, Q., Donoghue, P. C. J., Antonelli, A., & Xing, Y. (2021). Fossil data support a pre-Cretaceous origin of flowering plants. Nature Ecology & Evolution, 5(4), 449-457. https://doi.org/10.1038/s41559-020-01387-8

Smith, M. R. (2022). Robust analysis of phylogenetic tree space. Systematic Biology, 71(5), 1255-1270. https://doi.org/10.1093/sysbio/syab100

Wong, T. K., Ly-Trong, N., Ren, H., Baños, H., Roger, A. J., Susko, E., Bielow, C., De Maio, N., Goldman, N., Hahn, M. W., Huttley, G., Lanfear, R., & Minh, B. Q. (2025). IQ-TREE 3: phylogenomic inference software using complex evolutionary models. EcoEvoRxiv https://doi.org/10.32942/X2P62N

Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., Huang, W., He, G., Gu, S., Li, S., Zhou, X., Lam, T. W., Li, Y., Xu, X., Wong, G. K. S., & Wang, J. (2014). SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30(12), 1660-1666. https://doi.org/10.1093/bioinformatics/btu077

Young, A. D., Lemmon, A. R., Skevington, J. H., Mengual, X., Ståhls, G., Reemer, M., Jordaens, K., Kelso, S., Lemmon, E. M., Hauser, M., De Meyer, M., Misof, B., & Wiegmann, B. M. (2016). Anchored enrichment dataset for true flies (order Diptera) reveals insights into the phylogeny of flower flies (family Syrphidae). BMC Evolutionary Biology, 16, 143. https://doi.org/10.1186/s12862-016-0714-0

Zerbino, D. R., & Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821-829. https://doi.org/10.1101/gr.074492.107

Zhang, C., & Mirarab, S. (2022). Weighting by gene tree uncertainty improves accuracy of quartet-based species trees. Molecular Biology and Evolution, 39(12), msac215. https://doi.org/10.1093/molbev/msac215

Zhang, C., Nielsen, R., & Mirarab, S. (2025). ASTER: A package for large-scale phylogenomic reconstructions. Molecular Biology and Evolution, 42(8), msaf172. https://doi.org/10.1093/molbev/msaf172