Data from: Phylogenomics illuminates the complex evolutionary history of Bibionomorpha (Diptera)
Data files
Mar 09, 2026 version files 595.69 MB
- AHE_assemblies.zip (22.31 MB)
- Analysis_n_scripts.zip (402.10 MB)
- README.md (20.84 KB)
- transcriptome_assemblies.zip (171.26 MB)
Abstract
Bibionomorpha is a large and diverse dipteran infraorder whose composition and family-level relationships have long been debated. In this study, we constructed a phylogenomic tree of Bibionomorpha using an extensive dataset of transcriptomic, genomic, and anchored hybrid enrichment data. To further investigate the evolutionary timeline of the group, we also generated a fossil-calibrated timetree using data-driven calibration priors. Different data types and models were utilized to produce a robust backbone phylogeny. Bibionomorpha, comprising Bibionoidea, Sciaroidea, Anisopodoidea, and Scatopsoidea, is recovered as the sister group to Brachycera. Axymyiidae and Perissommatidae, which have been considered part of Bibionomorpha by some authorities, are instead recovered as sister to Bibionomorpha + Brachycera. We discuss recalcitrant nodes within the infraorder, particularly regarding the placements of Bolitophilidae and Cecidomyiidae. Phylogenetic network analysis suggests a possible reticulation event for Bolitophilidae, while mitochondrial data support a highly sex-specific hybridization event between ancestral Cecidomyiidae and the Sciaridae + Diadocidiidae group. Timetree analyses suggest a Lower Triassic or deeper origin of Bibionomorpha, with implications of ancient explosive radiations.
Dataset DOI: 10.5061/dryad.vq83bk472
Description of the data and file structure
Sequence assemblies, alignments, tree files, and log files accompanying the paper: "Phylogenomics illuminates the complex evolutionary history of Bibionomorpha (Diptera)".
Files and variables
File: transcriptome_assemblies.zip
Description: Collection of transcriptome assemblies generated in the current study.
We used transcriptomes sequenced as part of the 1K Insect Transcriptome Evolution (1KITE) Project (Misof et al., 2014) together with publicly available transcriptome and genome sequences from GenBank. 1KITE transcriptome samples were collected in RNAlater and sequenced at BGI (Beijing Genomics Institute, China) on the Illumina HiSeq 2000 (Illumina, San Diego, CA, USA) platform, following the protocols of Misof et al. (2014) and Peters et al. (2017).
Raw transcriptome reads were then assembled using the multi-assembler pipeline TransPi v 1.3.0 (Rivera-Vicéns et al., 2022) to obtain de novo consensus transcriptome assemblies. Briefly, raw reads were first checked with FastQC v 0.11.9 (Andrews, 2010), followed by filtering and adapter removal with fastp v 0.24.0 (Chen, 2023). Processed reads were then assembled with rnaSPAdes v 3.15.3 (Bushmanova et al., 2019), Trans-ABySS v 2.0.1 (Robertson et al., 2010), SOAPdenovo-Trans v 1.03 (Xie et al., 2014), and Velvet v 1.2.10/Oases v 0.2.09 (Zerbino & Birney, 2008; Schulz et al., 2012) using k-mer set C of Rivera-Vicéns et al. (2022), and also with Trinity v 2.15.2 (Grabherr et al., 2011). The resulting assemblies were reduced to consensus assemblies with EvidentialGene v 2019.05.14 (Gilbert, 2013; 2019) and then decontaminated using MCSC (Lafond-Lapalme et al., 2017) to remove transcripts that were putatively non-arthropod in origin.
File naming convention
(family name)_(genus name)_(species epithet)_T_decont.fasta
e.g., Anisopodidae_Sylvicola_dubius_T_decont.fasta.
The suffix "T_decont" stands for "transcriptome, decontaminated".
File: AHE_assemblies.zip
Description: Collection of AHE assemblies generated in the current study.
We also used anchored hybrid enrichment (AHE; Lemmon et al., 2012) data in addition to transcriptome data for phylogenetic analyses. DNA was extracted from ethanol-preserved samples using the OmniPrep™ for Tissue Kit (Cat. #786–395; G-Biosciences®, USA), following the manufacturer's instructions. We followed previously published methods of Young et al. (2016) for library construction, using the same probes and procedures. Final reads were then assembled using SOAPdenovo 2 v r241 (Luo et al., 2012).
File naming convention
(family name)_(genus name)_(species epithet)_AHE.fasta
e.g., Canthyloscelidae_Synneuron_decipiens_AHE.fasta.
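The two naming conventions above can be split back into their parts programmatically. The following is a minimal sketch, not part of the deposited scripts: the function name `parse_assembly_name` and the `AssemblyName` record are hypothetical, and the pattern assumes that family and genus names contain no underscores.

```python
import re
from typing import NamedTuple, Optional


class AssemblyName(NamedTuple):
    """Parts of an assembly file name (hypothetical helper type)."""
    family: str
    genus: str
    species: str
    data_type: str  # "transcriptome" or "AHE"


# Assumption: family and genus contain no underscores; the species
# epithet is everything between the genus and the known suffix.
_ASSEMBLY_PATTERN = re.compile(
    r"^(?P<family>[^_]+)_(?P<genus>[^_]+)_(?P<species>.+?)"
    r"_(?P<suffix>T_decont|AHE)\.fasta$"
)


def parse_assembly_name(filename: str) -> Optional[AssemblyName]:
    """Return the parsed parts, or None if the name does not match."""
    match = _ASSEMBLY_PATTERN.match(filename)
    if match is None:
        return None
    data_type = "transcriptome" if match.group("suffix") == "T_decont" else "AHE"
    return AssemblyName(
        match.group("family"),
        match.group("genus"),
        match.group("species"),
        data_type,
    )


if __name__ == "__main__":
    for name in (
        "Anisopodidae_Sylvicola_dubius_T_decont.fasta",
        "Canthyloscelidae_Synneuron_decipiens_AHE.fasta",
    ):
        print(parse_assembly_name(name))
```

Returning None for non-matching names, rather than raising, makes it easy to skip unrelated files when iterating over an unzipped archive.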
File: Analysis_n_scripts.zip
Description: Analysis files including alignments, tree files, logs, etc.
-
og_filtering
|__ data
    |__ Bibionomorpha_OrthologousMatrix.txt
|__ functions
    |__ og_filter.R
    |__ threshold.R
|__ result
    |__ og_Bibionomorpha.txt
    |__ og_Bibionomorpha_noOut.txt
|__ og_filtering.R
|__ pre_filtering.R
This folder includes the scripts and results of orthogroup (OG) filtering, as described in the main article. The "data" folder contains the output of OMA standalone v2.6.0 (Altenhoff et al., 2019) run on high-quality reference genomes of Bibionomorpha, which was used as input for the OG filtering pipeline. The "functions" folder includes R functions to perform OG filtering, and the "result" folder includes the output of OG filtering. "og_Bibionomorpha_noOut.txt" (filtering run excluding the outgroup, Drosophila melanogaster) was used for the downstream analyses. The "og_filtering.R" script calls the functions in the "functions" folder and runs the OG filtering itself. "pre_filtering.R" pre-filters OGs before OG filtering by setting a length threshold and a taxonomic coverage threshold for each OG. The resulting list of OGs was further filtered according to the procedure detailed in the main article, and the alignments of those OGs were then used for downstream analyses.
-
phylogenetic_analysis_genomic_concat
|__ AA_Partitioned
|__ AA_Partitioned_noAHE
|__ AA_Subsample_noRF
    |__ AA_Mixture_Subsample_noRF
    |__ AA_Partitioned_Subsample_noRF
|__ AA_Subsample_RF
    |__ AA_Mixture_Subsample_RF
    |__ AA_Partitioned_Subsample_RF
|__ NT2_Partitioned
|__ NT2_Partitioned_noAHE
|__ NT2_Subsample_noRF
    |__ NT2_Mixture_Subsample_noRF
    |__ NT2_Partitioned_Subsample_noRF
|__ NT2_Subsample_RF
    |__ NT2_Mixture_Subsample_RF
    |__ NT2_Partitioned_Subsample_RF
This folder includes the results of phylogenetic analyses of the concatenated genomic data, performed with IQ-TREE 3 v 3.0.1 (Wong et al., 2025). "AA" stands for the amino acid alignment, and "NT2" stands for the nucleotide alignment using only second codon positions. "Partitioned" stands for partitioned analysis, and "Mixture" stands for either profile mixture (AA) or nucleotide mixture (NT2) analysis. "noAHE" marks analyses that exclude the AHE data. "Subsample" marks analyses of data subsampled with genesortR (Mongiardino Koch, 2021; Mongiardino Koch & Thompson, 2021). "RF" and "noRF" mark analyses done with or without the topological filtering option implemented in genesortR, respectively.
Each folder includes input alignment file (*.fasta), partition information file (if applicable, *.nex), .iqtree file listing model selection results (only for Partitioned and Partitioned_noAHE folders), resulting phylogeny from 3 replicate runs (.runtrees), and the final tree with the best log-likelihood (.treefile). AA Mixture folders additionally include file listing site-specific frequency (.sitefreq) and profile mixture specification (udm_hogenom_0064_lclr_iqtree.nex; from Schrempf et al., 2020).
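Because the folder names encode the analysis settings, they can be decoded mechanically when collecting results across runs. The sketch below is illustrative only: `parse_concat_folder` and `ConcatAnalysis` are hypothetical names that do not appear in the deposited scripts, and the token meanings follow the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConcatAnalysis:
    """Settings encoded in a folder name (hypothetical helper type)."""
    data_type: str                # "AA" or "NT2"
    model: Optional[str]          # "Partitioned", "Mixture", or None if absent
    no_ahe: bool                  # True if the "noAHE" token is present
    subsampled: bool              # True if the "Subsample" token is present
    topo_filter: Optional[bool]   # True for "RF", False for "noRF", None if absent


def parse_concat_folder(name: str) -> ConcatAnalysis:
    """Decode a folder name such as 'AA_Mixture_Subsample_RF'."""
    tokens = name.split("_")
    data_type = tokens[0]
    if data_type not in ("AA", "NT2"):
        raise ValueError(f"unexpected data type in folder name: {name!r}")
    flags = set(tokens[1:])

    if "Partitioned" in flags:
        model = "Partitioned"
    elif "Mixture" in flags:
        model = "Mixture"
    else:
        model = None

    if "RF" in flags:
        topo_filter = True
    elif "noRF" in flags:
        topo_filter = False
    else:
        topo_filter = None

    return ConcatAnalysis(
        data_type=data_type,
        model=model,
        no_ahe="noAHE" in flags,
        subsampled="Subsample" in flags,
        topo_filter=topo_filter,
    )


if __name__ == "__main__":
    for folder in ("AA_Partitioned_noAHE", "NT2_Mixture_Subsample_RF"):
        print(folder, "->", parse_concat_folder(folder))
```

The fields record only which tokens are present in the name; they make no claim about settings that the name does not state.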
-
phylogenetic_analysis_genomic_MSC
|__ AA
|__ gene_tree_variance
|__ NT2
|__ NT12
This folder includes the results of phylogenetic analyses performed with wASTRAL v1.22.3.7 (Zhang & Mirarab, 2022) as implemented in ASTER (Zhang et al., 2025). "AA" stands for the amino acid alignment, "NT2" stands for the nucleotide alignment using only second codon positions, and "NT12" stands for the nucleotide alignment using first and second codon positions. The "AA", "NT2", and "NT12" folders each contain an input ".trees" file listing the gene trees inferred with IQ-TREE 3 v 3.0.1, and a "tree" file with the multi-species coalescent tree inferred with wASTRAL.
"gene_tree_variance" includes imputed gene trees for AA, NT2, and NT12 (.trees files) obtained using tripVote v1.2 (Mai & Mirarab, 2022), and results of gene tree topological-dispersion calculation ("dist_TV_result.txt") done using R package TreeDist v 2.9.2 (Smith, 2022), as detailed in the main article.
-
phylogenetic_analysis_mitochondrial
|__ AA_Mixture
|__ AA_Partitioned
|__ NT_codon12_Mixture
|__ NT_codon12_Partitioned
|__ NT_codon123_Partitioned
|__ Tree_TopoTest
    |__ AA_Mix_treetest
    |__ AA_Partitioned_treetest
    |__ NT12_Mixture_treetest
    |__ NT12_Partitioned_treetest
    |__ NT123_Partitioned_treetest
This folder includes the results of phylogenetic analyses of the mitochondrial data, performed with IQ-TREE 3 v 3.0.1 (Wong et al., 2025). The naming convention for files in each folder is identical to that of "phylogenetic_analysis_genomic_concat", except for the "Tree_TopoTest" folder. The "Tree_TopoTest" folder includes the results of the tree topology tests done for each alignment, which are listed in the .iqtree file.
-
phylogenetic_network
This folder includes the results of phylogenetic network analyses of the AA gene trees, performed with MSCquartets v 3.2 (Rhodes et al., 2021). "AA_MSCquartet_alpha.R" is the script used for the analysis. "combined_AA_bibio.trees" contains the list of AA gene trees used as input. The Network_.tree/nexml files contain the phylogenetic networks inferred at a specific alpha threshold, and the ToB_.tree files contain the tree of blobs inferred at a specific alpha threshold.
-
BBB
This folder includes the results of fossil-based age distribution estimation using rootBBB v 0.2 (Silvestro et al., 2021; Carlisle et al., 2023). "extant_species.txt" lists the number of extant species for each clade of interest, and "fossil_counts.txt" lists the number of fossil species per time bin for each clade of interest. These two files were used as input for rootBBB, and the resulting MCMC log files are named "(clade name)_mcmc_(seed number)_f0.95_qvar.log".
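To group the MCMC logs by clade, the clade name and seed can be read from each log file name. This is a minimal sketch under two assumptions: the file names contain no spaces, and the seed is an integer. The function name `parse_bbb_log_name` is hypothetical, and the file names in the example are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <clade>_mcmc_<seed>_f0.95_qvar.log, with an integer seed.
_BBB_LOG_PATTERN = re.compile(
    r"^(?P<clade>.+)_mcmc_(?P<seed>\d+)_f0\.95_qvar\.log$"
)


def parse_bbb_log_name(filename: str) -> Optional[Tuple[str, int]]:
    """Return (clade name, seed), or None if the name does not match."""
    match = _BBB_LOG_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("clade"), int(match.group("seed"))


if __name__ == "__main__":
    # Illustrative file names only; the actual clade names and seeds
    # are those found in the BBB folder.
    print(parse_bbb_log_name("Sciaroidea_mcmc_1234_f0.95_qvar.log"))
    print(parse_bbb_log_name("extant_species.txt"))
```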
-
MCMCtree
|__ AA_Sub_RF_Mixture
    |__ BSS_1
    |__ BSS_2
|__ NT2_Sub_RF_Mixture
    |__ BSS_1
    |__ BSS_2
This folder includes the results of phylogenetic dating using the IQ2MC (Demotte et al., 2025) and MCMCtree (Yang, 2007) pipeline implemented in IQ-TREE 3 v 3.0.1 and PAML v 4.10.8.iq2mc. We generated timetrees for two trees, "AA_Sub_RF_Mixture" and "NT2_Sub_RF_Mixture". Each folder includes the inputs and outputs of the IQ2MC pipeline. The inputs are the alignment (.fa), the tree topology (.treefile), and, for AA only, the profile mixture specification file "udm_hogenom_0064_lclr_iqtree.nex". The outputs are the .iqtree file reporting a summary of the analysis, the dummy.phy file (a dummy alignment that is fed to MCMCtree), the .ctl file with the run specification for MCMCtree, the calculated Hessian matrix (.hessian), and the input tree file for MCMCtree (.nwk).
BSS_1 and BSS_2 include the MCMCtree results obtained with two different fossil calibration sets. The inputs from the IQ2MC pipeline are contained in each folder, but the .nwk file differs in its calibration prior specification. The MCMC logs are written to the .log file, and the run report to the .out file. The resulting timetree is summarized in the file "FigTree.tre".
References
Altenhoff, A. M., Levy, J., Zarowiecki, M., Tomiczek, B., Vesztrocy, A. W., Dalquen, D. A., Müller, S., Telford, M. J., Glover, N. M., Dylus, D., & Dessimoz, C. (2019). OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Research, 29(7), 1152-1163. https://doi.org/10.1101/gr.243212.118
Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. (Accessed Nov. 17, 2024). https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Bushmanova, E., Antipov, D., Lapidus, A., & Prjibelski, A. D. (2019). rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience, 8(9), giz100. https://doi.org/10.1093/gigascience/giz100
Carlisle, E., Janis, C. M., Pisani, D., Donoghue, P. C. J., & Silvestro, D. (2023). A timescale for placental mammal diversification based on Bayesian modeling of the fossil record. Current Biology, 33(15), 3073-3082. https://doi.org/10.1016/j.cub.2023.06.016
Chen, S. (2023). Ultrafast one‐pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta, 2(2), e107. https://doi.org/10.1002/imt2.107
Demotte, P., Panchaksaram, M., Kumarasinghe, H., Ly-Trong, N., dos Reis, M., & Minh, B. Q. (2025). IQ2MC: a new framework to infer phylogenetic time trees using IQ-TREE 3 and MCMCtree with mixture models. EcoEvoRxiv https://doi.org/10.32942/X2CD2X
Gilbert, D. (2013). Gene-omes built from mRNA seq not genome DNA. 7th annual arthropod genomics symposium. Notre Dame. F1000Research 2013, 5, 1695. https://doi.org/10.7490/f1000research.1112594.1
Gilbert, D. G. (2019). Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes?. bioRxiv. https://doi.org/10.1101/829184
Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B. W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., & Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29(7), 644-652. https://doi.org/10.1038/nbt.1883
Lafond-Lapalme, J., Duceppe, M. O., Wang, S., Moffett, P., & Mimee, B. (2017). A new method for decontamination of de novo transcriptomes using a hierarchical clustering algorithm. Bioinformatics, 33(9), 1293-1300. https://doi.org/10.1093/bioinformatics/btw793
Lemmon, A. R., Emme, S. A., & Lemmon, E. M. (2012). Anchored hybrid enrichment for massively high-throughput phylogenomics. Systematic Biology, 61(5), 727-744. https://doi.org/10.1093/sysbio/sys049
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., Tang, J., Wu, G., Zhang, H., Shi, Y., Liu, Y., Yu, C., Wang, B., Lu, Y., Han, C., Cheung, D. W., Yiu, S. M., Peng, S., Zhu, X., Liu, G., Liao, X., Li, Y., Yang, H., Wang, J., Lam, T. W., & Wang, J. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 4(1), 2047-217X-1-18. https://doi.org/10.1186/s13742-015-0069-2
Mai, U., & Mirarab, S. (2022). Completing gene trees without species trees in sub-quadratic time. Bioinformatics, 38(6), 1532-1541. https://doi.org/10.1093/bioinformatics/btab875
Misof, B., Liu, S., Meusemann, K., Peters, R. S., Donath, A., Mayer, C., Frandsen, P. B., Ware, J., Flouri, T., Beutel, R. G., Niehuis, O., Peterson, M., Izquierdo-Carrasco, F., Wappler, T., Rust, J., Aberer, A. J., Aspöck, U., Aspöck, H., Bartel, D., Blanke, A., Berger, S., Böhm, A., Buckley, T. R., Calcott, B., Chen, J., Friedrich, F., Fukui, M., Fujita, M., Greve, C., Grobe, P., Gu, S., Huang, Y., Jermiin, L. S., Kawahara, A. Y., Krogmann, L., Kubiak, M., Lanfear, R., Letsch, H., Li, Y., Li, Z., Li, J., Lu, H., Machida, R., Mashimo, Y., Kapli, P., McKenna, D. D., Meng, G., Nakagaki, Y., Navarrete-Heredia, J. L., Ott, M., Ou, Y., Pass, G., Podsiadlowski, L., Pohl, H., von Reumont, B. M., Schütte, K., Sekiya, K., Shimizu, S., Slipinski, A., Stamatakis, A., Song, W., Su, X., Szucsich, N. U., Tan, M., Tan, X., Tang, M., Tang, J., Timelthaler, G., Tomizuka, S., Trautwein, M., Tong, X., Uchifune, T., Walzl, M. G., Wiegmann, B. M., Wilbrandt, J., Wipfler, B., Wong, T. K. F., Wu, Q., Wu, G., Xie, Y., Yang, S., Yang, Q., Yeates, D. K., Yoshizawa, K., Zhang, Q., Zhang, R., Zhang, W., Zhang, Y., Zhao, J., Zhou, C., Zhou, L., Ziesmann, T., Zou, S., Li, Y., Xu, X., Zhang, Y., Yang, H., Wang, J., Wang, J., Kjer, K., & Zhou, X. (2014). Phylogenomics resolves the timing and pattern of insect evolution. Science, 346(6210), 763-767. https://doi.org/10.1126/science.1257570
Mongiardino Koch, N. (2021). Phylogenomic subsampling and the search for phylogenetically reliable loci. Molecular Biology and Evolution, 38(9), 4025-4038. https://doi.org/10.1093/molbev/msab151
Mongiardino Koch, N., & Thompson, J. R. (2021). A total-evidence dated phylogeny of Echinoidea combining phylogenomic and paleontological data. Systematic Biology, 70(3), 421-439. https://doi.org/10.1093/sysbio/syaa069
Peters, R. S., Krogmann, L., Mayer, C., Donath, A., Gunkel, S., Meusemann, K., Kozlov, A., Podsiadlowski, L., Petersen, M., Lanfear, R., Diez, P. A., Heraty, J., Kjer, K. M., Klopfstein, S., Meier, R., Polidori, C., Schmitt, T., Liu, S., Zhou, X., Wappler, T., Rust, J., Misof, B., & Niehuis, O. (2017). Evolutionary history of the Hymenoptera. Current Biology, 27(7), 1013-1018. https://doi.org/10.1016/j.cub.2017.01.027
Rhodes, J. A., Baños, H., Mitchell, J. D., & Allman, E. S. (2021). MSCquartets 1.0: quartet methods for species trees and networks under the multispecies coalescent model in R. Bioinformatics, 37(12), 1766-1768. https://doi.org/10.1093/bioinformatics/btaa868
Rivera‐Vicéns, R. E., Garcia‐Escudero, C. A., Conci, N., Eitel, M., & Wörheide, G. (2022). TransPi—a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly. Molecular Ecology Resources, 22(5), 2070-2086. https://doi.org/10.1111/1755-0998.13593
Robertson, G., Schein, J., Chiu, R., Corbett, R., Field, M., Jackman, S. D., Mungall, K., Lee, S., Okada, H. M., Qian, J. Q., Griffith, M., Raymond, A., Thiessen, N., Cezard, T., Butterfield, Y. S., Newsome, R., Chan, S. K., She, R., Varhol, R., Kamoh, B., Prabhu, A. L., Tam, A., Zhao, Y., Moore, R. A., Hirst, M., Marra, M. A., Jones, S. J. M., Hoodless, P. A., & Birol, I. (2010). De novo assembly and analysis of RNA-seq data. Nature Methods, 7(11), 909-912. https://doi.org/10.1038/nmeth.1517
Schrempf, D., Lartillot, N., & Szöllősi, G. (2020). Scalable empirical mixture models that account for across-site compositional heterogeneity. Molecular Biology and Evolution, 37(12), 3616-3631. https://doi.org/10.1093/molbev/msaa145
Schulz, M. H., Zerbino, D. R., Vingron, M., & Birney, E. (2012). Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 28(8), 1086-1092. https://doi.org/10.1093/bioinformatics/bts094
Silvestro, D., Bacon, C. D., Ding, W., Zhang, Q., Donoghue, P. C. J., Antonelli, A., & Xing, Y. (2021). Fossil data support a pre-Cretaceous origin of flowering plants. Nature Ecology & Evolution, 5(4), 449-457. https://doi.org/10.1038/s41559-020-01387-8
Smith, M. R. (2022). Robust analysis of phylogenetic tree space. Systematic Biology, 71(5), 1255-1270. https://doi.org/10.1093/sysbio/syab100
Wong, T. K., Ly-Trong, N., Ren, H., Baños, H., Roger, A. J., Susko, E., Bielow, C., De Maio, N., Goldman, N., Hahn, M. W., Huttley, G., Lanfear, R., & Minh, B. Q. (2025). IQ-TREE 3: phylogenomic inference software using complex evolutionary models. EcoEvoRxiv https://doi.org/10.32942/X2P62N
Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., Huang, W., He, G., Gu, S., Li, S., Zhou, X., Lam, T. W., Li, Y., Xu, X., Wong, G. K. S., & Wang, J. (2014). SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30(12), 1660-1666. https://doi.org/10.1093/bioinformatics/btu077
Young, A. D., Lemmon, A. R., Skevington, J. H., Mengual, X., Ståhls, G., Reemer, M., Jordaens, K., Kelso, S., Lemmon, E. M., Hauser, M., De Meyer, M., Misof, B., & Wiegmann, B. M. (2016). Anchored enrichment dataset for true flies (order Diptera) reveals insights into the phylogeny of flower flies (family Syrphidae). BMC Evolutionary Biology, 16, 143. https://doi.org/10.1186/s12862-016-0714-0
Zerbino, D. R., & Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821-829. https://doi.org/10.1101/gr.074492.107
Zhang, C., & Mirarab, S. (2022). Weighting by gene tree uncertainty improves accuracy of quartet-based species trees. Molecular Biology and Evolution, 39(12), msac215. https://doi.org/10.1093/molbev/msac215
Zhang, C., Nielsen, R., & Mirarab, S. (2025). ASTER: A package for large-scale phylogenomic reconstructions. Molecular Biology and Evolution, 42(8), msaf172. https://doi.org/10.1093/molbev/msaf172
