Data from: Evaluating UCE data adequacy and integrating uncertainty in a comprehensive phylogeny of ants

Borowiec, Marek; Zhang, Miles; Neves, Karen; Ramalho, Manuela; Fisher, Brian; Lucky, Andrea; Moreau, Corrie

Published Dec 28, 2024 on Dryad. https://doi.org/10.5061/dryad.547d7wmhb

Abstract

While some relationships in phylogenomic studies have remained stable since the Sanger sequencing era, many challenging nodes remain, even with genome-scale data. Incongruence or lack of resolution in the phylogenomic era is frequently attributed to inadequate data modeling and analytical issues that lead to systematic biases. However, few studies investigate the potential for random error or establish expectations for the level of resolution achievable with a given empirical dataset, and integrate uncertainties across methods when faced with conflicting results.

Ants are the most species-rich lineage of social insects and one of the most ecologically important terrestrial animals. Consequently, ants have garnered significant research attention, including their systematics. Despite this, there has been no comprehensive genus-level phylogeny of the ants inferred using genomic data that thoroughly evaluates both signal strength and incongruence. In this study, we provide insight into and quantify uncertainty across the ant tree of life by utilizing the most taxonomically comprehensive Ultraconserved Elements dataset of ants to date, including 277 (81%) of recognized ant genera from all 16 extant subfamilies, and representing over 98% of described species. We use simulations to establish expectations for resolution, identify branches with less-than-expected concordance, and dissect the effects of data and model selection on recalcitrant nodes.

Simulations show that hundreds of loci are needed to resolve recalcitrant nodes on our genus-level ant phylogeny. This demonstrates the continued role of random error in phylogenomic studies. Our analyses provide a comprehensive picture of support and incongruence across the ant phylogeny, while offering a more nuanced depiction of uncertainty and significantly expanding generic sampling. We use a consensus approach to integrate uncertainty across different analyses and find that assumptions about root age exert substantial influence on divergence dating. Our results suggest that advancing the understanding of ant phylogeny will require not only more data but also more refined phylogenetic models. We also provide a workflow for identifying under-supported nodes in concatenation analyses, outline a pragmatic way to reconcile conflicting results in phylogenomics, and introduce a user-friendly locus selection tool for divergence dating.

This dataset is associated with the Borowiec et al. (2025) publication in Systematic Biology titled "Evaluating UCE Data Adequacy and Integrating Uncertainty in a Comprehensive Phylogeny of Ants". It contains supplementary information and data, including supplementary figures, tables, assembled contig sequences, trimmed and untrimmed locus alignments, concatenated alignments, log and tree files from phylogenetic, simulation, and visualization analyses, and the custom code used.

Description of the data and file structure

Directory structure of supplementary data:

├── Supplementary_figures.pdf # Supplementary figures referred to in the manuscript.
├── Supplementary_tables.xlsx # Supplementary tables referred to in the manuscript.
├── 1-assembled-contigs # Contigs assembled with SPAdes for newly sequenced and previously sequenced (publicly available) samples. It contains two subfolders, previously-sequenced and newly-sequenced, corresponding to contig sequences assembled from publicly available data and from raw sequence data generated for this study, respectively. Each *.contigs.fasta file contains contigs in FASTA format.
├── 2-loci # Single-locus sequence files, including unaligned, aligned untrimmed, and trimmed aligned FASTAs. These files are organized in subdirectories aligned-trimmed, aligned-untrimmed, protein-coding, and unaligned. Within the protein-coding directory, containing protein-coding sequences extracted from UCEs, there are further subdirectories corresponding to amino acid alignments (aa directory), nucleotide alignments (nt directory), and nucleotide alignments with 3rd codon positions removed (nt-no3rd-pos directory). Every file is a FASTA sequence file with a *.fasta extension.
├── 3-loci-trees-and-logs # Individual locus IQ-TREE analysis logs (*.log files) and maximum likelihood trees (*.treefile files).
├── 4-concatenated-alignments # FASTAs (*.fasta), partition files (*.part), and SWSC best scheme (*.best_scheme) files for concatenated alignments.
├── 5-concat-trees-and-logs # Log (*.txt) and NEWICK tree (*.newick) files for all concatenated analyses.
├── 6-consensus # Bootstrap trees (*.newick files except Consensus_tree.newick) in NEWICK format for the consensus tree analysis, as well as the IQ-TREE analysis log (Consensus_log.txt) and output consensus NEWICK tree (Consensus_tree.newick).
├── 7-emp-vs-sim-comparison # Trees, data, and plots comparing empirical and simulated alignments. emp* files were generated using empirical data, while sim* files are based on simulated alignments. all-loci-unpartitioned-gtrfg.treefile represents the tree under which sequences were simulated. *-loci-trees files contain trees in NEWICK format inferred from replicate alignments, following the naming convention condition-matrix_locus_number-loci-trees, so that, for example, the emp-1000-loci-trees file contains the trees inferred from 12 replicates of alignments containing 1,000 empirical loci. The directory also contains *.csv files with the data used to generate plots comparing empirical and simulated replicates. average_bootstrap* files contain average bootstrap values computed on empirical and simulated alignment replicates. *dists* files contain tree distances computed using the Kuhner and Felsenstein (1994) branch score. *RFs* files contain topological distances computed using the Penny and Hendy (1985, originally from Robinson and Foulds 1981) method. The plots corresponding to the data files with bootstrap, branch, and topological distances are available as SVG and PDF files. These were generated with the emp_sym_comparisons.R script, available in directory 11-scripts. Subdirectory replicates contains IQ-TREE logs (*.log) and output trees (*.treefile) for each empirical and simulated data replicate generated and analyzed.
├── 8-simulations # This directory contains simulated alignments (*.fasta) in FASTA format, with numbers (e.g., uce-12065) corresponding to empirical UCE locus alignment numbers.
├── 9-concordance # Concordance factor analyses on empirical and simulated data. The all-loci directory contains analyses of all loci based on unpartitioned analyses, comparing empirical and simulated alignments, while the consensus directory contains the concordance factor analysis based on the consensus tree. IQ-TREE analysis files in both directories are those with *.cf.* in the file name. PDF files correspond to visualizations of various site and gene concordance metrics and their derivatives computed with the concord_stats.R script in 11-scripts. Directories emp_ind_genetrees_GTRFG4 and sim_ind_genetrees_GTRFG4 contain IQ-TREE analysis logs (*.log) and trees (*.treefile) for empirical and simulated loci, respectively, inferred under the GTR+F+G4 model.
├── 10-MCMCTree # Divergence dating analysis files. MCMCTree_creator.R is a script that creates a node-calibrated tree for MCMCTree input using the consensus topology. The output of this script is the MCMCTree_annotated.tre file. Directories root-prior* contain analyses under different root prior settings. Subdirectories within them correspond to analyses under independent and correlated clock models. In each directory, 100best.phy is the input alignment in PHYLIP format, MCMCTree_annotated.tre is the input node-calibrated tree, *.ctl files are input control files for likelihood approximation calculation and divergence dating runs, out* files are output logs/posterior files, and FigTree* files are output trees in NEWICK and PDF formats. The file FigTree-root-prior-129to158-independent.tre corresponds to the chronogram in Figure 6.
├── 11-scripts # Custom scripts used in data analysis and manuscript writing.
│   ├── concord_stats.R # Comparison of concordance between empirical and simulated data.
│   ├── emp_sym_comparisons.R # Plotting of various properties of empirical and simulated data.
│   ├── ILS_compute.R # Checking for incomplete lineage sorting assumptions violations in concordance data.
│   ├── introgression.R # Modified R script from Cunha et al. 2021 (https://doi.org/10.1093/sysbio/syab071) to test for ancient introgression.
│   ├── kinda_date.R # Locus selection for divergence dating.
│   ├── MCMCTree_creator.R # Code for creating node calibrations on consensus tree for divergence estimation.
│   ├── plot_chronograms.R # Chronogram plotting functions.
│   ├── plot_phylograms.R # Phylogram plotting function.
│   └── simulate_alignments.py # Wrapper for simulating alignments using IQ-TREE output, seq-gen, and empirical locus missing data patterns.
├── 12-DiscoVista # Tree files and results of visualizations of support and conflict in gene trees (gene-trees* directories) and concatenated (species-trees* directories) analyses. Directories contain results from all empirical loci (*all* directories) and only empirical loci passing symmetry tests (*symtest* directories). The clade-annotations.txt contains clade definitions for the analysis of support. In each directory, files correspond to intermediate and output files generated by DiscoVista, with results visualized in PDF files.
└── 13-introgression # Files and results of the delta-statistic introgression tests. Three subdirectories, emp_resample, sym_resample, and sim_resample, contain *.cf.stat files with concordance factors calculated from resampling gene trees (with replacement) 2,000 times. They correspond to three datasets containing all empirical loci, only empirical loci passing symmetry tests, and simulated loci, respectively. Similarly, the emp-cf.cf.stat, sym-cf.cf.stat, and sim-cf.cf.stat files contain concordance factor statistics for the corresponding data, generated using IQ-TREE. most_discord.csv is the output file listing nodes with more than 5% discordant gene trees, which were investigated for introgression. Z_score_test.csv contains standardized z-scores of observed deltas relative to the null distribution created by resampling. The test is one-tailed, with CDF_P > 0.95 (p-value < 0.05) taken as evidence of introgression.
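
The one-tailed z-score test described for 13-introgression can be illustrated with a minimal Python sketch. This is not the code used in the study (the analysis itself was run with introgression.R), and several details are assumptions made for illustration: the function name z_score_test, the toy null values, the use of the sample standard deviation, and the standard normal approximation used to obtain CDF_P.

```python
from statistics import NormalDist, mean, stdev


def z_score_test(observed_delta, null_deltas, threshold=0.95):
    """Standardize an observed delta against a resampled null distribution.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    null_mean = mean(null_deltas)
    null_sd = stdev(null_deltas)  # sample standard deviation (assumption)
    if null_sd == 0:
        raise ValueError("null distribution has zero variance")
    z = (observed_delta - null_mean) / null_sd
    cdf_p = NormalDist().cdf(z)  # standard normal CDF evaluated at z
    return {
        "z": z,
        "CDF_P": cdf_p,
        "p_value": 1.0 - cdf_p,  # one-tailed (upper tail)
        "significant": cdf_p > threshold,
    }


# Toy null deltas standing in for values obtained from resampled gene trees.
null = [-0.02, 0.01, 0.00, 0.03, -0.01, 0.02, -0.03, 0.00]
result = z_score_test(0.10, null)
print(result)
```

With these toy values the null distribution has a mean of 0 and a standard deviation of 0.02, so an observed delta of 0.10 sits about five standard deviations above the null mean and exceeds the CDF_P > 0.95 threshold. An observed delta equal to the null mean gives CDF_P = 0.5 and is not flagged.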

Sharing/Access information

This data has also been placed on Zenodo:

https://doi.org/10.5281/zenodo.14503635

Code/Software

The most recent version of the R script kinda_date with documentation can be found on GitHub.
