Data from: Evaluating UCE data adequacy and integrating uncertainty in a comprehensive phylogeny of ants
Data files
Dec 28, 2024 version files 6.44 GB
- 1-assembled-contigs.zip (4.05 GB)
- 10-MCMCTree.zip (855.24 MB)
- 11-scripts.zip (14.19 KB)
- 12-DiscoVista.zip (200.25 MB)
- 13-introgression.zip (48.03 MB)
- 2-loci.zip (562.60 MB)
- 3-loci-trees-and-logs.zip (47.04 MB)
- 4-concatenated-alignments.zip (161.12 MB)
- 5-concat-trees-and-logs.zip (56.42 MB)
- 6-consensus.zip (7.40 MB)
- 7-emp-vs-sim-comparison.zip (242.89 MB)
- 8-simulations.zip (27.61 MB)
- 9-concordance.zip (180.03 MB)
- README.md (9 KB)
Abstract
While some relationships in phylogenomic studies have remained stable since the Sanger sequencing era, many challenging nodes remain, even with genome-scale data. Incongruence or lack of resolution in the phylogenomic era is frequently attributed to inadequate data modeling and analytical issues that lead to systematic biases. However, few studies investigate the potential for random error, establish expectations for the level of resolution achievable with a given empirical dataset, or integrate uncertainties across methods when faced with conflicting results.
Ants are the most species-rich lineage of social insects and among the most ecologically important groups of terrestrial animals. Consequently, ants have garnered significant research attention, including work on their systematics. Despite this, there has been no comprehensive genus-level phylogeny of the ants inferred using genomic data that thoroughly evaluates both signal strength and incongruence. In this study, we provide insight into and quantify uncertainty across the ant tree of life by utilizing the most taxonomically comprehensive Ultraconserved Elements dataset of ants to date, including 277 (81%) of the recognized ant genera from all 16 extant subfamilies, and representing over 98% of described species. We use simulations to establish expectations for resolution, identify branches with less-than-expected concordance, and dissect the effects of data and model selection on recalcitrant nodes.
Simulations show that hundreds of loci are needed to resolve recalcitrant nodes on our genus-level ant phylogeny. This demonstrates the continued role of random error in phylogenomic studies. Our analyses provide a comprehensive picture of support and incongruence across the ant phylogeny, while offering a more nuanced depiction of uncertainty and significantly expanding generic sampling. We use a consensus approach to integrate uncertainty across different analyses and find that assumptions about root age exert substantial influence on divergence dating. Our results suggest that advancing the understanding of ant phylogeny will require not only more data but also more refined phylogenetic models. We also provide a workflow for identifying under-supported nodes in concatenation analyses, outline a pragmatic way to reconcile conflicting results in phylogenomics, and introduce a user-friendly locus selection tool for divergence dating.
README: Evaluating UCE Data Adequacy and Integrating Uncertainty in a Comprehensive Phylogeny of Ants
https://doi.org/10.5061/dryad.547d7wmhb
This dataset is associated with the Borowiec et al. (2025) publication in Systematic Biology titled "Evaluating UCE Data Adequacy and Integrating Uncertainty in a Comprehensive Phylogeny of Ants". It contains supplementary information and data, including supplementary figures, tables, assembled contig sequences, trimmed and untrimmed locus alignments, concatenated alignments, log and tree files of phylogenetic, simulation, and visualization analyses, and the custom code used.
Description of the data and file structure
Directory structure of supplementary data:
├── Supplementary_figures.pdf
# Supplementary figures referred to in the manuscript.
├── Supplementary_tables.xlsx
# Supplementary tables referred to in the manuscript.
├── 1-assembled-contigs
# Assembled SPAdes contigs for newly sequenced and previously sequenced (publicly available) samples. It contains two subfolders, previously-sequenced and newly-sequenced, corresponding to contig sequences assembled from publicly available data and raw sequence data generated for this study, respectively. Each *.contigs.fasta file contains contigs in FASTA format.
├── 2-loci
# Single-locus sequence files, including unaligned, aligned untrimmed, and trimmed aligned FASTAs. These files are organized in the subdirectories aligned-trimmed, aligned-untrimmed, protein-coding, and unaligned. The protein-coding directory, which contains protein-coding sequences extracted from UCEs, has further subdirectories corresponding to amino acid alignments (aa directory), nucleotide alignments (nt directory), and nucleotide alignments with 3rd codon positions removed (nt-no3rd-pos directory). Every file is a FASTA sequence file with a *.fasta extension.
├── 3-loci-trees-and-logs
# Individual locus IQ-TREE analysis logs (*.log files) and maximum likelihood trees (*.treefile files).
├── 4-concatenated-alignments
# FASTAs (*.fasta), partition files (*.part), and SWSC best scheme (*.best_scheme) files for concatenated alignments.
├── 5-concat-trees-and-logs
# Log (*.txt) and NEWICK tree (*.newick) files for all concatenated analyses.
├── 6-consensus
# Bootstrap trees (*.newick files except Consensus_tree.newick) in NEWICK format for the consensus tree analysis, as well as the IQ-TREE analysis log (Consensus_log.txt) and output consensus NEWICK tree (Consensus_tree.newick).
├── 7-emp-vs-sim-comparison
# Trees, data, and plots comparing empirical and simulated alignments. emp* files were generated using empirical data, while sim* files are based on simulated alignments. all-loci-unpartitioned-gtrfg.treefile represents the tree under which sequences were simulated. *-loci-trees files contain trees in NEWICK format inferred from replicate alignments, following the naming convention condition-matrix_locus_number-loci-trees; for example, the emp-1000-loci-trees file contains the trees inferred from 12 replicates of alignments containing 1,000 empirical loci. The directory also contains *csv files with the data used to generate plots comparing empirical and simulated replicates. average_bootstrap* files contain average bootstrap values computed on empirical and simulated alignment replicates. *dists* files contain tree distances computed using the Kuhner and Felsenstein (1994) branch score. *RFs* files contain topological distances computed using the Penny and Hendy (1985, originally from Robinson and Foulds 1981) method. The plots corresponding to the data files with bootstrap, branch, and topological distances are available as SVG and PDF files. These were generated with the emp_sym_comparisons.R script, available in the 11-scripts directory. The replicates subdirectory contains IQ-TREE logs (*.log) and output trees (*.treefile) for each empirical and simulated data replicate generated and analyzed.
├── 8-simulations
# This directory contains simulated alignments (*.fasta) in FASTA format, with numbers (e.g., uce-12065) corresponding to empirical UCE locus alignment numbers.
├── 9-concordance
# Concordance factor analyses on empirical and simulated data. The all-loci directory contains analyses of all loci based on unpartitioned analyses, comparing empirical and simulated alignments, while the consensus directory contains the concordance factor analysis based on the consensus tree. IQ-TREE analysis files in both directories are those with *.cf.* in the file name. PDF files correspond to visualizations of various site and gene concordance metrics and their derivatives, computed with the concord_stats.R script in 11-scripts. The directories emp_ind_genetrees_GTRFG4 and sim_ind_genetrees_GTRFG4 contain IQ-TREE analysis logs (*.log) and trees (*.treefile) for empirical and simulated loci, respectively, inferred under the GTR+F+G4 model.
├── 10-MCMCTree
# Divergence dating analysis files. MCMCTree_creator.R is a script that creates a node-calibrated tree, based on the consensus topology, for use as MCMCTree analysis input. The output of this script is the MCMCTree_annotated.tre file. The root-prior* directories contain analyses under different root prior settings. Subdirectories within them correspond to analyses under independent and correlated clock models. In each directory, 100best.phy is the input alignment in PHYLIP format, MCMCTree_annotated.tre is the input node-calibrated tree, *.ctl files are input control files for likelihood approximation calculation and divergence dating runs, out* files are output logs/posterior files, and FigTree* files are output trees in NEWICK and PDF formats. The file FigTree-root-prior-129to158-independent.tre corresponds to the chronogram in Figure 6.
├── 11-scripts
# Custom scripts used in data analysis and manuscript writing.
│   ├── concord_stats.R
│   # Comparison of concordance between empirical and simulated data.
│   ├── emp_sym_comparisons.R
│   # Plotting of various properties of empirical and simulated data.
│   ├── ILS_compute.R
│   # Checking for violations of incomplete lineage sorting assumptions in concordance data.
│   ├── introgression.R
│   # Modified R script from Cunha et al. 2021 (https://doi.org/10.1093/sysbio/syab071) to test for ancient introgression.
│   ├── kinda_date.R
│   # Locus selection for divergence dating.
│   ├── MCMCTree_creator.R
│   # Code for creating node calibrations on consensus tree for divergence estimation.
│   ├── plot_chronograms.R
│   # Chronogram plotting functions.
│   ├── plot_phylograms.R
│   # Phylogram plotting function.
│   └── simulate_alignments.py
│   # Wrapper for simulating alignments using IQ-TREE output, seq-gen, and empirical locus missing data patterns.
├── 12-DiscoVista
# Tree files and results of visualizations of support and conflict in gene tree (gene-trees* directories) and concatenated (species-trees* directories) analyses. Directories contain results from all empirical loci (*all* directories) and only empirical loci passing symmetry tests (*symtest* directories). The clade-annotations.txt file contains clade definitions for the analysis of support. In each directory, files correspond to intermediate and output files generated by DiscoVista, with results visualized in PDF files.
└── 13-introgression
# Files and results of delta-statistic/introgression test analyses. Three subdirectories, emp_resample, sym_resample, and sim_resample, contain files with the *.cf.stat extension, which hold concordance factors calculated from resampling gene trees (with replacement) 2,000 times. They correspond to three datasets containing all empirical loci, only empirical loci passing symmetry tests, and simulated loci, respectively. Similarly, the emp-cf.cf.stat, sym-cf.cf.stat, and sim-cf.cf.stat files contain concordance factor statistics for the corresponding data, generated using IQ-TREE. most_discord.csv is the output file listing nodes with more than 5% discordant gene trees, which were investigated for introgression. The Z_score_test.csv file contains standardized z-scores of observed deltas relative to the null distribution created with the resampling. The test is one-tailed, with CDF_P>0.95 (pvalue<0.05) taken as evidence of introgression.
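The one-tailed z-score test described for Z_score_test.csv can be illustrated with a minimal Python sketch. This is not the deposited introgression.R code: the function name `z_score_test`, its parameters, and the use of a standard normal CDF to obtain CDF_P are assumptions made for illustration, and the deposited script may compute these quantities differently.

```python
from statistics import NormalDist, mean, stdev


def z_score_test(observed_delta, null_deltas, alpha=0.05):
    """Standardize an observed delta against a resampled null distribution.

    Hypothetical helper: assumes the null deltas come from the gene tree
    resampling and that CDF_P is the standard normal CDF of the z-score.
    """
    mu = mean(null_deltas)
    sigma = stdev(null_deltas)  # sample standard deviation of the null
    if sigma == 0:
        raise ValueError("null distribution has zero variance")
    z = (observed_delta - mu) / sigma
    cdf_p = NormalDist().cdf(z)  # probability mass below the observed z
    return {
        "z": z,
        "cdf_p": cdf_p,
        "p_value": 1.0 - cdf_p,  # one-tailed (upper tail) p-value
        "significant": cdf_p > 1.0 - alpha,
    }


if __name__ == "__main__":
    # Toy numbers only, not values from the dataset.
    print(z_score_test(0.3, [-0.1, 0.0, 0.1]))
```

With the default `alpha=0.05`, the `significant` flag corresponds to the CDF_P>0.95 criterion stated above.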
Sharing/Access information
This data has also been placed on Zenodo:
Code/Software
The most recent version of the R script kinda_date with documentation can be found on GitHub.
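When working with the replicate tree files in 7-emp-vs-sim-comparison, the naming convention condition-matrix_locus_number-loci-trees can be parsed programmatically. The sketch below is a hypothetical convenience helper, not part of 11-scripts; the names `ReplicateTreeFile` and `parse_replicate_tree_name` are illustrative, and it assumes the condition prefix is either emp or sim, as in the emp* and sim* files described above.

```python
import re
from typing import NamedTuple


class ReplicateTreeFile(NamedTuple):
    condition: str  # "emp" (empirical) or "sim" (simulated)
    n_loci: int     # number of loci in each replicate alignment


# Assumed pattern for names such as "emp-1000-loci-trees".
_NAME_PATTERN = re.compile(r"^(emp|sim)-(\d+)-loci-trees$")


def parse_replicate_tree_name(name: str) -> ReplicateTreeFile:
    """Split a *-loci-trees file name into its condition and locus count."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return ReplicateTreeFile(match.group(1), int(match.group(2)))


if __name__ == "__main__":
    print(parse_replicate_tree_name("emp-1000-loci-trees"))
```

Names that do not follow the convention raise a `ValueError` rather than being silently skipped, which makes unexpected files in the directory easy to spot.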