Data for: Branch length transforms using optimal tree metric matching
Data files
May 08, 2026 version files 7.64 GB
- ASTRALIII-extended.tar.gz (6.10 GB)
- bacterial_dataset.tar.gz (1.59 MB)
- README.md (12.14 KB)
- S200-perturbed.tar.gz (232.84 MB)
- TCMM-ASTRAL-Outlier-Results.tar.gz (556.06 MB)
- TCMM-ASTRAL-SU-Results.tar.gz (289.93 MB)
- TCMM-HGT-SU-Results.tar.gz (124.43 MB)
- TreeShrink_data.tar.gz (75.25 MB)
- WoL.tar.gz (260.09 MB)
Abstract
The abundant discordance between evolutionary relationships across the genome has rekindled interest in methods for comparing and averaging trees on a shared leaf set. However, compared to tree topology, where much progress has been made, handling branch lengths has been more challenging. Species tree branch lengths can be measured in various units, often different from gene trees. Moreover, rates of evolution change across the genome, the species tree, and specific branches of gene trees. These factors compound the stochasticity of coalescence times and estimation noise, making branch lengths highly heterogeneous across the genome. For many downstream applications in phylogenomic analyses, branch lengths are as important as the topology, and yet, existing tools to compare and combine weighted trees are limited. In this paper, we address the question of matching one tree to another, accounting for their branch lengths. We define a series of computational problems called Topology-Constrained Metric Matching (TCMM) that seek to transform the branch lengths of a query tree based on a reference tree. We show that TCMM problems can be solved in quadratic time and memory using a linear algebraic formulation coupled with dynamic programming preprocessing. While many applications can be imagined for this framework, we explore two applications in this paper: embedding leaves of gene trees in Euclidean space to find outliers potentially indicative of estimation errors and summarizing gene tree branch lengths onto the species tree. In these applications, our method, when paired with existing methods, increases their accuracy at limited computational expense.
Branch Length Estimation:
1) S100 Dataset (Tabatabaee et al. 2023):
The original dataset is available at https://github.com/ytabatabaee/CASTLES-paper. We provide the results of our experiments in the TCMM-ASTRAL-SU-Results.tar.gz archive.
Here is the description of each file in this directory:
- TCMM-ASTRAL-SU-Results/*/truegenetrees: True gene trees for each replicate.
- TCMM-ASTRAL-SU-Results/*/fasttree_genetrees_[length]_non: Gene trees estimated from [length] bp-long sequences.
- TCMM-ASTRAL-SU-Results/*/castlespro_truegenetrees_s_tree.trees: True species tree with SU (substitution unit) branch lengths assigned by CASTLES-Pro.
- TCMM-ASTRAL-SU-Results/*/castlespro_fasttree_genetrees_[length]_non_s_tree.trees: Species tree estimated from TCMM-ASTRAL-SU-Results/*/fasttree_genetrees_[length]_non, with SU branch lengths assigned by CASTLES-Pro.
- TCMM-ASTRAL-SU-Results/*/TCMM_castlespro_truegenetrees_s_tree_lam_[lambda].trees: True species tree with SU branch lengths assigned by TCMM (lambda = [lambda]). The input species tree and gene trees to TCMM are TCMM-ASTRAL-SU-Results/*/castlespro_truegenetrees_s_tree.trees and TCMM-ASTRAL-SU-Results/*/truegenetrees, respectively.
- TCMM-ASTRAL-SU-Results/*/TCMM_castlespro_truegenetrees_s_tree_lam_best.trees: True species tree with SU branch lengths assigned by TCMM (automatic lambda selection). The input species tree and gene trees to TCMM are TCMM-ASTRAL-SU-Results/*/castlespro_truegenetrees_s_tree.trees and TCMM-ASTRAL-SU-Results/*/truegenetrees, respectively.
- TCMM-ASTRAL-SU-Results/*/TCMM_castlespro_fasttree_genetrees_[length]_non_s_tree_lam_[lambda].trees: Estimated species tree with SU branch lengths assigned by TCMM (lambda = [lambda]). The input species tree and gene trees to TCMM are TCMM-ASTRAL-SU-Results/*/castlespro_fasttree_genetrees_[length]_non_s_tree.trees and TCMM-ASTRAL-SU-Results/*/fasttree_genetrees_[length]_non, respectively.
- TCMM-ASTRAL-SU-Results/*/TCMM_castlespro_fasttree_genetrees_[length]_non_s_tree_lam_best.trees: Estimated species tree with SU branch lengths assigned by TCMM (automatic lambda selection). The input species tree and gene trees to TCMM are TCMM-ASTRAL-SU-Results/*/castlespro_fasttree_genetrees_[length]_non_s_tree.trees and TCMM-ASTRAL-SU-Results/*/fasttree_genetrees_[length]_non, respectively.
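The TCMM output file names above encode the gene tree source, the sequence length, and the lambda setting. A minimal sketch of recovering those fields from a file name is given below. The helper name `parse_tcmm_name` and the example values `200` and `0.1` are hypothetical and are not taken from the archive; the pattern simply mirrors the naming scheme described above.

```python
import re

# Pattern mirroring the naming scheme described in this section.
# [length] and [lambda] are captured as raw strings; "best" marks automatic selection.
_TCMM_NAME = re.compile(
    r"^TCMM_castlespro_"
    r"(?P<source>truegenetrees|fasttree_genetrees_(?P<length>[^_]+)_non)"
    r"_s_tree_lam_(?P<lam>.+)\.trees$"
)

def parse_tcmm_name(filename):
    """Return the fields encoded in a TCMM result file name, or None if it does not match."""
    match = _TCMM_NAME.match(filename)
    if match is None:
        return None
    is_true = match.group("source") == "truegenetrees"
    return {
        "gene_trees": "true" if is_true else "estimated",
        "length": match.group("length"),  # None for true gene trees
        "lambda": match.group("lam"),     # e.g. a number as text, or "best"
    }

# Hypothetical file names used only for illustration.
print(parse_tcmm_name("TCMM_castlespro_truegenetrees_s_tree_lam_best.trees"))
print(parse_tcmm_name("TCMM_castlespro_fasttree_genetrees_200_non_s_tree_lam_0.1.trees"))
```

Files that do not follow the TCMM naming scheme, such as the CASTLES-Pro inputs, yield `None`, which makes it easy to separate inputs from outputs when walking a replicate directory.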
2) S100 Dataset (Extended model conditions):
This dataset extends the S100 dataset with additional model conditions. It contains six model conditions (ASTRALIII-lowILS, ASTRALIII-medILS, ASTRALIII-hiILS, ASTRALIII-medILSHighHet, ASTRALIII-medILSHighHetHg0, and ASTRALIII-medILSHighHetHg2.5). All model conditions are in the ASTRALIII-extended.tar.gz archive.
Here is the description of each file in this directory:
- ASTRALIII-extended/[model]/[model].command: The SimPhy command used to simulate the dataset.
- ASTRALIII-extended/[model]/[model].params: The parameters used in the SimPhy simulation.
- ASTRALIII-extended/[model]/*/s_tree.trees: True species tree in substitution units.
- ASTRALIII-extended/[model]/*/truegenetrees: True gene trees.
- ASTRALIII-extended/[model]/*/castles-pro_truegenetrees_s_tree.trees: CASTLES-Pro species tree with estimated branch lengths in substitution units.
- ASTRALIII-extended/[model]/*/erable_s_tree.trees.length.nwk: ERaBLE species tree with estimated branch lengths in substitution units.
- ASTRALIII-extended/[model]/*/fastme-2.1.6.2_s_tree.trees: FastME species tree with estimated branch lengths in substitution units.
- ASTRALIII-extended/[model]/*/TCMM_castlespro/TCMM_castlespro_truegenetrees_lam_[lambda].trees: TCMM species tree with estimated branch lengths in substitution units for each gene tree, using lambda = [lambda].
- ASTRALIII-extended/[model]/*/TCMM_castlespro/trunc_jenks_mean_castles_pro_results_lam_[lambda].trees: TCMM species tree with estimated branch lengths in substitution units after outlier removal and summarization, for lambda = [lambda].
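As a quick sanity check when comparing the true species tree with the CASTLES-Pro, ERaBLE, FastME, and TCMM estimates, one can compare total branch lengths. The sketch below uses only the standard library and rests on assumptions not stated in this README: each file holds one Newick tree per line, and labels contain no colons, quotes, or bracketed comments. The function names are hypothetical; a dedicated phylogenetics library is the better choice for anything beyond a rough check.

```python
import re

# Matches the numeric branch length that follows each ':' in a Newick string.
# Assumes labels contain no ':' characters (an assumption, not a guarantee).
_BRANCH_LENGTH = re.compile(r":\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")

def total_branch_length(newick):
    """Sum all branch lengths found in one Newick string."""
    return sum(float(value) for value in _BRANCH_LENGTH.findall(newick))

def read_tree_lengths(path):
    """Total branch length of each non-empty line, assuming one tree per line."""
    with open(path) as handle:
        return [total_branch_length(line) for line in handle if line.strip()]

# Toy tree for illustration only; it is not taken from the dataset.
print(total_branch_length("((A:0.1,B:0.2):0.05,C:0.3);"))
```

Internal node labels or support values that precede the colon are ignored, since only the number after each colon is read.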
3) HGT Dataset:
The results for the simulated dataset are provided in the TCMM-HGT-SU-Results.tar.gz archive.
Here is the description of each file in this directory:
- TCMM-HGT-SU-Results/[model]/*/estimatedgenetre: Estimated gene trees.
- TCMM-HGT-SU-Results/[model]/*/castlespro_estimatedgenetre_s_tree.trees: Species tree with SU branch lengths assigned by CASTLES-Pro.
- TCMM-HGT-SU-Results/[model]/*/TCMM_castlespro_estimatedgenetre_lam_[lambda].trees: Species tree with SU branch lengths assigned by TCMM (lambda = [lambda]). The input species tree and gene trees to TCMM are TCMM-HGT-SU-Results/[model]/*/castlespro_estimatedgenetre_s_tree.trees and TCMM-HGT-SU-Results/[model]/*/estimatedgenetre, respectively.
- TCMM-HGT-SU-Results/[model]/*/TCMM_castlespro_estimatedgenetre_lam_best.trees: Species tree with SU branch lengths assigned by TCMM (automatic lambda selection). The input species tree and gene trees to TCMM are TCMM-HGT-SU-Results/[model]/*/castlespro_estimatedgenetre_s_tree.trees and TCMM-HGT-SU-Results/[model]/*/estimatedgenetre, respectively.
4) Bacterial Dataset (Moody et al. 2022):
The results for the prokaryotic dataset are provided in the bacterial_dataset.tar.gz archive. The original dataset can be found at https://doi.org/10.6084/m9.figshare.13395470.
Here is the description of each file in this directory:
- bacterial_dataset/core_genes.tre: Estimated core gene trees.
- bacterial_dataset/non_ribosomal_genes.tre: Estimated non-ribosomal gene trees.
- bacterial_dataset/castles_pro_core_genes.tre: Estimated species tree from core gene trees with SU branch lengths assigned by CASTLES-Pro.
- bacterial_dataset/castles_pro_non_ribosomal.tre: Estimated species tree from non-ribosomal gene trees with SU branch lengths assigned by CASTLES-Pro.
- bacterial_dataset/per_gene_castles_pro_core_genes_lam_[lambda].trees: Estimated species tree from core gene trees with SU branch lengths assigned by TCMM (lambda = [lambda]). The input species tree and gene trees to TCMM are bacterial_dataset/castles_pro_core_genes.tre and bacterial_dataset/core_genes.tre, respectively.
- bacterial_dataset/per_gene_castles_pro_non_ribosomal_genes_lam_[lambda].trees: Estimated species tree from non-ribosomal gene trees with SU branch lengths assigned by TCMM (lambda = [lambda]). The input species tree and gene trees to TCMM are bacterial_dataset/castles_pro_non_ribosomal.tre and bacterial_dataset/non_ribosomal_genes.tre, respectively.
5) WoL Dataset (Zhu et al. 2019):
The results for the WoL dataset are provided in the WoL.tar.gz archive. The original dataset can be found at https://github.com/biocore/wol/tree/master/data.
Here is the description of each file in this directory:
- WoL/castles_pro_wol.tre: Estimated species tree with SU branch lengths assigned by CASTLES-Pro.
- WoL/genetrees/p[geneID].nwk: Estimated individual gene trees.
- WoL/TCMM-results/p[geneID]_lam_[lambda].trees: Estimated species tree with SU branch lengths assigned by TCMM (lambda = [lambda]). The input species tree and gene trees to TCMM are WoL/castles_pro_wol.tre and WoL/genetrees/p[geneID].nwk, respectively.
Outlier Detection:
1) S200-perturbed Dataset:
The outlier detection results for this dataset are provided in the S200-perturbed.tar.gz archive. Here is a description of each file in this directory:
- S200-perturbed/true_outliers.txt: The ground truth outliers for this dataset. Columns are replicate, gene tree number, and outlier species, respectively.
- S200-perturbed/outliers_gene_vs_gene_k1.5.csv: The outliers reported by PhylteR and TCMM+PhylteR for k = 1.5 (the value used in the paper). The columns of this file show replicate number, gene tree number (with 1 being the first gene tree), outlier species, lambda, and outlier detection threshold k.
- S200-perturbed/TreeShrink_outliers.txt: The outliers reported by TreeShrink. Columns are replicate number, lambda, gene tree number, outlier species, and p-value. We only considered rows with p-value < 0.05 as outliers.
- S200-perturbed/*/s_tree.tree: True species tree in units of number of generations.
- S200-perturbed/*/truegenetrees_100: True gene trees (the first 100 from the original dataset).
- S200-perturbed/*/modifiedTrees_100: Modified gene trees after introducing outliers.
- S200-perturbed/*/modifiedTrees_100_lam_[lambda].trees: TCMM output on modified gene trees (lambda = [lambda]). Note that [lambda] = true denotes the original gene trees before running TCMM.
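Since the ground truth and the reported outliers share the fields replicate, gene tree number, and outlier species, they can be compared as sets of triples. The sketch below computes precision and recall under that representation. It is an illustration only and not necessarily the evaluation protocol used in the paper; the function name and the toy records are hypothetical, and loading the files is left out because their delimiters and headers are not specified here.

```python
def precision_recall(detected, truth):
    """Precision and recall of detected outliers against ground truth.

    Both arguments are iterables of (replicate, gene_tree_number, species) triples.
    """
    detected, truth = set(detected), set(truth)
    true_positives = len(detected & truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical toy records, not taken from the dataset.
truth = {("01", 3, "sp12"), ("01", 7, "sp40")}
detected = {("01", 3, "sp12"), ("01", 9, "sp05")}
print(precision_recall(detected, truth))
```

When evaluating a single setting, filter the reported rows to one lambda and one k before building the `detected` set, so that results from different settings are not pooled.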
2) S100 Dataset (Tabatabaee et al. 2023):
The outlier detection results for the S100 dataset are provided in the TCMM-ASTRAL-Outlier-Results.tar.gz archive. Here is a description of each file in this directory:
- TCMM-ASTRAL-Outlier-Results/*/truegenetrees_100: The first 100 true gene trees.
- TCMM-ASTRAL-Outlier-Results/*/fasttree_genetrees_[length]_non_100: The first 100 gene trees estimated from [length] bp-long sequences.
- TCMM-ASTRAL-Outlier-Results/*/truegenetrees_100_lam_[lambda].trees: True gene trees modified by TCMM (lambda = [lambda]). The input gene trees to TCMM are TCMM-ASTRAL-SU-Results/*/outlier_detection/truegenetrees_100.
- TCMM-ASTRAL-Outlier-Results/*/fasttree_genetrees_[length]_non_100_lam_[lambda].trees: Estimated gene trees modified by TCMM (lambda = [lambda]). The input gene trees to TCMM are TCMM-ASTRAL-SU-Results/*/outlier_detection/fasttree_genetrees_[length]_non_100.
- TCMM-ASTRAL-Outlier-Results/*/TCMM_outliers_truegenetrees.csv: Detected outliers for the modified true gene trees by TCMM. Columns are outlier species, gene tree number (starting from 1), lambda, outlier detection threshold k, and replicate number.
- TCMM-ASTRAL-Outlier-Results/*/TCMM_outliers_fasttreegenetrees.csv: Detected outliers for the modified estimated gene trees by TCMM. Columns are outlier species, gene tree number (starting from 1), lambda, outlier detection threshold k, replicate number, and gene sequence length.
- TCMM-ASTRAL-Outlier-Results/*/TreeShrink_outliers_truegenetrees.csv: Detected outliers for the modified true gene trees by TreeShrink. Columns are replicate number, lambda, gene tree number, outlier species, and p-value. We only considered rows with p-value < 0.05 as outliers.
- TCMM-ASTRAL-Outlier-Results/*/TreeShrink_outliers_fasttreegenetrees.csv: Detected outliers for the modified estimated gene trees by TreeShrink. Columns are replicate number, lambda, gene sequence length, gene tree number, outlier species, and p-value. We only considered rows with p-value < 0.05 as outliers.
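The TreeShrink files list candidate outliers together with a p-value in the last column, and only rows with p-value < 0.05 count as outliers. A minimal sketch of that filter follows. It assumes comma-separated rows, which is not confirmed by this README; pass a different `delimiter` if the files use tabs or spaces. The function name and the sample rows are hypothetical.

```python
import csv
import io

def significant_rows(lines, alpha=0.05, delimiter=","):
    """Keep rows whose last column is a p-value strictly below alpha."""
    kept = []
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue
        try:
            p_value = float(row[-1])
        except ValueError:
            continue  # skip a header line or a malformed row
        if p_value < alpha:
            kept.append(row)
    return kept

# Hypothetical sample in the column order: replicate, lambda, gene tree, species, p-value.
sample = io.StringIO("01,0.1,5,sp3,0.01\n01,0.1,8,sp9,0.20\n")
print(significant_rows(sample))
```

To read a real file, pass an open file handle in place of the `io.StringIO` object.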
3) TreeShrink Dataset (Mai and Mirarab 2018):
The original dataset can be found at https://github.com/uym2/TreeShrink. The results of TCMM outlier detection can be found in the TreeShrink_data.tar.gz archive. This directory contains six biological datasets: TreeShrink_data/Plants, TreeShrink_data/Mammals, TreeShrink_data/Frogs, TreeShrink_data/Insects, TreeShrink_data/XenRouse, and TreeShrink_data/XenCannon. Here is a description of each file in these directories:
- TreeShrink_data/[data]/unfiltered.trees: Estimated gene trees.
- TreeShrink_data/[data]/gene_vs_gene_unfiltered_lam_[lambda].trees: Estimated gene trees modified by TCMM (lambda = [lambda]). The input gene trees to TCMM are TreeShrink_data/[data]/unfiltered.trees.
- TreeShrink_data/[data]/TCMM_outliers.csv: Detected outliers for the modified estimated gene trees by TCMM. Columns are outlier species, gene tree number, lambda, and outlier detection threshold k.
- TreeShrink_data/[data]/TreeShrink_outliers.txt: Detected outliers for the modified estimated gene trees by TreeShrink. Columns are lambda, gene tree number, outlier species, and p-value. We only considered rows with p-value < 0.05 as outliers.
