Data from: Accelerating maximum likelihood phylogenetic inference via early stopping to evade (over-)optimization
Data files
Jun 30, 2025 version files 421.38 MB
- README.md (22.99 KB)
- stopping_criteria_data.tar.gz (421.36 MB)
Abstract
Maximum Likelihood (ML) based phylogenetic inference constitutes a challenging optimization problem. Given a set of aligned input sequences, phylogenetic inference tools strive to determine the tree topology, the branch-lengths, and the evolutionary model parameters that maximize the phylogenetic likelihood function. However, there exist compelling reasons to not push optimization to its limits, by means of early, yet adequate stopping criteria. Since input sequences are typically subject to stochastic and systematic noise, caution is warranted to prevent over-optimization and the risk of overfitting the model to noisy data. To address this, we integrate the Kishino-Hasegawa (KH) test into RAxML-NG as a reliable and fast-to-compute Early Stopping criterion to effectively limit excessive and compute-intensive over-optimization. Initially, we introduce a simplified heuristic tree search strategy in RAxML-NG (sRAxML-NG) as an underlying method for Early Stopping. Subsequently, we use the KH test in combination with sRAxML-NG, to statistically assess the significance of differences between intermediate trees prior to and after major optimization steps. The tree search terminates early when improvements are statistically insignificant. We also propose an extension to the standard KH test that allows correcting for multiple testing, which maintains accuracy while achieving even higher speedups. For benchmarking we use 300 large representative empirical datasets from TreeBASE. For 98% of the DNA datasets, all Early Stopping methods we introduce infer trees that are statistically equivalent to those inferred from RAxML-NG v1.2. For AA datasets, the fraction of datasets where sRAxML-NG, KH, and the KH-multiple testing versions infer statistically equivalent trees is 96%, 95%, and 92%, respectively. In conjunction with sRAxML-NG, the average speedup achieved by the KH-multiple testing version is 5x for DNA and 3.9x for protein datasets compared to RAxML-NG v1.2.
We implemented our stopping criteria in RAxML-NG, which is available under GNU GPL at https://github.com/togkousa/raxml-ng/tree/stopping-criteria.
This repository contains the datasets used in our manuscript:
- Anastasis Togkousidis, Alexandros Stamatakis, Olivier Gascuel, Accelerating Maximum Likelihood Phylogenetic Inference via Early Stopping to Evade (Over-)optimization, Systematic Biology, 2025;, syaf043, https://doi.org/10.1093/sysbio/syaf043
The study compares Early Stopping (ES) methods in Maximum Likelihood (ML) phylogenetic tree inference against standard RAxML-NG v1.2. The ES versions are implemented as separate versions within RAxML-NG:
- Simplified RAxML-NG (sRAxML-NG)
- KH version, i.e., RAxML-NG using the KH test (Kishino & Hasegawa, 1989)
- KH-multiple testing version (KH version with multiple testing correction)
The repository includes:
- 300 large empirical MSAs
- 1,076 simulated MSAs
Datasets are organized into three main subfolders: empirical-long/, Simulated/, and unsuccessful_MSAs/.
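As a minimal sketch of how the folder layout above could be traversed after unpacking stopping_criteria_data.tar.gz, the following Python helper lists the MSA subfolders found under each parent folder. The extraction root passed to list_msa_dirs and the function name itself are assumptions made for illustration; only the folder names are taken from this README.

```python
from pathlib import Path

# Parent folders named in this README; the extraction root is an assumption.
MSA_PARENTS = (
    "empirical-long/dna_long_empirical",
    "empirical-long/aa_long_empirical",
    "Simulated",
    "unsuccessful_MSAs/dna_long_empirical",
    "unsuccessful_MSAs/aa_long_empirical",
)


def list_msa_dirs(root):
    """Map each existing parent folder to the sorted names of its MSA subfolders."""
    root = Path(root)
    found = {}
    for parent in MSA_PARENTS:
        folder = root / parent
        if folder.is_dir():
            # Only directories count as MSA subfolders; summary CSV files are skipped.
            found[parent] = sorted(p.name for p in folder.iterdir() if p.is_dir())
    return found
```

Parent folders that are missing under the given root are simply left out of the result, so the helper can also be pointed at a partial extraction.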
Empirical Datasets
empirical-long/: Contains 300 large empirical MSAs, sampled from the TreeBASE database (Piel, 2000). Details regarding the sampling criteria are provided in the main text. The MSAs are divided into two subfolders, named dna_long_empirical/ and aa_long_empirical/, corresponding to the 222 DNA and 88 amino-acid (AA) MSAs. Each MSA is stored in its own subfolder (within dna_long_empirical/ or aa_long_empirical/), and the name of the subfolder is the respective MSA code on TreeBASE. Each MSA-subfolder contains the following key-files:
- alignment.phy: The empirical MSA in PHYLIP format (DNA or AA).
- difficulty: The Pythia score (Haag, 2022), storing the predicted difficulty of the MSA.
- pars_* and rand_*: Files related to parsimony and random starting tree executions, respectively.
- {st_tree_type}_{version}.raxml.log: RAxML-NG execution log files. {st_tree_type} indicates the starting tree type: either pars_ (parsimony) or rand_ (random). {version} indicates the RAxML-NG version used: standard, simplified, KH or KH-mult. For each version, we conduct 10 independent ML tree inferences, starting from either 10 parsimony (pars_* files), or 10 random (rand_* files) starting trees.
- pars_consel.out and rand_consel.out: CONSEL (Shimodaira, 2001) log files, reporting the results of plausibility tests.
- pars_rfs_and_runtimes.csv and rand_rfs_and_runtimes.csv: Intermediate data tables generated by the Snakemake pipeline (see Snakemake files), which report: (a) the RF distance (Robinson & Foulds, 1981) between the best (out of 10) ML tree inferred by each version, and the reference tree, and (b) The runtime (in seconds) of each version, to complete 10 independent ML tree inferences. On empirical MSAs, the reference tree is considered to be the best-ML tree inferred via standard RAxML-NG. Therefore, all RF entries on rows referring to standard RAxML-NG are 0. The runtime and RF-distance values are used in our analysis to construct speedup and RF-distance distributions, for benchmark assessment. The data table comprises the following columns:
  - version: The version used to infer the corresponding best ML tree, which is one of the following: standard, simplified, KH or KH-mult.
  - RF: RF distance between the best-ML tree found by the corresponding version, and the reference tree.
  - Runtime: Runtime (in seconds) of each version to conduct 10 ML tree inferences.
- pars_summary.parquet and rand_summary.parquet: Intermediate data-tables, generated by the Snakemake pipeline (see Snakemake files), to summarize the dataset-specific results. The data-tables comprise the following columns:
  - newick: Inferred ML trees in Newick format.
  - logLikelihood: The log-likelihood score of the corresponding ML tree.
  - isBest: Indicates whether the corresponding tree is the highest scoring tree, according to its log-likelihood score. Only one entry is True, the others are False.
  - pKH, pWKH, pSH, pWSH, pAU: The p-values of the statistical tests: KH, w-KH, SH, w-SH, and AU, respectively. These values are extracted from the CONSEL log files.
  - pKH_significant, pWKH_significant, pSH_significant, pWSH_significant, pAU_significant: Indicate whether the corresponding ML tree is significant (plausible), under the corresponding statistical test, i.e., whether its p-value is ≥ 0.05.
  - plausible: Indicates whether the corresponding ML tree is plausible (True) or not (False). Plausibility is assessed based on the pAU score, i.e., a tree is plausible if pAU ≥ 0.05.
  - version: Shows the version which inferred the corresponding ML tree, which is one of the following: standard, simplified, KH or KH-mult.
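The isBest and plausible columns described above follow simple rules that can be sketched in a few lines of Python. This is an illustration, not the code of the Snakemake pipeline: the function name annotate, the representation of each tree as a dict, and the handling of ties in the log-likelihood (the first maximum wins) are assumptions.

```python
ALPHA = 0.05  # threshold stated in the column descriptions


def annotate(trees):
    """Add isBest and plausible flags to a list of per-tree records.

    Each record is a dict with at least the keys 'logLikelihood' and 'pAU'.
    """
    # Index of the highest log-likelihood; ties resolve to the first one (assumption).
    best_index = max(range(len(trees)), key=lambda i: trees[i]["logLikelihood"])
    annotated = []
    for i, tree in enumerate(trees):
        row = dict(tree)  # copy, so the input records stay unchanged
        row["isBest"] = i == best_index
        row["plausible"] = tree["pAU"] >= ALPHA
        annotated.append(row)
    return annotated
```

A tree with pAU exactly equal to 0.05 counts as plausible here, matching the "≥ 0.05" wording of the column description.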
Further, two summary data-tables are stored inside empirical-long/:
- empirical-long/empirical_data_features.csv: This datatable summarizes MSA features, for each of the 300 empirical MSAs. These features are: Dataset (dataset name), Type (DNA or AA), Difficulty (Pythia score), Sites (number of sites), and Taxa (number of taxa).
- empirical-long/summary_empirical.csv: This summary table is constructed by the intermediate dataset-specific parquet and CSV files, i.e., the intermediate files stored within each MSA-subfolder. The table comprises the following columns:
  - Dataset: The name of the dataset
  - Type: MSA type, i.e., DNA or AA
  - Version: RAxML-NG version, used to conduct 10 independent ML tree inferences. The distinct versions are: RAxML-NG v1.2 (standard), sRAxML-NG (simplified), KH (simple KH), or KH-mult (KH-multiple testing).
  - Starting Tree: The type of starting trees used for the 10 ML inferences, i.e. parsimony or random
  - RF: The RF distance calculated between the best-ML tree inferred via each version, and the reference tree. For empirical MSAs, the reference tree is the best-ML tree inferred via standard RAxML-NG.
  - Runtime: The runtime (in seconds) of each version to conduct 10 independent ML tree inferences.
  - Speedup_to_standard: Speedup values for the Early Stopping versions relative to standard RAxML-NG. Each speedup is calculated by dividing the runtime of standard RAxML-NG by the runtime of the corresponding version (i.e., simplified, KH, or KH-mult). Both runtimes are obtained from executions on the same dataset (Dataset column) and using the same starting tree type (parsimony or random). By definition, the speedup values reported for standard RAxML-NG itself are 1.0.
  - Speedup_to_simplified: The calculated speedups of the KH versions (KH, KH-mult) relative to the simplified version. The rows corresponding to standard RAxML-NG are left empty (n/a), as the comparison is irrelevant and not informative (see Manuscript for details).
  - Plausible: The number (out of 10) of plausible ML trees inferred via each version, based on the AU test results (see above).
  - Plausible_Category: This column groups the number of plausible ML trees into five categories: 0 (no plausible trees), 1–3 (between 1 and 3 plausible ML trees), 4–6, 7–9, and 10 (all ML trees are plausible).
  - ML_plausible: Boolean column indicating whether the corresponding version inferred at least one plausible ML tree (1) or none (0), based on the counts reported in the Plausible column.
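The derived columns of the summary table can be reproduced from the per-version runtimes and plausible-tree counts. The Python sketch below illustrates the definitions given above; it is not taken from the Snakemake pipeline. The function names, the ASCII spelling of the category labels, the direction of the Speedup_to_simplified ratio (simplified runtime divided by the runtime of the version, by analogy with Speedup_to_standard), and the value 1.0 for the simplified row are assumptions.

```python
def speedups(runtimes):
    """Compute both speedup columns for one dataset and one starting-tree type.

    runtimes maps a version name ('standard', 'simplified', 'KH', ...) to
    its runtime in seconds for 10 ML tree inferences.
    """
    standard = runtimes["standard"]
    simplified = runtimes["simplified"]
    table = {}
    for version, seconds in runtimes.items():
        table[version] = {
            "Speedup_to_standard": standard / seconds,
            # Left empty (None) for standard RAxML-NG, as in the summary table.
            "Speedup_to_simplified": None if version == "standard" else simplified / seconds,
        }
    return table


def plausible_category(count):
    """Group the number of plausible trees (0 to 10) into the five categories."""
    if not 0 <= count <= 10:
        raise ValueError("count must be between 0 and 10")
    if count == 0:
        return "0"
    if count <= 3:
        return "1-3"  # label spelling in the CSV may differ (assumption)
    if count <= 6:
        return "4-6"
    if count <= 9:
        return "7-9"
    return "10"


def ml_plausible(count):
    """Return 1 if at least one plausible ML tree was inferred, else 0."""
    return 1 if count >= 1 else 0
```

For example, with hypothetical runtimes of 100 s for standard and 20 s for KH-mult, Speedup_to_standard for KH-mult is 5.0.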
Simulated Datasets
Simulated/: This folder contains 1,076 simulated DNA MSAs. The results of these simulated datasets are presented in the Supplementary Material of our manuscript. The MSAs were sampled from datasets used in two independent benchmark studies (Höhler, 2022; Trost, 2024). Each simulated MSA is stored in its own subfolder, named exactly after the corresponding dataset in those studies. MSA-subfolder names containing an underscore character "_", such as 10078_0.phy, correspond to MSAs from Trost et al., while MSA-subfolder names without an underscore, such as 9853, correspond to MSAs from Höhler et al. Each MSA-subfolder contains the following key files:
- gtr_g_sim_msa.fasta: The simulated DNA MSA in FASTA format.
- difficulty: The Pythia score, storing the predicted difficulty of the MSA.
- gtr_g.raxml.bestTree: The reference (true) tree for the corresponding simulated MSA.
- pars_* and rand_*: Files related to parsimony and random starting tree executions, respectively.
- {st_tree_type}_{version}.raxml.log: RAxML-NG execution log files. {st_tree_type} indicates the starting tree type: either pars_ (parsimony) or rand_ (random). {version} indicates the RAxML-NG version used: standard, simplified, KH or KH-mult. For some simulated MSAs, we also include log files from two additional versions developed during experimentation: Sampling Noise Normal (SN-Normal; sn-normal) and Sampling Noise RELL (SN-RELL; sn-rell). These additional stopping criteria were part of the experimentation phase and are not reported in the Manuscript (mostly due to negative results), but are documented on GitHub. For each version, we conduct 10 independent ML tree inferences, starting from either 10 parsimony (pars_* files), or 10 random (rand_* files) starting trees.
- pars_consel.out and rand_consel.out: CONSEL log files reporting the results of plausibility tests.
- pars_rfs_and_runtimes.csv and rand_rfs_and_runtimes.csv: Intermediate data tables generated by the Snakemake pipeline (see Snakemake files), which store: (a) the RF distance between the best (out of 10) ML trees inferred by each version, and the reference tree, and (b) the runtime (in seconds) of each version to complete 10 independent ML tree inferences. On simulated MSAs, the reference tree is the true tree used for simulations (gtr_g.raxml.bestTree). The runtime and RF-distance values are used in our analysis to construct speedup and RF-distance distributions, for benchmark assessment. The data table contains the following columns:
  - version: Version used to infer the corresponding best ML tree, which is one of the following: standard, simplified, KH or KH-mult (for some datasets sn-normal and sn-rell as well).
  - RF: RF distance between the best-ML tree found by the corresponding version, and the reference tree.
  - Runtime: Runtime (in seconds) of each version to conduct 10 ML tree inferences.
- pars_summary.parquet and rand_summary.parquet: Intermediate data-tables, generated by the Snakemake pipeline (see Snakemake files), summarizing dataset-specific results. The data-tables comprise the following columns:
  - newick: Inferred ML trees in Newick format.
  - logLikelihood: The log-likelihood score of the corresponding ML tree.
  - isBest: Indicates whether the corresponding tree is the highest scoring tree, according to its log-likelihood score. Only one entry is True, the others are False.
  - pKH, pWKH, pSH, pWSH, pAU: The p-values of the statistical tests: KH, w-KH, SH, w-SH, and AU, respectively. These values are extracted from the CONSEL log files.
  - pKH_significant, pWKH_significant, pSH_significant, pWSH_significant, pAU_significant: These values indicate whether the corresponding ML tree is significant, based on the corresponding statistical test, i.e., when the p-value is ≥ 0.05.
  - plausible: Indicates whether the corresponding ML tree is plausible (True) or not (False). Plausibility is assessed based on the pAU score, i.e., a tree is plausible if pAU ≥ 0.05.
  - version: Version used to infer the corresponding ML tree, which is one of the following: standard, simplified, KH or KH-mult (for some datasets sn-normal and sn-rell as well).
Further, two summary data-tables are included in Simulated/:
- Simulated/simulated_data_features.csv: Data-table summarizing the MSA features, for each of the 1,076 simulated DNA MSAs. Its columns are Dataset (dataset name), Type (DNA only), Difficulty (Pythia score), Sites (number of sites), and Taxa (number of taxa).
- Simulated/summary_simulated.csv: This summary table is constructed by the intermediate dataset-specific parquet and CSV files, i.e., the intermediate files stored within each MSA-subfolder. The table comprises the following columns:
  - Dataset: The name of the dataset
  - Type: DNA
  - Version: RAxML-NG version which was used to conduct 10 independent ML tree inferences. It can be one of the following: RAxML-NG v1.2 (standard), sRAxML-NG (simplified), KH (simple KH), or KH-mult (KH-multiple testing). For some datasets it can also be SN-Normal or SN-RELL.
  - Starting Tree: The type of starting trees used for the 10 ML inferences, i.e. parsimony or random
  - RF: RF distance between the best-ML tree inferred via each version, and the reference tree. For simulated MSAs, the reference tree is the true tree used for simulations (gtr_g.raxml.bestTree).
  - Runtime: The runtime (in seconds) of each version to conduct 10 independent ML tree inferences.
  - Speedup_to_standard: Speedup values for the Early Stopping versions relative to standard RAxML-NG. Each speedup is calculated by dividing the runtime of standard RAxML-NG by the runtime of the corresponding version (i.e., simplified, KH, or KH-mult; for some datasets also SN-Normal and SN-RELL). Both runtimes are obtained from executions on the same dataset (Dataset column) and using the same starting tree type (parsimony or random). By definition, the speedup values reported for standard RAxML-NG itself are 1.0.
  - Speedup_to_simplified: The calculated speedups of the KH and KH-mult versions (for some datasets also SN-Normal and SN-RELL) relative to the simplified version. The rows corresponding to standard RAxML-NG are left empty (n/a), as the comparison is irrelevant and not informative (see Manuscript for details).
  - Plausible: The number (out of 10) of plausible ML trees inferred via each version, based on the AU test results (see above).
  - Plausible_Category: This column groups the number of plausible ML trees into five categories: 0 (no plausible trees), 1–3 (between 1 and 3 plausible ML trees), 4–6, 7–9, and 10 (all ML trees are plausible).
  - ML_plausible: Boolean column indicating whether the corresponding version inferred at least one plausible ML tree (1) or none (0), based on the counts reported in the Plausible column.
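The naming convention of the simulated MSA-subfolders described earlier in this section (an underscore marks MSAs from Trost et al., no underscore marks MSAs from Höhler et al.) can be applied programmatically, for instance to split results by source study. The following Python sketch assumes nothing beyond that rule; the function name source_study is hypothetical.

```python
def source_study(dataset_name):
    """Return the benchmark study a simulated MSA subfolder name points to."""
    # Rule from this README: names with an underscore come from Trost et al.,
    # names without one come from Höhler et al.
    return "Trost et al." if "_" in dataset_name else "Höhler et al."
```

Applied to the two example names given above, 10078_0.phy maps to Trost et al. and 9853 maps to Höhler et al.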
Unsuccessful datasets
unsuccessful_MSAs/: This folder contains a subset of empirical MSAs, referred to as "Unsuccessful" in our manuscript. These are datasets for which the Early Stopping versions failed to infer at least one plausible ML tree. For these datasets, we performed additional analyses by progressively increasing the regrafting radius of the SPR rounds in the Early Stopping versions. The default radius is 10; we varied it using the --spr-radius X option (see below), where X ∈ {12, 14, 16, 18, 20}. The folder contains two subfolders: dna_long_empirical/ and aa_long_empirical/, corresponding to the unsuccessful DNA and AA empirical MSAs, respectively. Each empirical MSA has its own MSA-subfolder containing the following files:
- alignment.phy: The empirical MSA in PHYLIP format (DNA or AA).
- difficulty: The Pythia score.
- pars_* and rand_*: Files related to parsimony and random starting tree executions, respectively. Some subfolders may contain only one of the two sets, while others may contain both, depending on which starting tree types failed to infer at least one plausible ML tree.
- pars_consel.out and rand_consel.out: CONSEL log files reporting the results of plausibility tests.
- pars_rfs_and_runtimes.csv and rand_rfs_and_runtimes.csv: Intermediate data tables generated by the Snakemake pipeline (see Snakemake files), which report: (a) the RF distance between the best (out of 10) ML trees inferred by each version and the reference tree, and (b) the runtime (in seconds) of each version, to complete 10 independent ML tree inferences. On empirical MSAs, the reference tree is considered to be the best-ML tree inferred via standard RAxML-NG. Therefore, all RF entries corresponding to standard RAxML-NG are 0. The runtime and RF-distance values are used in our analysis to generate speedup and RF-distance distributions, for benchmark assessment. The data table comprises the following columns:
  - version: Version used to infer the corresponding best ML tree. It can be one of the following: standard, simplified-X, KH-mult-X, where X is a numeric variable, X ∈ {12,14,16,18,20}.
  - RF: RF distance between the best-ML tree found by the corresponding version, and the reference tree.
  - Runtime: Runtime (in seconds) for each version to conduct 10 ML tree inferences.
- pars_summary.parquet and rand_summary.parquet: Intermediate data-tables, generated by the Snakemake pipeline (see Snakemake files), to summarize the dataset-specific results. These tables comprise the following columns:
  - newick: Inferred ML trees in Newick format.
  - logLikelihood: The log-likelihood score of the corresponding ML tree.
  - isBest: Indicates whether the corresponding tree is the highest scoring tree, according to its log-likelihood score. Only one entry is True, the others are False.
  - pKH, pWKH, pSH, pWSH, pAU: The p-values of the statistical tests: KH, w-KH, SH, w-SH, and AU, respectively. These values are extracted from the CONSEL log files.
  - pKH_significant, pWKH_significant, pSH_significant, pWSH_significant, pAU_significant: These values indicate whether the corresponding ML tree is significant (plausible), based on the corresponding statistical test, i.e., when the p-value is ≥ 0.05.
  - plausible: Indicates whether the corresponding ML tree is plausible (True) or not (False). Plausibility is assessed based on the pAU score, i.e., a tree is plausible if pAU ≥ 0.05.
  - version: Version which inferred the corresponding ML tree, which is one of the following: standard, simplified-X, KH-mult-X, where X is a numeric variable, X ∈ {12,14,16,18,20}.
Further, the unsuccessful_MSAs/ folder contains the following summary table:
- unsuccessful_MSAs/failed-data-summary.csv: Summary table constructed by the intermediate MSA-specific parquet files and CSV files. Its columns are:
  - Dataset: The name of the dataset
  - Type: MSA type, i.e., DNA or AA
  - Version: RAxML-NG version used to conduct 10 independent ML tree inferences. It can be one of the following: RAxML-NG v1.2 (standard), sRAxML-NG (simplified), KH-mult (KH-multiple testing). We did not include KH in this analysis.
  - Radius: The SPR-radius being used for the corresponding version execution. This number belongs to the set {12,14,16,18,20}. For standard RAxML-NG executions, the cell is left empty (n/a), as no experiments were conducted with adjusted SPR regrafting radius for this version.
  - Starting Tree: The type of starting trees used for the 10 inferences, i.e. parsimony or random
  - Plausible: The number (out of 10) of plausible ML trees inferred via each version, based on the AU test results (see above).
  - Plausible_Category: This column groups the number of plausible ML trees into five categories: 0 (no plausible trees), 1–3 (between 1 and 3 plausible ML trees), 4–6, 7–9, and 10 (all ML trees are plausible).
  - ML_plausible: Boolean column indicating whether the corresponding version inferred at least one plausible ML tree (1) or none (0), based on the counts reported in the Plausible column.
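In the per-MSA files of this folder the SPR radius is part of the version label (for example simplified-12 or KH-mult-20), whereas the summary table keeps Version and Radius in separate columns. A minimal Python sketch of splitting such a label is given below; the function name split_version_label is hypothetical, and the sketch assumes the radius is always the final hyphen-separated token and purely numeric.

```python
def split_version_label(label):
    """Split a label such as 'KH-mult-20' into ('KH-mult', 20).

    Labels without a trailing numeric radius (e.g. 'standard') are returned
    unchanged with a radius of None.
    """
    base, separator, tail = label.rpartition("-")
    if separator and tail.isdigit():
        return base, int(tail)
    return label, None
```

Splitting at the last hyphen only keeps the hyphen inside KH-mult intact.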
Tool invocation commands:
To invoke standard RAxML-NG v1.2:
# Using 10 parsimony starting trees
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree pars{10} --seed 0
# Using 10 random starting trees
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree rand{10} --seed 0
To invoke the simplified (sRAxML-NG) version:
# Using 10 parsimony starting trees (the random trees are omitted -- see above)
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree pars{10} --seed 0 --extra simplified-on
# To specify a different SPR radius (e.g. X = 20):
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree pars{10} --seed 0 --extra simplified-on --spr-radius 20
To invoke the KH version:
# Using 10 parsimony starting trees (the random trees are omitted -- see above)
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree pars{10} --seed 0 --stopping-criterion KH
To invoke the KH-multiple testing version:
# Using 10 parsimony starting trees (the random trees are omitted -- see above)
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree pars{10} --seed 0 --stopping-criterion KH-mult
# To specify a different SPR radius (e.g. 20), add the --spr-radius 20 option, see above
To invoke the Sampling Noise versions (for some simulated data):
# SN-Normal, Using 10 parsimony starting trees
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree pars{10} --seed 0 --stopping-criterion sn-normal
# SN-RELL, Using 10 parsimony starting trees
./raxml-ng-adaptive --adaptive off --threads 1 --msa {msa} --model {model} --tree pars{10} --seed 0 --stopping-criterion sn-rell
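The invocations above differ only in the starting tree type, the version-specific flags, and the optional SPR radius. As a hedged sketch, the Python helper below assembles the corresponding argument lists, for example to pass them to subprocess.run. All flags are copied from the commands listed above; the function name build_command, its parameters, and the file and model names used in the usage note are hypothetical.

```python
# Flags shared by every invocation listed above.
BASE = ["./raxml-ng-adaptive", "--adaptive", "off", "--threads", "1"]

# Version-specific flags, copied from the commands above.
VERSION_FLAGS = {
    "standard": [],
    "simplified": ["--extra", "simplified-on"],
    "KH": ["--stopping-criterion", "KH"],
    "KH-mult": ["--stopping-criterion", "KH-mult"],
    "sn-normal": ["--stopping-criterion", "sn-normal"],
    "sn-rell": ["--stopping-criterion", "sn-rell"],
}


def build_command(msa, model, version, start="pars", n_trees=10, spr_radius=None):
    """Return the argument list for one RAxML-NG run configuration."""
    if start not in ("pars", "rand"):
        raise ValueError("start must be 'pars' or 'rand'")
    if version not in VERSION_FLAGS:
        raise ValueError(f"unknown version: {version}")
    # Doubled braces in the f-string produce the literal braces of e.g. pars{10}.
    command = BASE + ["--msa", msa, "--model", model,
                      "--tree", f"{start}{{{n_trees}}}", "--seed", "0"]
    command += VERSION_FLAGS[version]
    if spr_radius is not None:
        command += ["--spr-radius", str(spr_radius)]
    return command
```

For instance, build_command("alignment.phy", "GTR+G", "simplified", spr_radius=20) yields the same arguments as the sRAxML-NG command with --spr-radius 20 shown above, with alignment.phy and GTR+G standing in as placeholder values for {msa} and {model}. Passing the list to subprocess.run without a shell avoids any shell interpretation of the braces in pars{10}.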
Further details
Please refer to the main text of the manuscript for a comprehensive explanation of the experimental setup, analysis, and results.
References
Haag, J., Höhler, D., Bettisworth, B., & Stamatakis, A. (2022). From easy to hopeless—predicting the difficulty of phylogenetic analyses. Molecular Biology and Evolution, 39(12), msac254.
Höhler, D., Haag, J., Kozlov, A. M., & Stamatakis, A. (2022). A representative performance assessment of maximum likelihood based phylogenetic inference tools. BioRxiv, 2022-10.
Kishino, H., & Hasegawa, M. (1989). Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution, 29, 170-179.
Piel, W. H., Donoghue, M., Sanderson, M., & Netherlands, L. (2000, May). TreeBASE: a database of phylogenetic information. In Proceedings of the 2nd International Workshop of Species (Vol. 2000).
Robinson, D. F., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2), 131-147.
Shimodaira, H., & Hasegawa, M. (2001). CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics, 17(12), 1246-1247.
Trost, J., Haag, J., Höhler, D., Jacob, L., Stamatakis, A., & Boussau, B. (2024). Simulations of sequence evolution: how (un)realistic they are and why. Molecular Biology and Evolution, 41(1), msad277.
