The Dayhoff Exchange Score: A new metric to quantify site saturation in amino acid datasets prior to phylogenetic analysis
Data files
Aug 12, 2024 version files 8.82 GB
- DEScoreSuppData.zip (8.82 GB)
- README.md (10.55 KB)
Dec 17, 2025 version files 14.52 GB
- 1_Kocot2017.zip (48.07 MB)
- 2_MissingData_Simulations.zip (832.26 MB)
- 3_NoiseSimulations.zip (546.43 MB)
- 4_Other.zip (8.21 MB)
- 5_PositionAndTaxaIncrease_DEScore.zip (3.49 GB)
- 6_SaturationSims.zip (4.93 GB)
- 7_RealDataStudies.zip (2.80 GB)
- 8_ComparisonWithSatuRation.zip (1.86 GB)
- README.md (17.18 KB)
Dec 22, 2025 version files 14.52 GB
- 1_Kocot2017.zip (48.07 MB)
- 2_MissingData_Simulations.zip (832.26 MB)
- 3_NoiseSimulations.zip (546.43 MB)
- 4_Other.zip (8.21 MB)
- 5_PositionAndTaxaIncrease_DEScore.zip (3.49 GB)
- 6_SaturationSims.zip (4.93 GB)
- 7_RealDataStudies.zip (2.80 GB)
- 8_ComparisonWithSatuRation.zip (1.86 GB)
- README.md (17.18 KB)
Abstract
Entropic site saturation is a persistent problem in phylogenetic analyses, where it can hinder the accuracy of topology reconstruction. It is fundamentally caused by large amounts of independent change along branches, causing the model to be unable to distinguish phylogenetic signal from noise. The Dayhoff Exchange Score (DE-score) is a new metric to assess this form of site saturation within and between amino acid datasets, which provides both a whole dataset overview and taxon-specific values that represent the contribution of a given taxon to the whole dataset entropic site saturation. We first assess the efficacy of this score at detecting increased entropic site saturation over 20,000 simulation datasets, compare it to the existing Slope R2 score, and then assess its efficacy in the face of the potentially confounding factors of increasing taxon number, number of positions in the alignment, missing data, and noise. Finally, we use the DE-Score to re-evaluate several previously published datasets to illustrate its efficacy.
This supplemental information accompanies the DEScore paper, and is formatted as follows:
The supplemental data is divided into 8 zip files, one per section:
1_Kocot2017.zip
2_MissingData_Simulations.zip
3_NoiseSimulations.zip
4_Other.zip
5_PositionAndTaxaIncrease_DEScore.zip
6_SaturationSims.zip
7_RealDataStudies.zip
8_ComparisonWithSatuRation.zip
1. Kocot2017
This directory contains three further directories, Kocot2017_Slope, Best532_LB_DE_nRCFV and DEScore_Reselected.
Kocot2017_Slope comprises 6 directories: these are copies of the Best 106 genes through to Best 532 genes as selected by the Slope Measurements in Kocot et al 2017. These directories each contain a phylip alignment file, a partition file in txt format, and the output of a RAxML analysis: a most likely tree (RAxML_bipartitions.<Slope Sextile>.tre), bootstrap support files, and a bipartition file.
Best532_LB_DE_nRCFV contains the results of the Kocot et al 2017 reselection based on the genes in the top 5 sextiles of the LB-Score, DE-Score and nRCFV. It comprises a fasta file and a set of IQTree output files (including a newick tree).
DEScore_Reselected comprises 5 directories, each corresponding to one of the Best 106 genes through to the Best 532 genes as selected by the DE-Score. Each directory contains 10 files: a fasta multiple sequence alignment file, a list of selected genes in txt format, a partition file in text format (formatted by cat sequences, then manually reformatted for input to IQtree), a most likely tree (DEScore.<Best Sextile>.partitions.clean.txt.treefile), a most likely tree with the taxa labels corrected to correspond to the Kocot et al naming scheme (DEScore.<Best Sextile>.partitions.txt.clean.treefile) and 5 intermediate output files supplied by IQtree, detailing the maximum likelihood distances (mldist), the logs of the phylogenetic tree run (.iqtree and .log), the checkpoint file (ckp.gz) and the neighbor joining tree used to initialize the tree search (bionj).
2. MissingData_Simulation
This directory contains the Missing Data simulation data, in both Gapped and Ungapped forms, in the Gaps and NoGaps directories, respectively. The Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev) for each simulation category can be found in Missing_<SextileNumber>.TransitionTransversion.Pairwise.txt, in the Gaps and NoGaps directories. These measurements were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
Inside each further directory (labelled by sextile from 1of6 to 6of6) is the simulation data from the Fleming & Struck (2023) paper that is reused here, in phylip multiple sequence alignment format.
3. NoiseSimulations
This directory contains the Noise Simulations generated by NoiseMaker (which can be found at the DEScore Github, and in 4. Other).
Inside NoiseSimulations are three directories: OriginalSims, NoiseMaker and ConCatSims. OriginalSims contains 100 simulation datasets, in phylip format, that can also be found in 5_PositionAndTaxaIncrease_DEScore/250Taxa/2100Bases/. NoiseMaker contains the pure noise that was added to these datasets, organised in 5 directories (from 10Percent Noise through to 50Percent Noise). The noise is stored in phylip format.
The ConCatSims directory contains the data analysed in the Noise Simulation analysis. The Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev) for each simulation category can be found in <NoisePercent>.TransitionTransversion.Pairwise.txt. These measurements were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
The ConCatSims directory also contains the alignments, in the 5 directories 10Percent, 20Percent, 30Percent, 40Percent and 50Percent. These directories each contain 100 phylip files, each corresponding to the concatenation of one of the original simulations (250 Taxa by 2100 positions) with its paired noise file (for example, OriginalSims/250by2100.aa.1.fas.phy is paired with NoiseMaker/10Percent/1.Noise.phy to produce ConCatSims/10Percent/10Percent.1.Concat.phy).
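The pairing described above can be sketched as a small path helper. This is only an illustration of the naming pattern given in the example: the root directory name `3_NoiseSimulations` and the assumption that replicates are numbered 1 to 100 are not stated in this README, and `noise_sim_paths` is a hypothetical helper, not part of the DE-Score software.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def noise_sim_paths(percent: int, replicate: int,
                    root: str = "3_NoiseSimulations"):
    """Return (original, noise, concatenated) paths for one replicate.

    The file name patterns follow the single example given in the README;
    the root directory name is an assumption about the unzipped archive.
    """
    base = PurePosixPath(root)
    label = f"{percent}Percent"
    original = base / "OriginalSims" / f"250by2100.aa.{replicate}.fas.phy"
    noise = base / "NoiseMaker" / label / f"{replicate}.Noise.phy"
    concat = base / "ConCatSims" / label / f"{label}.{replicate}.Concat.phy"
    return original, noise, concat


if __name__ == "__main__":
    # Show the pairing for replicate 1 at every noise level.
    for pct in (10, 20, 30, 40, 50):
        original, noise, concat = noise_sim_paths(pct, 1)
        print(f"{original} + {noise} -> {concat}")
```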
4. Other
This directory contains the software that can be found at the Github: DEScore_Calculator.pl and Noisemaker.pl, in the subdirectory DEScore-main. It also contains an additional copy of the PDF Supplementary Information file associated with the main text.
It additionally contains the DEScore Excel file, which provides a compiled form of the data used to initially calculate the DE-Score. This Excel file comprises six sheets:
1\) DayhoffJTTComparison, which assesses the ability of the DE-Score and Dayhoff Exchange Frequency to assess site saturation across increasing branch scaling factors, under two models. The source data that populates this sheet can be found in directory 6_SaturationSims.
2\) ComparingSlopeAndDEScore, which compares the results of Slope and the DE-Score under the Chang et al trees, and shows the Spearman's rank correlation coefficients between each dataset, which are used in the main text. The source data that populates this sheet can be found in directory 6_SaturationSims.
3\) BiasCausedByTaxaSites: The analysis of the variation in Dayhoff Exchange Frequency and DE-Score with respect to the number of taxa and amino acid sites in the dataset. This data was used to form the DE-Score, as detailed in the supplementary text. The source data that populates this sheet can be found in directory 5_PositionAndTaxaIncrease_DEScore.
4\) MissingDataBias: The analysis of the variation in Dayhoff Exchange Frequency and DE-Score with respect to the amount of Missing Data in the dataset. Ungapped simulations were generated by alisim's alignment mimic command filling gaps with amino acids based on the proportional amino acid frequencies of the source dataset, while gapped simulations were generated by preserving the gaps present in the original source dataset. The source data that populates this sheet can be found in directory 2_MissingData_Simulations.
5\) Noise: The analysis of the variation in Dayhoff Exchange Frequency and DE-Score with respect to the amount of Noise in the dataset. Noise was generated by concatenating 10, 20, 30, 40 and 50% of pure noise (generated by Noisemaker, found in directory 4_Other) to the 250by2100 dataset from directory 5_PositionAndTaxaIncrease_DEScore. The source data that populates this sheet can be found in directory 3_NoiseSimulations.
6\) Kocot_RealData: The selection of the sextiles for the reselection and Real Data analysis that populates directory 1_Kocot2017. The second half of this sheet (Columns P through AI) assesses which genes appear in both datasets. #N/A denotes a gene that does not appear in that sextile in the DE-Score selected dataset, but that does appear in the Slope selected dataset.
5. PositionAndTaxaIncrease_DEScore
This directory contains the simulation data, generated in IQ-Tree using the alisim function, corresponding to amino acid datasets with increasing numbers of taxa and amino acid positions from 50 to 500 taxa and 300 to 3000 amino acid positions.
These simulation datasets were generated using the following command:
iqtree2 --alisim $SimulationPrefix --seed 2014 -m WAG+F+G -t RANDOM{bal/$taxaNumber} --length $alignmentLength --num-alignments 100 --out-format fasta -redo
This directory is first divided by taxa, and within that it is divided by position. For example, 50Taxa/50by3000 corresponds to a 3000 amino acid long alignment with 50 taxa.
In each Taxa folder, you will find 10 txt files named <Taxa>by<Position>.TransitionTransversion.Pairwise.txt. Each file contains the Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev). These were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
Each taxa-position pair folder (e.g. PositionAndTaxaIncrease_DEScore/50Taxa/50by3000) additionally contains the 100 simulations in both .fa and .phy format.
6. SaturationSims
This directory contains the simulation data, originally sourced from Hernandez & Ryan (2021), corresponding to amino acid datasets with increasing amounts of saturation (due to branch length scaling factors).
Inside this directory you will find 01-DAYHOFF and 02-JTT, as in the supplemental data of Hernandez & Ryan 2021. These directories correspond to the model the data was generated under.
Within 01-DAYHOFF you will find:
\- Chang.tre: the original Chang et al tree file. This is the "true tree" used for the simulation comparisons.
\- Chang.PAM.<number>.TransitionTransversion.Pairwise.txt. The Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev) for each simulation category, alongside the average uncorrected p-distance and the standard deviation of the p-distance of the simulation. These measurements were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
\- 20 further directories corresponding to the 20 branch length scaling factors, each named Chang_PAM_<number>.
Inside each of these directories, for each simulation, you will find:
\- A phylip file
\- A most likely tree (<simulation number>.phy.treefile)
\- Intermediate output files supplied by IQtree, detailing the maximum likelihood distances (mldist), the logs of the phylogenetic tree run (.iqtree and .log), the checkpoint file (ckp) and the neighbor joining tree used to initialize the tree search (bionj)
\- The SlopeResults directory, which contains an additional 20 directories (Chang_PAM_<number>_Slope), one for each simulation category.
Inside each of these 20 directories are:
\- Chang.PAM.<number>.alignments.txt - a list of the alignment phylip files for each category
\- Chang.PAM.resolvedTrees.txt - a list of the tree files for each category
\- Chang.PAM.trees.txt - the input file for a TreSpex analysis against the generating tree
\- log.txt - the log of the TreSpex Slope analysis against the generating tree
\- log.resolved.txt - the log of the TreSpex Slope analysis against the resolved trees
\- Saturation_Files_used.log - output of TreSpex confirming the alignments assessed in the analysis
\- Summary_Saturation.log - log of the command used by TreSpex to recover the resolved trees analysis
Alongside two further directories - GeneratingCorrelation_Results and TrueCorrelation_Results:
\- Correlation_Slope_Summary.txt - this file contains the Slope and R2 measurements for each alignment tested.
\- Both directories comprise two directories: one containing the P Matrices for each simulation analysis in txt format, and one containing the PD matrices in txt format.
These two values are used to calculate the Slope value.
\- Under GeneratingCorrelation_Results, only one PDMatrix txt file exists, as all alignments were compared against the generating tree (Chang.tre). In
RecoveredCorrelation_Results, you will find one PDMatrix for each tree.
Within 02-JTT you will find:
\- 20 further directories corresponding to the 20 branch length scaling factors, each named Chang_JTT_<number>. These contain the simulation alignments in phylip format generated under the JTT model in Hernandez & Ryan 2021 for each branch length scaling factor.
\- Chang.JTT.<number>.TransitionTransversion.Pairwise.txt - This file contains the Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev). These were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
\- Chang.JTT.<number>.phy - a phylip alignment used in Hernandez & Ryan 2021 as the seed to generate the alignments under that branch length scaling category.
7. Real Data Studies
This directory contains the results of the real data studies detailed in the main text. Inside this directory you will find two further directories: Multiple Gene Datasets and Single Gene Datasets.
Multiple Gene Datasets contains the results of the twenty-one "Multiple Gene" studies re-analysed by the DE-Score inside each directory (Misof et al. 2014, Fernández et al. 2016, Fernández et al. 2017, Irisarri et al 2017, Kocot et al 2017, Peters et al. 2017, Fernandez et al 2018, Hughes et al 2018, Johnson et al. 2018, Sharma et al. 2018, Shen, Jin et al 2018, Shen, Opulente et al. 2018, Benavides et al. 2019, Evangelista et al 2019, Kawahara et al. 2019, Simon et al. 2019, Steenwyk et al 2019, Milla et al 2020, Mongiardo, Koch & Thompson 2021, Wibberg et al 2021, Herranz et al 2022). Each of these directories contains:
\<Dataset Name>.DEScore.fastaSummary.txt - a file containing a list of the DE-Scores of individual gene alignments
\<Dataset Name>.nRCFV.fastaSummary.txt - a file containing a list of the nRCFVs of individual gene alignments
CompleteDataset - a directory containing the complete alignment and its corresponding DE-Score and nRCFV outputs
IndividualGenes - a directory containing individual gene alignments and their corresponding DE-Score and nRCFV outputs
TopSextile - a directory containing the alignment derived by selecting the top sextile of genes as determined by the DE-Score, and its corresponding DE-Score and nRCFV outputs
Single Gene Datasets contains the results of the six "Single Gene" datasets, drawn from four studies, re-analysed by the DE-Score inside each directory (Fleming et al 2021, Giacomelli et al 2025, Novotna-Floriancicova et al 2023, Vieira & Rozas 2011). Each of these directories contains the relevant alignment and its corresponding DE-Score and nRCFV outputs.
Please note that two directories - Fleming et al 2021 and Vieira & Rozas 2011 - contain further directories A and B, as two datasets were re-analysed in each study.
8. Comparison with SatuRation
This directory contains the data used to compare the DE-Score metric with the Mean Historical Signal (Lambda) metric used by SatuRation (https://github.com/lsjermiin/SatuRation.v1.0/). As another pre-analysis entropic site saturation detection metric, it is a useful benchmark for assessing the effectiveness of the DE-Score. The implications of this assessment are detailed in our Supplemental Methods.
Inside this directory you will find two directories and one excel file.
- SatuRation_DEScoreComparison.xlsx - this Excel file presents two sheets, one comparing the Mean Lambda and DE-Score under the Dayhoff simulations that were initially used to compare the DE-Score to the Slope Measurement (Supplemental Data Part 6), and a second comparing the two metrics under the JTT simulations used to assess the efficacy of the DE-Score under non-model conditions (Supplemental Data Part 6). This is achieved by means of Pearson Correlations against both metrics and of both metrics against the branch length scaling factor of the simulation datasets.
The two directories are:
- 01-DAYHOFF
- 02-JTT
Within each of these two directories are three files, <ModelName>_1_9_MeanLambda.txt, <ModelName>_10_19_MeanLambda.txt and <ModelName>_20_MeanLambda.txt. These files contain lists of the MeanLambda, in order, over each of the scaling categories (1-9, 10-19, 20).
These two directories additionally contain 20 further directories each. These contain fasta files of the simulations contained in Part 6 (x.phy.fas), alongside the summary result output produced by SatuRation (x.phy.fas.summary.txt).
ChangeLog
This is the third upload of this supplemental dataset, and significant changes have occurred between this version and the original upload.
v3.0:
4_Other - Both scripts, Noisemaker.pl and DEScoreCalculator.pl have been significantly revised. Noisemaker has changed from a Ruby script to a Perl script, and the DEScoreCalculator has moved to Version 1.5, which includes a new, faster logic for calculating the DE-Score, a removal of perl module dependencies and a POD.
7_Real Data Studies - The multi-gene datasets have expanded from 5 to 21.
v2.0:
Most notably, this version divides the supplemental data into individual zip files, allowing users to more easily access the data they are interested in. This comes with a new data structure that groups the data into 8 categories: 1_Kocot2017.zip, 2_MissingData_Simulations.zip, 3_NoiseSimulations.zip, 4_Other.zip, 5_PositionAndTaxaIncrease_DEScore.zip, 6_SaturationSims.zip, 7_RealDataStudies.zip, 8_ComparisonWithSatuRation.zip.
In addition to this new partitioning, sections 7 and 8 are both new. Section 7 presents 6 single gene and 5 multi-gene datasets to evidence the efficacy of the DE-Score. Section 8 compares the DE-Score with SatuRation, a direct measurement of entropy in a multiple sequence alignment.
The methods and their implications are explored in greater detail in the pdf file that can be found inside folder 4_Other.
1_Kocot2017: Applying the DE-Score to real data: Reselection Datasets
The work of Kocot et al (2017) (Kocot, K.M., Struck, T.H., et al. 2017) was chosen for more thorough examination. This paper was chosen for reanalysis as the Slope score (Nosenko, T., Schreiber, F., et al. 2013) was previously used to assess site saturation within genes in this dataset, which allowed us to assess the coherence between the two metrics and the effect of their differences.
For this reselection study, a phylogenetic tree was recovered using IQTree-mpi (v1.6.12) (Nguyen, L.-T., Schmidt, H.A., et al. 2015) and the LG+F model, as in the original analysis, but instead using the sextile of least saturated genes, as selected by the DE-Score, and then a concatenated dataset comprising the 145 genes that were placed in the highest five out of six sextiles (532 genes) of the DE-Score, nRCFV, occupancy and LB-Score at once. These topologies were then compared to the sextiles selected by the Slope score and the highest five out of six tree in the initial study (Kocot, K.M., Struck, T.H., et al. 2017).
The command used to generate the topologies in IQTree was:
iqtree-mpi -s -sp -m LG+F
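The "highest five out of six sextiles on every metric at once" selection described above amounts to an intersection of per-metric filters. The sketch below illustrates that idea only. The function names, the rank-based way sextiles are cut, and the direction in which each metric counts as "better" are all assumptions for illustration, not the procedure used in the Excel sheet.

```python
from __future__ import annotations


def sextile_of(scores: dict[str, float], higher_is_better: bool) -> dict[str, int]:
    """Assign each gene a sextile from 1 (best) to 6 (worst) by rank.

    Assumption: sextiles are equal-sized rank bins; ties keep input order.
    """
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    n = len(ordered)
    return {gene: (rank * 6) // n + 1 for rank, gene in enumerate(ordered)}


def genes_in_top_five(metrics: dict[str, tuple[dict[str, float], bool]]) -> set[str]:
    """Keep genes that fall in sextiles 1-5 for every metric simultaneously.

    `metrics` maps a metric name to (scores per gene, higher_is_better).
    """
    keep: set[str] | None = None
    for scores, higher_is_better in metrics.values():
        sextiles = sextile_of(scores, higher_is_better)
        passing = {gene for gene, s in sextiles.items() if s <= 5}
        keep = passing if keep is None else keep & passing
    return keep or set()


if __name__ == "__main__":
    # Toy scores for six hypothetical genes; directions are illustrative.
    metric_a = {g: float(i) for i, g in enumerate("abcdef", start=1)}
    metric_b = {g: float(i) for i, g in enumerate("abcdef", start=1)}
    selected = genes_in_top_five({
        "metric_a": (metric_a, False),  # lower is better (assumed)
        "metric_b": (metric_b, True),   # higher is better (assumed)
    })
    print(sorted(selected))
```

A gene is dropped as soon as it lands in the worst sextile of any one metric, which is why the combined set (145 genes in the study) is much smaller than the five-sextile set for a single metric.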
2_MissingData_Simulations: Assessing the Effects of Missing Data
To assess the effect of missing data on the Dayhoff Category Exchange Ratio and the DE-Score, we used the 1,200 simulation datasets that were used to assess the effect of Missing Data on nRCFV in (Fleming, J.F. and Struck, T.H. 2023). These simulation datasets used the six "Missing Data" categories of Kocot et al (2017) (Kocot, K.M., Struck, T.H., et al. 2017). These categories divided the total dataset assessed in that study into sextiles based on increasing percentage Missing Data, ranging from 18.17% to 38.43%. As per Kocot et al's methodology, missing data was classified as the presence of ambiguity characters, gaps or a lack of sampling (or absence) of the target gene in the taxon.
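As a rough illustration of that definition of missing data, the sketch below counts gaps, ambiguity characters and wholly absent taxa as a percentage of all alignment cells. The set of characters treated as missing (`-`, `?`, `X`) and the function name are assumptions; the original study may have used a different character set.

```python
from __future__ import annotations

# Assumed set of characters counted as missing: gap, unknown, ambiguous residue.
MISSING = set("-?X")


def percent_missing(alignment: dict[str, str],
                    expected_taxa: list[str] | None = None) -> float:
    """Percentage of alignment cells that are missing.

    A taxon listed in `expected_taxa` but absent from `alignment` counts as
    fully missing, mirroring "lack of sampling of the target gene".
    """
    if not alignment:
        raise ValueError("alignment is empty")
    taxa = expected_taxa if expected_taxa is not None else list(alignment)
    length = len(next(iter(alignment.values())))
    missing = 0
    for taxon in taxa:
        seq = alignment.get(taxon)
        if seq is None:
            missing += length  # gene not sampled for this taxon
        else:
            missing += sum(ch.upper() in MISSING for ch in seq)
    return 100.0 * missing / (len(taxa) * length)


if __name__ == "__main__":
    toy = {"t1": "ACD-", "t2": "A?XE"}
    print(percent_missing(toy))                      # gaps and ambiguity only
    print(percent_missing(toy, ["t1", "t2", "t3"]))  # t3 was never sampled
```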
Simulation datasets were created using the alignment mimic command in IQTree2's alisim, which replicates the conditions of the source dataset - including missing data. This resulted in 600 simulation datasets including missing data, which we named "gapped" datasets.
Simulated replicates of the same Kocot et al datasets were then generated using Alisim's --no-copy-gaps option, which generates gapless simulation datasets that otherwise replicate the source alignment. The following commands were used in IQTree 2.2.0:
For “Gapped” datasets: iqtree2 --alisim <Output> -s <Missing Data Dataset>
For “Ungapped” datasets: iqtree2 --alisim <Output> -s <Missing Data Dataset> --no-copy-gaps
3_NoiseSimulations: Assessing the Effects of Noise
To assess the effect of noise on the Dayhoff Category Exchange Ratio and the DE-Score, we first selected one category of the simulation datasets used to initially assess changes in the number of taxa and positions - 100 simulation datasets with 250 taxa and 2100 positions. This dataset was chosen as it represented a medium-sized dataset among our simulation datasets. New simulation datasets were generated by combining the 100 simulated datasets with an additional 10%, 20%, 30%, 40% and 50% noise, resulting in datasets that were 2333, 2625, 3000, 3500 and 4200 amino acids long, respectively. Noise was generated by randomly generating a number between 1 and 20 for each site within each taxon in the alignment. Each number was assigned an amino acid prior to the generation of the random number string, and the randomly generated numbers were then transformed into amino acid noise. The script used to do this, Noisemaker, can be found inside section 4_Other, and at the DE-Score Calculator Github, here:
https://github.com/JFFleming/DEScore
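The noise procedure can be sketched in a few lines of Python. This is not the Noisemaker script itself. The reported lengths (2333 to 4200) are consistent with the noise making up the stated percentage of the final alignment, which is how `noise_columns` computes the number of columns to add. The number-to-amino-acid ordering in `AMINO_ACIDS` is an arbitrary assumption, since the README does not give the mapping.

```python
from __future__ import annotations

import random

# Assumed mapping: 1 -> A, 2 -> R, ..., 20 -> V. The real mapping may differ.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def noise_columns(original_length: int, noise_percent: int) -> int:
    """Columns of noise needed so noise is `noise_percent` of the final length."""
    final_length = round(original_length * 100 / (100 - noise_percent))
    return final_length - original_length


def make_noise(taxa: list[str], n_columns: int,
               seed: int | None = None) -> dict[str, str]:
    """Draw a number from 1 to 20 per site and taxon, then map it to a residue."""
    rng = random.Random(seed)
    return {
        taxon: "".join(AMINO_ACIDS[rng.randint(1, 20) - 1]
                       for _ in range(n_columns))
        for taxon in taxa
    }


if __name__ == "__main__":
    for pct in (10, 20, 30, 40, 50):
        extra = noise_columns(2100, pct)
        print(f"{pct}% noise: {extra} columns, final length {2100 + extra}")
    print(make_noise(["taxon_1", "taxon_2"], 10, seed=2014))
```

Because every residue is drawn uniformly and independently for each taxon, the added columns carry no phylogenetic signal by construction.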
5_PositionAndTaxaIncrease_DEScore: Assessing the Effects of an Increasing Number of Taxa and Positions on the Dayhoff Category Exchange Ratio
To assess the effect of changes in the number of taxa and positions in an alignment on the Dayhoff Category Exchange Ratio and the DE-Score, we generated 10,000 simulation datasets under the WAG+F+G model on a balanced tree, using the alisim function of IQTree version 2.3.6 (Nguyen, L.-T., Schmidt, H.A., et al. 2015, Ly-Trong, N., Naser-Khdour, S., et al. 2022), under the following command:
iqtree2 --alisim $SimulationPrefix --seed 2014 -m WAG+F+G -t RANDOM{bal/$taxaNumber} --length $alignmentLength --num-alignments 100 --out-format fasta -redo
WAG+F+G was used to create simulations that contained more compositional variability than the GTR model, and thereby more realistic variation in site saturation between simulation datasets. Simulation datasets were created in bins of 100 datasets each, with taxa at intervals of 50 from 50 to 500 taxa and sequences at intervals of 300 from 300 to 3000 positions, producing an end product of 10,000 simulation datasets.
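The simulation grid can be enumerated as below to check the 10,000 total and to assemble the argument list for each bin. The flags are copied from the command above. The output prefix pattern `<taxa>by<length>` is an assumption based on the directory names, and the sketch only builds the commands without running IQ-TREE.

```python
from __future__ import annotations


def simulation_grid() -> list[tuple[int, int]]:
    """All (taxa, alignment length) bins: 50..500 by 50 and 300..3000 by 300."""
    return [(taxa, length)
            for taxa in range(50, 501, 50)
            for length in range(300, 3001, 300)]


def alisim_command(prefix: str, taxa: int, length: int) -> list[str]:
    """Argument list mirroring the iqtree2 command quoted in this section."""
    return [
        "iqtree2", "--alisim", prefix,
        "--seed", "2014",
        "-m", "WAG+F+G",
        "-t", f"RANDOM{{bal/{taxa}}}",
        "--length", str(length),
        "--num-alignments", "100",
        "--out-format", "fasta",
        "-redo",
    ]


if __name__ == "__main__":
    grid = simulation_grid()
    print(len(grid), "bins x 100 alignments =", len(grid) * 100, "datasets")
    taxa, length = grid[0]
    # Prefix naming is hypothetical; adjust to the naming you need.
    print(" ".join(alisim_command(f"{taxa}by{length}", taxa, length)))
```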
6_SaturationSims: Establishing the ability of the DE-Score to detect Site Saturation and comparing the DE-Score to the slope of the regression line of patristic vs. p-distances
To assess the utility of directly measuring the ratio of the DE-Score to assess site saturation within a dataset, we used simulation datasets previously used to assess site saturation by Hernandez & Ryan (2021) (Hernandez, A.M. and Ryan, J.F. 2021). These datasets, originally based on the Chang et al (2015) tree (Chang, E.S., Neuhof, M., et al. 2015), were generated by applying a scaling factor to all branches of the original tree, from 1 to 20. The datasets were generated under two models, the Dayhoff model and the JTT model, using Seq-Gen, as described in Hernandez & Ryan (2021) (Hernandez, A.M. and Ryan, J.F. 2021). Each scaling factor category comprises 1,000 datasets, resulting in a total of 20,000 datasets for each model and 40,000 datasets overall.
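Applying a scaling factor to every branch of a tree can be illustrated with a small Newick rewrite. This is a sketch only and not how Hernandez & Ryan produced their trees. It assumes a plain Newick string in which every colon introduces a branch length (no quoted labels containing colons), and `scale_branch_lengths` is a hypothetical helper.

```python
from __future__ import annotations

import re

# A colon followed by a decimal number, optionally in scientific notation.
_BRANCH = re.compile(r":(\d*\.?\d+(?:[eE][-+]?\d+)?)")


def scale_branch_lengths(newick: str, factor: float) -> str:
    """Multiply every branch length in a simple Newick string by `factor`."""
    def _scaled(match: re.Match) -> str:
        return ":" + format(float(match.group(1)) * factor, ".6g")
    return _BRANCH.sub(_scaled, newick)


if __name__ == "__main__":
    tree = "((A:0.1,B:0.2):0.05,C:0.3);"
    for factor in (1, 2, 20):
        print(factor, scale_branch_lengths(tree, factor))
```

Longer branches mean more substitutions per site along each lineage, which is what drives the increasing saturation across the 20 categories.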
To assess the efficacy of our new metric in comparison to existing metrics, we calculated the slope of the regression line of patristic vs. p-distances as described in Nosenko et al (2013) and implemented in TreSpEx (Nosenko, T., Schreiber, F., et al. 2013, Struck, T.H. 2014). As the Slope R2 score has not yet been assessed on simulation datasets, we assessed its efficacy against the 20,000 simulation datasets generated under the Dayhoff model by Hernandez & Ryan (2021) (Hernandez, A.M. and Ryan, J.F. 2021) that were initially used to assess the efficacy of the DaCER.
We generated two variations of the Slope R2 score - one by comparing the patristic distances of the true tree to the p-distances and one by comparing the patristic distances of the trees generated by the Dayhoff model simulation datasets, analysed under the simpler JTT model. This was intended to model the effect of phylogenetic topological artifacts caused by site saturation on the Slope R2 score.
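To make the Slope and R2 measurement concrete, the sketch below computes an uncorrected p-distance for a pair of sequences and fits an ordinary least squares line through paired distances. It is not TreSpEx. Which distance goes on which axis (patristic on x, p-distance on y) and the characters skipped as missing are assumptions, and patristic distances are taken as given rather than computed from a tree.

```python
from __future__ import annotations

SKIP = set("-?X")  # assumed missing/ambiguous characters, ignored pairwise


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    compared = differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in SKIP or y in SKIP:
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else float("nan")


def slope_and_r2(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least squares slope of y on x, and the squared correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Toy pairwise values: patristic distances grow faster than p-distances.
    patristic = [0.1, 0.4, 0.9, 1.6]
    observed = [0.09, 0.30, 0.50, 0.62]
    print(p_distance("ACDEFG", "ACDQFG"))
    print(slope_and_r2(patristic, observed))
```

Under this orientation a flatter slope means p-distances stop increasing as patristic distances grow, which is the usual signature of saturation.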
We then used the Spearman's rank correlation coefficient to assess each pair of conditions, by ranking the Slope R2 or DE-score of each dataset and comparing these ranks to assess correlation between the two metrics.
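For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the ranks of the two variables. The dependency-free sketch below shows the calculation with average ranks for ties. The function names and the toy values are illustrative only, and the values in the Excel sheet may have been computed with different software.

```python
from __future__ import annotations

import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns nan if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Toy metric values for five hypothetical datasets.
    metric_one = [0.95, 0.80, 0.61, 0.40, 0.22]
    metric_two = [0.10, 0.18, 0.35, 0.33, 0.52]
    print(spearman(metric_one, metric_two))
```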
7_RealDataStudies: Applying the DE-Score to real data: Overview Datasets
To better understand the relationship between the DaCER and the DE-Score on real data, and to better explore its efficacy and use cases, we selected 6 previously published single protein phylogenetic datasets across 4 previously published papers (Vieira, F.G. and Rozas, J. 2011, Fleming, J.F., Pisani, D., et al. 2021, Novotná Floriančičová, K., Baltzis, A., et al. 2023, Giacomelli, M., Vecchi, M., et al. 2025), and 21 previously published multi-protein phylogenetic datasets from 21 previously published papers (Misof et al. 2014, Fernández et al. 2016, Fernández et al. 2017, Irisarri et al 2017, Kocot et al 2017, Peters et al. 2017, Fernandez et al 2018, Hughes et al 2018, Johnson et al. 2018, Sharma et al. 2018, Shen, Jin et al 2018, Shen, Opulente et al. 2018, Benavides et al. 2019, Evangelista et al 2019, Kawahara et al. 2019, Simon et al. 2019, Steenwyk et al 2019, Milla et al 2020, Mongiardo, Koch & Thompson 2021, Wibberg et al 2021, Herranz et al 2022).
8_ComparisonWithSatuRation: Comparing the DE-Score to the Mean Lambda Entropy
Using the branch scaling factor datasets previously used to compare the DE-Score to the Slope measurement (6_SaturationSims), we further examined the effectiveness of the DE-Score in comparison to another metric of site saturation in amino acid datasets: mean historical signal strength (λ).
Changes after Dec 17, 2025:
v4.0 (Dec 22nd 2025)
4_Other - A reupload of this section to match the most recent version on the GitHub which improves the user experience through improved commenting, and changes the logical flow of the Noisemaker.pl script for accessibility.
Changes after Aug 12, 2024:
This is the third upload of this supplemental dataset, and significant changes have occurred between this version and the original upload.
v3.0 (DEC 17th 2025)
4_Other - Both scripts, Noisemaker.pl and DEScoreCalculator.pl, have been significantly revised. Noisemaker has changed from a Ruby script to a Perl script, and the DEScoreCalculator has moved to Version 1.5, which includes a new, faster logic for calculating the DE-Score, a removal of Perl module dependencies, and a POD.
7_Real Data Studies - The multi-gene datasets have expanded from 5 to 21.
v2.0 (AUG 4th 2025)
Most notably, this version divides the supplemental data into individual zip files, allowing users to more easily access the data they are interested in. This comes with a new data structure that groups the data into 8 categories: 1_Kocot2017.zip, 2_MissingData_Simulations.zip, 3_NoiseSimulations.zip, 4_Other.zip, 5_PositionAndTaxaIncrease_DEScore.zip, 6_SaturationSims.zip, 7_RealDataStudies.zip, 8_ComparisonWithSatuRation.zip.
In addition to this new partitioning, sections 7 and 8 are both new. Section 7 presents 6 single gene and 5 multi-gene datasets to evidence the efficacy of the DE-Score. Section 8 compares the DE-Score with SatuRation, a direct measurement of entropy in a multiple sequence alignment.
