The Dayhoff Exchange Score: A new metric to quantify site saturation in amino acid datasets prior to phylogenetic analysis
Data files
Aug 12, 2024 version files 8.82 GB
- DEScoreSuppData.zip (8.82 GB)
- README.md (10.55 KB)
Abstract
Site saturation is a persistent problem in phylogenetic analyses, where it can hinder the accuracy of topology reconstruction. It is fundamentally caused by large amounts of independent change along branches, causing the model to be unable to distinguish phylogenetic signal from noise. The Dayhoff Exchange Score (DE-score) is a new metric to assess site saturation within and between amino acid datasets, which provides both a whole dataset overview and taxon-specific values that represent the contribution of a given taxon to the whole dataset saturation. We first assess the efficacy of this score at detecting increased site saturation over 20,000 simulation datasets, compare it to the existing Slope R2 score and then assess its efficacy in the face of the potentially confounding factors of increasing taxon number, number of positions in the alignment, missing data and noise. Finally, we use the DE-Score to re-evaluate a previously published dataset by Kocot et al (2017), to illustrate its efficacy.
The DE-Score is available at:
https://github.com/JFFleming/DEScore
README: The Dayhoff Exchange Score: A new metric to quantify site saturation in amino acid datasets prior to phylogenetic analysis.
https://doi.org/10.5061/dryad.34tmpg4tm
Description of the data and file structure
This supplemental information accompanies the DEScore paper, and is formatted as follows:
Inside the directory DEScoreSuppData you will find 6 folders:
1. Kocot2017
2. MissingData_Simulations
3. NoiseSimulations
4. Other
5. PositionAndTaxaIncrease_DEScore
6. SaturationSims
1. Kocot2017
This directory contains two further directories, Kocot2017_Slope and DEScore_Reselected. Kocot2017_Slope comprises 6 directories: these are copies of the Best 106 genes through to Best 532 genes as selected by the Slope measurements in Kocot et al 2017. Each of these directories contains a phylip alignment file, a partition file in txt format, and the output of a RAxML analysis: a most likely tree (RAxML_bipartitions.<Slope Sextile>.tre), bootstrap support files, and a bipartition file.
DEScore_Reselected comprises 5 directories, each corresponding to one gene set from the Best 106 genes through to the Best 532 genes as selected by the DE-Score. Each directory contains 10 files: a fasta multiple sequence alignment file, a list of selected genes in txt format, two partition files in text format (one formatted by cat sequences, and one "clean" version formatted for input to IQtree), a most likely tree (DEScore.<Best Sextile>.partitions.clean.txt.treefile) and 5 intermediate output files supplied by IQtree: the maximum likelihood distances (mldist), the logs of the phylogenetic tree run (.iqtree and .log), the checkpoint file (ckp.gz), and the neighbor joining tree used to initialize the tree search (bionj).
2. MissingData_Simulation
This directory contains the Missing Data simulation data, in both Gapped and Ungapped forms, in the Gaps and NoGaps directories, respectively. The Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev) for each simulation category can be found in Missing_<SextileNumber>.TransitionTransversion.Pairwise.txt, in the Gaps and NoGaps directories. These measurements were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
Inside each further directory (labelled by sextile from 1of6 to 6of6) is the simulation data from the Fleming & Struck (2023) paper that is reused here, in phylip multiple sequence alignment format.
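The summary files above report a Dayhoff Group Exchange Frequency. As a rough illustration of the underlying idea only, the Python sketch below sorts the differences between two aligned amino acid sequences into those that stay within one of the six standard Dayhoff recoding groups (AGPST, DENQ, HKR, ILMV, FWY, C) and those that cross between groups. The function name `classify_pair` is hypothetical, the skipping of gaps and ambiguous characters is an assumption, and this is not the DE-Score calculation itself; the authoritative implementation is DEScore_Calculator.pl.

```python
# Standard six-category Dayhoff recoding groups.
DAYHOFF_GROUPS = ("AGPST", "DENQ", "HKR", "ILMV", "FWY", "C")
GROUP_OF = {aa: i for i, group in enumerate(DAYHOFF_GROUPS) for aa in group}


def classify_pair(seq_a, seq_b):
    """Count differences between two aligned sequences by Dayhoff group.

    Returns (within_group, between_group, compared_sites).
    Sites where either sequence has a gap or a non-standard character
    are skipped (an assumption made for this sketch).
    """
    within = between = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in GROUP_OF or b not in GROUP_OF:
            continue  # gap, missing data, or ambiguity code
        compared += 1
        if a == b:
            continue  # identical residues: no exchange
        if GROUP_OF[a] == GROUP_OF[b]:
            within += 1   # e.g. A <-> S, both in AGPST
        else:
            between += 1  # e.g. K <-> F, HKR versus FWY
    return within, between, compared


within, between, compared = classify_pair("AGDK-C", "SGEFAC")
print(within, between, compared)
```

How these counts are turned into the frequency and standard deviation columns of the TransitionTransversion.Pairwise.txt files is not described here, so any comparison against those files should be checked against the published script.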
3. NoiseSimulations
This directory contains the Noise Simulations generated by NoiseMaker (which can be found at the DEScore Github, and in 4. Other).
Inside NoiseSimulations are three directories: OriginalSims, NoiseMaker and ConCatSims. OriginalSims contains 100 simulation datasets, in phylip format, that can also be found in 5_PositionAndTaxaIncrease_DEScore/250Taxa/2100Bases/. NoiseMaker contains the pure noise that was added to these datasets, organised in 5 directories (from 10Percent Noise through to 50Percent Noise). The noise is also in phylip format.
The ConCatSims directory contains the data analysed in the Noise Simulation analysis. The Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev) for each simulation category can be found in <NoisePercent>.TransitionTransversion.Pairwise.txt. These measurements were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
The ConCatSims directory also contains the alignments, in the 5 directories 10Percent, 20Percent, 30Percent, 40Percent and 50Percent. These directories each contain 100 phylip files, each corresponding to the concatenation of one of the original simulations (250 taxa by 2100 positions) with its paired noise alignment. For example, OriginalSims/250by2100.aa.1.fas.phy is paired with NoiseMaker/10Percent/1.Noise.phy to produce ConCatSims/10Percent/10Percent.1.Concat.phy.
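The pairing in the example above can be sketched in Python as follows. This assumes that the naming pattern of the single documented example holds for every simulation number and noise level, which the README does not state explicitly; `concat_paths` and `NOISE_LEVELS` are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# Noise level directory names as listed for ConCatSims.
NOISE_LEVELS = [f"{percent}Percent" for percent in range(10, 60, 10)]


def concat_paths(sim_number, noise_level):
    """Return (original, noise, concatenated) paths relative to NoiseSimulations.

    The file name patterns are generalised from the one example given in
    the README and are an assumption for all other simulations.
    """
    original = PurePosixPath("OriginalSims") / f"250by2100.aa.{sim_number}.fas.phy"
    noise = PurePosixPath("NoiseMaker") / noise_level / f"{sim_number}.Noise.phy"
    concat = (
        PurePosixPath("ConCatSims")
        / noise_level
        / f"{noise_level}.{sim_number}.Concat.phy"
    )
    return original, noise, concat


for path in concat_paths(1, "10Percent"):
    print(path)
```

Checking that all three files exist for a given simulation number before analysis is a cheap way to confirm the pattern against the unpacked archive.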
4. Other
This directory contains the software that can be found at the Github: DEScore_Calculator.pl and Noisemaker.rub, in the subdirectory DEScore-main.
It additionally contains the DEScore Excel file, which provides a compiled form of the data used to initially calculate the DE-Score. One sheet in this file, Kocot_RealData, contains N/A entries. An N/A in a given column indicates that the gene was not selected within the corresponding DE-Score Sextile.
5. PositionAndTaxaIncrease_DEScore
This directory contains the simulation data, originally sourced from Fleming & Struck (2023), corresponding to amino acid datasets with increasing numbers of taxa and amino acid positions from 50 to 500 taxa and 300 to 3000 amino acid positions.
This directory is first divided by taxa, and within that it is divided by position. For example, 500Taxa/300Bases corresponds to a 300 amino acid long alignment with 500 taxa.
In each Taxa folder, you will find 10 txt files named <PositionNumber>Bases.TransitionTransversion.Pairwise.txt. Each file contains the Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev). These were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
Each taxa-position pair folder (e.g. PositionAndTaxaIncrease_DEScore/50Taxa/300Bases) additionally contains the 100 simulations in .phy format.
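A minimal Python sketch for navigating this layout is shown below. It only builds paths from the folder and file name patterns described above; the helper names `summary_file` and `list_simulations` are hypothetical, and the taxon and position counts are passed in rather than assumed.

```python
from pathlib import Path


def summary_file(root, taxa, positions):
    """Path of the per-category summary txt file inside a Taxa folder."""
    name = f"{positions}Bases.TransitionTransversion.Pairwise.txt"
    return Path(root) / f"{taxa}Taxa" / name


def list_simulations(root, taxa, positions):
    """Sorted list of .phy alignments for one taxa-position pair.

    Returns an empty list if the folder does not exist.
    """
    folder = Path(root) / f"{taxa}Taxa" / f"{positions}Bases"
    return sorted(folder.glob("*.phy"))


root = "PositionAndTaxaIncrease_DEScore"
print(summary_file(root, 500, 300))
print(len(list_simulations(root, 50, 300)), "alignments found")
```

With the archive unpacked and `root` pointing at the real directory, each taxa-position pair would be expected to yield 100 alignments.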
6. SaturationSims
This directory contains the simulation data, originally sourced from Hernandez & Ryan (2021), corresponding to amino acid datasets with increasing amounts of saturation (due to branch length scaling factors).
Inside this directory you will find 01-DAYHOFF and 02-JTT, as in the supplemental data of Hernandez & Ryan 2021. These directories correspond to the model the data was generated under.
Within 01-DAYHOFF you will find:
- Chang.tre: the original Chang et al tree file. This is the "true tree" used for the simulation comparisons.
- Chang.PAM.<number>.TransitionTransversion.Pairwise.txt: the Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev) for each simulation category, alongside the average uncorrected p-distance and the standard deviation of the p-distance of the simulation. These measurements were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
- 20 further directories corresponding to the 20 branch length scaling factors, each named Chang_PAM_<number>.
Inside each of these directories, for each simulation, you will find:
- A phylip file
- A most likely tree (<simulation number>.phy.treefile)
- Intermediate output files supplied by IQtree, detailing the maximum likelihood distances (mldist), the logs of the phylogenetic tree run (.iqtree and .log), the checkpoint file (ckp), and the neighbor joining tree used to initialize the tree search (bionj)
- The SlopeResults directory, which contains an additional 20 directories (Chang_PAM_<number>_Slope for each simulation category).
Inside each of these 20 directories are:
- Chang.PAM.<number>.alignments.txt - a list of the alignment phylip files for each category
- Chang.PAM.resolvedTrees.txt - a list of the tree files for each category
- Chang.PAM.trees.txt - the input file for a TreSpex analysis against the generating tree
- log.txt - the log of the TreSpex Slope analysis against the generating tree
- log.resolved.txt - the log of the TreSpex Slope analysis against the resolved trees
- Saturation_Files_used.log - output of TreSpex confirming the alignments assessed in the analysis
- Summary_Saturation.log - log of the command used by TreSpex to recover the resolved trees analysis
Alongside two further directories - GeneratingCorrelation_Results and TrueCorrelation_Results:
- Correlation_Slope_Summary.txt - this file contains the Slope and R2 measurements for each alignment tested.
- Both directories comprise two directories: one containing the P Matrices for each simulation analysis in txt format, and one containing the PD matrices in txt format. These two values are used to calculate the Slope value.
- Under GeneratingCorrelation_Results, only one PDMatrix txt file exists, as all alignments were compared against the generating tree (Chang.tre). In RecoveredCorrelation_Results, you will find one PDMatrix for each tree.
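To illustrate how a P matrix and a PD matrix can be combined into Slope and R2 values, the Python sketch below fits an ordinary least-squares line to the pairwise values. It assumes that the P matrix holds uncorrected p-distances and the PD matrix holds patristic (tree) distances, that both are square, symmetric and in the same taxon order, and that p-distance is regressed on patristic distance. The file parsing step is omitted because the txt layout is not described here, and `slope_and_r2` is a hypothetical name rather than a TreSpex function.

```python
def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def slope_and_r2(p_matrix, pd_matrix):
    """Least-squares slope and R2 of p-distance (y) on patristic distance (x)."""
    y = upper_triangle(p_matrix)
    x = upper_triangle(pd_matrix)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("matrices must have the same size and at least 3 taxa")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("distances must vary to fit a regression line")
    slope = sxy / sxx
    r2 = (sxy ** 2) / (sxx * syy)
    return slope, r2


# Toy example: p-distances are exactly half the patristic distances.
pd_matrix = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
p_matrix = [[0, 0.5, 1.0], [0.5, 0, 1.5], [1.0, 1.5, 0]]
print(slope_and_r2(p_matrix, pd_matrix))
```

Under this reading, a slope well below 1 means that observed differences accumulate more slowly than the tree distances, which is the pattern expected as sites saturate.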
Within 02-JTT you will find:
- 20 further directories corresponding to the 20 branch length scaling factors, each named Chang_JTT_<number>. These contain the simulation alignments in phylip format generated under the JTT model in Hernandez & Ryan 2021 for each branch length scaling factor.
- Chang.JTT.<number>.TransitionTransversion.Pairwise.txt - This file contains the Dayhoff Group Exchange Frequency (labelled Transition Frequency), the absolute count standard deviation (Transition StdDev) and the Frequency standard deviation (TransFreqStDev). These were used to originally calculate the DEScore (as seen in the Excel sheet in 4.Other).
- Chang.JTT.<number>.phy - a phylip alignment used in Hernandez & Ryan 2021 as the seed to generate the alignments under that branch length scaling category.
Sharing/Access information
Data was derived from the following sources:
- Fleming & Struck 2023: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05270-8#Fun
- Hernandez & Ryan 2021: https://academic.oup.com/sysbio/article/70/6/1200/6219974
- Kocot et al 2017: https://academic.oup.com/sysbio/article/66/2/256/2449704
Code/Software
Software written during the project can be found in 4. Other, and additionally at https://github.com/JFFleming/DEScore.
Additionally, this project used IQTree for phylogenetic reconstruction (https://www.iqtree.org/) and TreSpex for Slope and R2 analyses (https://journals.sagepub.com/doi/10.4137/EBO.S14239).