Six-state amino acid recoding is not an effective strategy to offset compositional heterogeneity and saturation in phylogenetic analyses

Cite this dataset

Hernandez, Alexandra; Ryan, Joseph (2021). Six-state amino acid recoding is not an effective strategy to offset compositional heterogeneity and saturation in phylogenetic analyses [Dataset]. Dryad. https://doi.org/10.5061/dryad.5mkkwh757

Abstract

Six-state amino acid recoding strategies are commonly applied to combat the effects of compositional heterogeneity and substitution saturation in phylogenetic analyses. While these methods have been endorsed from a theoretical perspective, their performance has never been extensively tested. Here, we test the effectiveness of 6-state recoding approaches by comparing the performance of analyses on recoded and non-recoded datasets that have been simulated under gradients of compositional heterogeneity or saturation. In our simulation analyses, non-recoding approaches consistently outperform 6-state recoding approaches. Our results suggest that 6-state recoding strategies are not effective in the face of high saturation. Further, while recoding strategies do buffer the effects of compositional heterogeneity, the loss of information that accompanies 6-state recoding outweighs its benefits. In addition, we evaluate recoding schemes with 9, 12, 15, and 18 states and show that these consistently outperform 6-state recoding. Our analyses of other recoding schemes suggest that under conditions of very high compositional heterogeneity, it may be advantageous to apply recoding using more than 6 states, but we caution that applying any recoding should include sufficient justification. Our results have important implications for the more than 90 published papers that have incorporated 6-state recoding, many of which have significant bearing on relationships across the tree of life.
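As a rough illustration of what 6-state recoding does to an alignment, the sketch below collapses the 20 amino acids into six groups. It assumes the Dayhoff 6-state grouping (AGPST, DENQ, HKR, ILMV, FWY, C) as the example scheme; the abstract does not name a specific scheme, and the function name, state labels, and handling of gaps or ambiguous characters are hypothetical choices made for this example, not the authors' pipeline.

```python
# Dayhoff 6-state grouping, used here only as an assumed example scheme.
# The state labels "0".."5" are arbitrary.
DAYHOFF6 = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "ILMV",
    "4": "FWY",
    "5": "C",
}

# Invert the grouping into a per-residue lookup table (20 entries).
RECODE = {aa: state for state, group in DAYHOFF6.items() for aa in group}


def recode(seq, table=RECODE, missing="-"):
    """Recode one aligned protein sequence into 6 states.

    Characters outside the 20 standard amino acids (gaps, X, etc.)
    are written as `missing`; this is an assumption of the sketch.
    """
    return "".join(table.get(ch, missing) for ch in seq.upper())


if __name__ == "__main__":
    print(recode("MKC"))
    print(recode("ACDX-"))
```

Each recoded column can take at most six values instead of twenty, which illustrates the loss of information that the abstract weighs against the buffering of compositional heterogeneity.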

Usage notes

To reproduce the analyses, see the README file included with the alignments; it explains how to integrate the alignments into the GitHub repository associated with the data (https://github.com/josephryan/Hernandez_Ryan_2021_Recoding). NOTE: The 40 million alignments with no induced compositional heterogeneity (used to generate null distributions) are not included. Instructions for generating these alignments are in the 02-COMPOSITIONAL_HETEROGENEITY/01-NULL_DISTRIBUTION directory of the GitHub repository.