Dryad

Compositionally constrained sites drive long branch attraction

Cite this dataset

Szánthó, Lénárd Lajos; Lartillot, Nicolas; Szöllősi, Gergely; Schrempf, Dominik (2023). Compositionally constrained sites drive long branch attraction [Dataset]. Dryad. https://doi.org/10.5061/dryad.g79cnp5rh

Abstract

Accurate phylogenies are fundamental to our understanding of the pattern and process of evolution. Yet, phylogenies at deep evolutionary timescales, with correspondingly long branches, have been fraught with controversy resulting from conflicting estimates from models with varying complexity and goodness of fit. Analyses of historical as well as current empirical datasets, such as alignments including Microsporidia, Nematoda or Platyhelminthes, have demonstrated that inadequate modeling of across-site compositional heterogeneity, which is the result of biochemical constraints that lead to varying patterns of accepted amino acids along sequences, can lead to erroneous topologies that are strongly supported. Unfortunately, models that adequately account for across-site compositional heterogeneity remain computationally challenging or intractable for an increasing fraction of contemporary datasets. Here, we introduce "compositional constraint analysis", a method to investigate the effect of site-specific amino acid diversity on phylogenetic inference, and show that more constrained sites with lower diversity and less constrained sites with higher diversity exhibit ostensibly conflicting signal under models ignoring across-site compositional heterogeneity and thus contribute to topological bias and long branch attraction artifacts. We demonstrate that more complex models accounting for across-site compositional heterogeneity can ameliorate this bias. We present CAT-PMSF, a pipeline for diagnosing and resolving phylogenetic bias resulting from inadequate modeling of across-site compositional heterogeneity based on the CAT model. CAT-PMSF is robust against long branch attraction in all alignments we have examined. We suggest using CAT-PMSF when convergence of the CAT model cannot be assured.
We find evidence that compositionally constrained sites are driving long branch attraction in two metazoan datasets and recover evidence for Porifera as the sister group to all other animals.
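
To make the idea of site-specific amino acid diversity concrete, the following is a minimal sketch of how alignment columns could be scored and split into more and less constrained sets. It is an illustration under stated assumptions, not the authors' implementation: the diversity measure used here (the effective number of amino acids, computed as the exponential of the Shannon entropy of the observed residues), the function names `site_diversity` and `split_by_diversity`, and the threshold value are all hypothetical choices made for this example. The README and the published article describe the measure and binning actually used.

```python
import math
from collections import Counter

# The 20 standard amino acids; gaps and ambiguity codes are ignored.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def site_diversity(column):
    """Effective number of amino acids at one alignment column.

    Assumption for this sketch: diversity = exp(Shannon entropy) of the
    observed residue frequencies. Returns 0.0 for columns with no residues.
    """
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)


def split_by_diversity(sequences, threshold):
    """Split column indices into (constrained, unconstrained) lists.

    `sequences` is a list of equal-length aligned strings. Columns whose
    diversity is at most `threshold` (a hypothetical cutoff) are treated
    as constrained; all others as unconstrained.
    """
    columns = ["".join(col) for col in zip(*sequences)]
    constrained, unconstrained = [], []
    for index, column in enumerate(columns):
        if site_diversity(column) <= threshold:
            constrained.append(index)
        else:
            unconstrained.append(index)
    return constrained, unconstrained


if __name__ == "__main__":
    toy_alignment = ["AAK", "AGL", "A-M", "AGN"]
    low, high = split_by_diversity(toy_alignment, threshold=2.0)
    print("constrained sites:", low)
    print("unconstrained sites:", high)
```

Each resulting set of column indices could then be extracted into its own sub-alignment and analysed separately, which is the kind of comparison the abstract describes between low-diversity and high-diversity sites.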

Methods

Please see the README document ("README.md") and the accompanying published article: Lénárd L. Szánthó, Nicolas Lartillot, Gergely J. Szöllősi and Dominik Schrempf 2023. Compositionally constrained sites drive long branch attraction. Systematic Biology. Accepted. DOI: 10.1093/sysbio/syad013

Usage notes

Please see the README document ("README.md") and the accompanying published article: Lénárd L. Szánthó, Nicolas Lartillot, Gergely J. Szöllősi and Dominik Schrempf 2023. Compositionally constrained sites drive long branch attraction. Systematic Biology. Accepted. DOI: 10.1093/sysbio/syad013

Funding

Gordon and Betty Moore Foundation, Award: GBMF9741

European Research Council, Award: 714774