Structure and stability constrained substitution models outperform traditional substitution models used for evolutionary inference

Citation

Bastolla, Ugo; Lorca, Ivan; Arenas, Miguel (2022), Structure and stability constrained substitution models outperform traditional substitution models used for evolutionary inference, Dryad, Dataset, https://doi.org/10.5061/dryad.6wwpzgn2g

Abstract

Current knowledge about how protein structures influence sequence evolution is rarely incorporated into the substitution models adopted for phylogenetic inference, which are commonly based on independent sites with the same substitution process and ignore the known variation of evolutionary rates across sites with different structural properties. In previous work, we presented site-specific substitution models of protein evolution based on selection on the folding stability of the native state (Stab-CPE), which more realistically predict the evolutionary variability across protein sites. However, those Stab-CPE models present qualitative differences from observed data, probably because they ignore changes in the native structure, despite empirical studies suggesting that conservation of the native structure is a strong selective force. Here we present novel structurally constrained substitution models (Str-CPE) based on Julián Echave’s model of the structural change due to a mutation as the linear response of the protein to a perturbation, and on an explicit model of the perturbation generated by a specific amino-acid mutation. Compared to our previous Stab-CPE models, the novel Str-CPE models are more stringent (they predict lower sequence entropy and substitution rate), assign higher likelihood to multiple sequence alignments (MSAs) of the wild-type protein, and better predict the observed substitution rates. Next, we combine the Str-CPE and Stab-CPE models to obtain structure and stability constrained substitution models (SSCPE) that fit the empirical MSAs even better. Importantly, these SSCPE models present a relevant improvement of the phylogenetic likelihood for all ten protein families that we analyzed with the program RAxML-NG. We implemented the SSCPE models in the program Prot_evol, freely available at https://github.com/ugobas/Prot_evol.

Methods

The data were generated by the programs tnm (torsional network model; Mendez and Bastolla 2010) and Prot_evol, whose latest version is presented in the paper associated with this dataset.

Usage Notes

The files are plain-text files that can be opened with any text editor.

Funding

Agencia Estatal de Investigación, Award: PID2019-109041GB-C22/10.13039/501100011033

Agencia Estatal de Investigación, Award: PID2019-107931GA-I00/AEI/10.13039/501100011033