Structure and stability constrained substitution models outperform traditional substitution models used for evolutionary inference
Bastolla, Ugo; Lorca, Ivan; Arenas, Miguel (2022), Structure and stability constrained substitution models outperform traditional substitution models used for evolutionary inference, Dryad, Dataset, https://doi.org/10.5061/dryad.6wwpzgn2g
The current knowledge about how protein structures influence sequence evolution is rarely incorporated into substitution models adopted for phylogenetic inference, which are commonly based on independent sites with the same substitution process and ignore the known variation of the evolutionary rates across sites with different structural properties. In previous works, we presented site-specific substitution models of protein evolution based on selection on the folding stability of the native state (Stab-CPE), which predict more realistically the evolutionary variability across protein sites. However, those Stab-CPE models present qualitative differences from observed data, probably because they ignore changes in the native structure, despite empirical studies suggesting that conservation of the native structure is a strong selective force. Here we present novel structurally constrained substitution models (Str-CPE) based on Julián Echave’s model of the structural change due to a mutation as the linear response of the protein to a perturbation and on the explicit model of the perturbation generated by a specific amino-acid mutation. Compared to our previous Stab-CPE models, the novel Str-CPE models are more stringent (they predict lower sequence entropy and substitution rate), assign higher likelihood to multiple sequence alignments (MSAs) of the wild-type protein, and better predict the observed substitution rates. Next, we combine Str-CPE and Stab-CPE models to obtain structure and stability constrained substitution models (SSCPE) that fit the empirical MSAs even better. Importantly, these SSCPE models yield a relevant improvement in the phylogenetic likelihood for all ten protein families that we analyzed with the program RAxML-NG. We implemented the SSCPE models in the program Prot_evol, freely available at https://github.com/ugobas/Prot_evol.
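The abstract compares models by the sequence entropy they predict at each site. As a rough illustration of that quantity only, the sketch below computes the empirical per-site Shannon entropy of an MSA from observed amino-acid counts. It is not taken from Prot_evol, which derives entropies from model-predicted site-specific frequencies; the function name `site_entropies` and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def site_entropies(msa, gap_chars="-."):
    """Return the Shannon entropy (in nats) of each alignment column.

    Assumption: `msa` is a list of equal-length aligned sequences;
    gap characters are ignored when counting residues.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    for column in zip(*msa):
        counts = Counter(aa for aa in column if aa not in gap_chars)
        total = sum(counts.values())
        if total == 0:
            # Column made only of gaps: no information, report zero.
            entropies.append(0.0)
            continue
        # H = sum_a p_a * ln(1 / p_a), with p_a the observed frequency.
        h = sum((n / total) * math.log(total / n) for n in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical toy alignment, for illustration only.
toy_msa = ["MKV-L", "MRVAL", "MKIAL", "MQVAL"]
entropies = site_entropies(toy_msa)
```

A fully conserved column has zero entropy, and more variable columns have higher values, so a more stringent model is one whose predicted site frequencies give lower entropy on average.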
The data were generated by the programs tnm (torsional network model, Mendez and Bastolla 2010) and Prot_evol, whose latest version is presented in the paper associated with this dataset.
The files are plain-text files that can be opened with any text editor.
Agencia Estatal de Investigación, Award: PID2019-109041GB-C22/10.13039/501100011033
Agencia Estatal de Investigación, Award: PID2019-107931GA-I00/AEI/10.13039/501100011033