Site-specific structure and stability constrained substitution models improve phylogenetic inference

Bastolla, Ugo 1 ; Lorca, Ivan1 ; Arenas, Miguel 2

Published Oct 13, 2022; Updated Jul 02, 2024 on Dryad. https://doi.org/10.5061/dryad.6wwpzgn2g

Abstract

In previous studies, we presented site-specific substitution models of protein evolution based on selection on the folding stability of the native state (Stab-CPE), which predict more realistically the evolutionary variability across protein sites. However, those Stab-CPE present qualitative differences from observed data, probably because they ignore changes in the native structure, despite empirical studies suggesting that conservation of the native structure is a stronger selective force than selection on folding stability.

Here we present novel structurally constrained substitution models (Str-CPE) based on Julian Echave's model of the structural change due to a mutation as the linear response of the protein to a perturbation and on the explicit model of the perturbation generated by a specific amino-acid mutation. Compared to our previous Stab-CPE models, the novel Str-CPE models are more stringent (they predict lower sequence entropy and substitution rate), provide higher likelihood to multiple sequence alignments (MSA) that include one or more known structures, and better predict the observed conservation across sites. The models that combine Str-CPE and Stab-CPE models are even more stringent and fit the empirical MSAs better.

We refer collectively to our models as structure and stability constrained substitution models (SSCPE). Importantly in comparison to the traditional empirical substitution models, the SSCPE models infer phylogenetic trees of distantly related proteins more similar to reference trees based on structural information.

We implemented the SSCPE models in the program SSCPE.pl, freely available at https://github.com/ugobas/SSCPE, which infers phylogenetic trees under the SSCPE models with the program RAxML-NG from a concatenated alignment and a list of protein structures that overlap with it.

This repository contains programs and data used for calibrating and testing the Structure and stability constrained protein evolution (SSCPE) substitution models computed by the program Prot_evol https://github.com/ugobas/Prot_evol

The program Prot_evol takes as input a list of protein structures (PDB files) and a multiple sequence alignment (MSA) at the amino acid level. It can be run directly through the framework SSCPE.pl, which computes the site-specific substitution models with Prot_evol and input them to the program RAxML-NG (Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35: 4453-4455) for inferring maximum likelihood phylogenetic trees.

We adopted 3 data sets: 203 monomeric proteins with known structure and 10-300 sequences MSAs for which we computed the SSCPE models, a selection of 10 of these proteins for phylogenetic inference with RAxML-NG, and 5 groups of proteins related at the distant superfamily level for which we compared ML trees obtained with the SSCPE models and with empirical substitution models to structure-based phylogenetic trees obtained with the program PC_ali (Bastolla U, Piette O, Abia D 2023. PC_ali: a tool for improved multiple alignments and evolutionary inference based on a hybrid protein sequence and structure similarity score. Bioinformatics 39:btad630).

All files in the zip archives are plain text files for each studied PDB structure. Their name begins with the PDB code, sometimes in upper and sometimes in lower case, followed by the chain code (ex. 1BP2A, 1BP2_A or 1bp2A for chain A of the PDB file 1bp2). Lines beginning with # are comments.

Monomeric dataset:

Alignments.zip : Multiple sequence alignments including one PDB structure that gives name to the alignment. Courtesy of Julian Echave, Universidad Nacional de San Martin, Argentina (ex: ALI=1BP2_A_mafft.fasta). They are input to Prot_evol or SSCPE.pl as -ali <alignment_file_name>
RMSD.zip and DE.zip: structural deformations RMSD and DE generated by all point mutations of the wild-type protein in the PDB, as predicted by the program tnm https://github.com/ugobas/tnm with option -pred_mut or input parameters PRED_MUT=1 and MUT_PARA=Mutation_para.in *
Name: <>RMSD.dat (ex. 1BP2A.mut*̣_RMSD.dat, 1BP2.mut_DE.dat). These files present one line for each residue in the PDB and 22 columns: Mutated residue, number of contacts, and predicted effect of the mutations (RMSD) to each of the 20 amino acids in the order indicated in the third line of the file.
It is used for computing the SSCPE models and either it is input to Prot_evol through the configuration file Input_Prot_evol.in, record "STR_MUT=<structure_mutation_file_name>" (ex: STR_MUT=1BP2A.mut_RMSD.dat) or Prot_evol looks for it in the folder TNM_DATA, created upon installation and, if it does not exist it computes it running the tnm program.
Rates_monomeric.zip : Site-specific evolutionary rate, sequence entropy and hydrophobicity predicted by Prot_evol for every site of the tested proteins and number of contacts computed from the PDB structure. Name: ..rate_profile.dat (ex. 1bp2A.SSCPE.RMSDWT.rate_profile.dat), where MODEL is one of the eight SSCPE models that we studied. For each position in the PDB file, 9 columns are shown. They are named in the last comment line that begins with #. The columns represent: (1) Residue number in the PDB (2) Amino acid in the PDB (3) Secondary structure (4) Number of native contacts (5) Site-specific entropy of the predicted stationary amino acid frequency, S_i=-sum_a P_i(a) log(P_i(a)) (6) Mean hydrophobicity of the predicted stationary amino acid frequency, h_i= sum_a P_i(a)*hydro(a) (7-8-9-10) Predicted site-specific substitution rate R_i=sum_b P_i(a)Q_i(a,b)=sum_b P_i(a)P_i(b)E_i(a,b) with exchangeability matrix E_i determined through the flux model (7), the empirical model (8), the flux model plus the Halpern-Bruno model (9, default) or the flux model without the HB model (10).
Summary_monomeric.zip : Properties of the different SSCPE models computed by Prot_evol, averaged over sites i for any test protein (ex. 1bp2A.SSCPE.summary.dat). For each model (column 1) the columns represent: (2) Lambda0 (optimized selection parameter of the first fitness component) (3) Lambda1 (optimized selection parameter of the 2nd fitness component if present) (4) lik(MSA) Average log-likelihood of each MSA column MSA_i(a) with respect to the site-specific model P_i(a) (5,6,8) Average Kullback-Leibler divergencies: (5) between model P_i(a) and regularized MSA f_i(a), (6) between regularized MSA f_i(a) and model P_i(a) and (8) between model P_i(a) and global frequencies P_mut(a), (7) Score that is minimized (KL or -log_lik), (9) Average of the site-specific entropy S_i of the model, (10) Model averaged the folding free energy DeltaG (11) freezing temperature Tf predicted through the Random Eenergy Model (in units of room temperature) (12) Mean of the model average hydrophobicity. For each SSCPE model of the stationary frequencies, 8 exchangeability matrices are considered according to the combinations of the options Halpern-Bruno (yes/no), flux (yes/no) and rate (actual or normalized) and an approximate estimate of the sum of pairwise log likelihood is printed.
Tables_monomeric.zip: Summary results averaged across 203 monomeric proteins. It contains the file Results_SSCPE_new.txt For every model (one per line) this file presents the following values and standard errors, averaged across all sites of the 203 proteins and described in the line #1=... : log likelihood of the MSA; sequence entropy; Kullback-Leibler (KL) divergence between the model and the MSA; symmetrized KL divergence; KL divergence from the global (not site-specific) model; model averaged folding free energy DeltaG; selection parameters Lambda1 and Lambra2. It also contains the files named Results_cont_[MODEL]_NEW.txt, one per each SSCPE model. These files present the average and standard error of site-specific properties for sites with given number of native contacts, as listed in the line #cont... : number of contacts, number of sites, hydrophobicity, sequence entropy, and substitution rate in the four variations noHB-noflux, noHB-flux, HB-noflux, HB-flux.
Prot_evol_models.zip : 10 site-specific SSCPE substitution models used for testing the tree likelihood with the program RAxML-ng (Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35: 4453-4455). Name: .SSCPE..AA_profiles.txt where PDB is PDB code+chain index, MODEL is one of the SSCPE models (ex. 1bp2A.SSCPE.RMSD.AA_profiles.txt). For each site (one per line), the file reports the column of the MSA that it models, the amino acid in the reference sequence, the frequencies of the 20 amino acids and the sequence entropy. The script SSCPE.pl runs Prot_evol, computes the specified exchangeability matrix and transforms the models in a format that can be input to RAxML-NG.
RAxML_results.zip : Results obtained running 12 empirical models and 16 SSCPE models with the program RAxML-ng (Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. 2019). For each protein and model, it contains the ML tree inferred by raxml-ng [Protein].[Model].raxml.bestTree and the log file of RAXML-NG or SSCPE.pl

Superfamily data sets: [superfamily].zip

For each of 5 superfamilies (Aldolase, Globins_C1, NADP_C1, Ploop_C1, Ploop_C2) the file named [supfam].zip contains the MSA obtained by PC_ali (.fas), the list of PDB codes (.pdb), the NJ tree inferred by PC_ali based on structure information (.unroot.tree) and the ME evolution tree inferred by FastME (Lefort et al. 2015. FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Mol Biol Evol 32:2798-800) with the structure-based divergences computed by PC_ali, the ML trees inferred by raxml for each SSCPE model and empirical model (.raxml.bestTree) and a table with the comparison with reference tree, log.likelihood, branch length and regmlame score of each model (Results.[supfam].txt).

Data used internally by the SSCPE models:

Mutation_para.in : Mutation weights used by the tnm program for predicting the RMSD and DE. Input to the tnm program setting the line MUT_PARA=Mutation_para.in
atom_numb.tsv : Table with the number of heavy atom for each amino acid type used by the program tnm for predicting the structural deformation produced by a mutation
energy_BKV.tsv : Table with the effective contact interaction free energy for each pair of amino acid types used by the program tnm for predicting the structural deformation produced by a mutation (Bastolla, Knapp and Vendruscolo PNAS 2000)
Residues_distances.tsv : Table with the most frequent native distance between each pair of amino acid types used by the program tnm for predicting the structural deformation produced by a mutation.

The PDB files can be downloaded from the PDB site https://www.rcsb.org/

Site-specific structure and stability constrained substitution models improve phylogenetic inference

Data files

Abstract

README: SSCPE

Methods

Usage notes

Works referencing this dataset