Data from: Epistatic contributions promote the unification of incompatible models of neutral molecular evolution

Citation

De la Paz, Jose Alberto; Nartey, Charisse M.; Yuvaraj, Monisha; Morcos, Faruck (2020), Data from: Epistatic contributions promote the unification of incompatible models of neutral molecular evolution, Dryad, Dataset, https://doi.org/10.5061/dryad.2ngf1vhj8

Abstract

We introduce a model of amino acid sequence evolution that accounts for the statistical behavior of real sequences induced by epistatic interactions. The model is referred to as Sequence Evolution with Epistatic Contributions (SEEC). Known statistical properties such as overdispersion, heterotachy and Gamma-distributed rates across sites are shown to be emergent properties of this model while remaining consistent with neutral evolution theory, thereby unifying observations from previously disjoint evolutionary models of sequences. The relationship between site restriction and heterotachy is characterized by tracking the effective alphabet dynamics of sites. We also observe an evolutionary Stokes shift in the fitness of sequences that have undergone evolution under our simulation. This dataset includes all the data used as input to, and produced as output by, the model described in the publication "Epistatic contributions promote the unification of incompatible models of neutral molecular evolution".

Methods

We introduce a model called Sequence Evolution with Epistatic Contributions (SEEC). The model dynamics are based on parameters derived from multiple sequence alignments analyzed with Direct Coupling Analysis (DCA), a global inference method. The dataset contains formatted alignments of protein families compiled from Pfam and processed as described in our manuscript. These alignments are the input to DCA. The dataset also includes the parameters computed by DCA, which are required to run the evolutionary model developed in the paper.

Usage Notes

The dataset includes alignments and MATLAB files (.mat) containing the variables used by the evolutionary model described in the article.

Family alignments and e_ij, h_i parameters (Parameters_orig.mat)

Alignment number   Pfam name         Pfam accession
1                  7tm_1             PF00001
2                  ATP-synt_ab       PF00006
3                  Globin            PF00042
4                  Response_reg      PF00072
5                  ATP-synt_A        PF00119
6                  Sigma54_activat   PF00158
7                  SBP_bac_3         PF00497
8                  HisKA             PF00512
10                 HATPase_c         PF02518
11                 Peripla_BP_3      PF13377

Pairwise coupling and local field parameters computed from msa_numerical.txt with the bmDCA method (https://github.com/matteofigliuzzi/bmDCA). The file is in MAT format and can be read with MATLAB.

This file stores the following MATLAB variables:

Native: Amino-acid sequence used as the first member of the evolutionary simulation. Translated directly from the first sequence in the original fasta alignment using bmDCA_preprocessing.sh from the bmDCA library referenced above.

align: Amino-acid sequences from the fasta alignment used to train the DCA couplings, translated using bmDCA_preprocessing.sh from the bmDCA library referenced above.

e: Coupling matrix (qL x qL) formed from L x L submatrices of size q x q, each holding the pairwise couplings for one pair of sites (the L submatrices on the diagonal are not used). Inferred using bmDCA_v2.1.sh from the bmDCA library referenced above.

h: Local fields matrix (q x L) with the single-site contributions to the Potts Hamiltonian. Inferred using bmDCA_v2.1.sh from the bmDCA library referenced above.
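
To make the layout of e and h concrete, the following is a minimal sketch of how a Potts statistical energy could be evaluated from matrices of these shapes. It is not the Generalhamiltonian.m code from the SEEC library. The function name potts_energy and the toy values are hypothetical, and two conventions are assumed rather than taken from the dataset: the sign convention H = -sum_i h_i(a_i) - sum_{i<j} e_ij(a_i, a_j), and that block (i, j) of e occupies rows i*q to i*q+q-1 and columns j*q to j*q+q-1 (0-based), with states inside a block in the same order as the rows of h.

```python
def potts_energy(seq, e, h, q=21):
    """Statistical energy of one sequence under a Potts model.

    seq: list of L integer states labelled 1..q.
    e:   (q*L) x (q*L) nested list; block (i, j) holds the q x q couplings
         of sites i and j (ASSUMED block layout, see lead-in).
    h:   q x L nested list of local fields (row = state, column = site).
    ASSUMED sign convention: H = -sum_i h_i(a_i) - sum_{i<j} e_ij(a_i, a_j).
    """
    L = len(seq)
    states = [a - 1 for a in seq]  # shift 1..q labels to 0-based offsets
    energy = 0.0
    for i in range(L):
        energy -= h[states[i]][i]  # single-site contribution
        for j in range(i + 1, L):  # each pair once; diagonal blocks skipped
            energy -= e[i * q + states[i]][j * q + states[j]]
    return energy


# Hypothetical toy model with q = 2 states and L = 2 sites.
h_toy = [
    [0.5, -1.0],  # state 1 at site 1, site 2
    [0.0, 2.0],   # state 2 at site 1, site 2
]
e_toy = [
    [0.0, 0.0, 1.0, -0.5],
    [0.0, 0.0, 0.25, 2.0],
    [1.0, 0.25, 0.0, 0.0],
    [-0.5, 2.0, 0.0, 0.0],
]

print(potts_energy([1, 2], e_toy, h_toy, q=2))
print(potts_energy([2, 1], e_toy, h_toy, q=2))
```

With the real data, q would be 21 and L the alignment length. Whether the stored matrices follow the block ordering assumed here should be checked against the SEEC and bmDCA code before relying on the numbers.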

Basic Outputs from Evolutionary Model (Ev_Output.mat)

Basic outputs generated by the SEEC model (https://github.com/AlbertodelaPaz/SEEC). The file is in MAT format and can be read with MATLAB.

This file stores the following MATLAB variables:

EvTraj: Amino-acid sequences generated by the evolutionary simulation. Each row represents a different generation from the model. Amino acids are encoded according to the conventions given by bmDCA_preprocessing.sh, with labels from 1 to 21. Generated using Probevolution.m in the SEEC library with parameters stored in Parameters_orig.mat.

R: Statistical dispersion index calculated every 50 steps of the evolutionary trajectory. Defined as the variance of the time between non-synonymous substitutions divided by the average time between substitutions. Generated using Probevolution.m in the SEEC library with parameters stored in Parameters_orig.mat.
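
As an illustration of the definition of R above, the sketch below computes the variance of the waiting times between substitutions divided by their mean for a trajectory stored one generation per row. It is not the calculation performed in Probevolution.m. The function name dispersion_index and the toy trajectory are hypothetical, and three choices are assumptions: a substitution is counted whenever a row differs from the previous one, time is measured in generations, and the population variance is used.

```python
def dispersion_index(trajectory):
    """Variance of waiting times between substitutions over their mean.

    trajectory: list of sequences (lists of integer states), one per generation.
    Returns None when fewer than two waiting times are available.
    ASSUMPTIONS: any change between consecutive rows counts as one
    substitution event; population variance (divide by n) is used.
    """
    # Generations at which the sequence differs from the previous one.
    change_steps = [
        t for t in range(1, len(trajectory))
        if trajectory[t] != trajectory[t - 1]
    ]
    # Waiting times between consecutive substitution events.
    waits = [b - a for a, b in zip(change_steps, change_steps[1:])]
    if len(waits) < 2:
        return None
    mean = sum(waits) / len(waits)
    variance = sum((w - mean) ** 2 for w in waits) / len(waits)
    return variance / mean


# Hypothetical toy trajectory: substitutions occur at generations 1, 4 and 5,
# so the waiting times are 3 and 1.
toy_trajectory = [
    [1, 1],
    [1, 2],
    [1, 2],
    [1, 2],
    [3, 2],
    [3, 4],
]

print(dispersion_index(toy_trajectory))
```

To mirror the stored variable, which is evaluated every 50 steps, the same function could be applied to growing prefixes of EvTraj. That usage is likewise an assumption about how the published values were produced.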

familyH: Statistical energy associated with each member of the original family alignment according to the Potts Hamiltonian under the parameters e and h. Calculated using Generalhamiltonian.m from the SEEC library with parameters stored in Parameters_orig.mat.

trajectoryH: Statistical energy associated with each member of the evolutionary trajectory stored in EvTraj according to the Potts Hamiltonian under the parameters e and h. Calculated using Generalhamiltonian.m from the SEEC library with parameters stored in Parameters_orig.mat and the sequences in EvTraj.