Data from: Epistatic contributions promote the unification of incompatible models of neutral molecular evolution
Data files
Feb 18, 2020 version files 2.40 GB
-
Alignment_002.txt_filtered11
30.48 MB
-
Alignment_003.txt_filtered6
1.99 MB
-
Alignment_005.txt_filtered11
8.87 MB
-
Alignment_006.txt_filtered8
26.77 MB
-
Alignment_007.txt_filtered11
34.70 MB
-
Alignment_008.txt_filtered3
5.45 MB
-
Alignment_010.txt_filtered6
14.35 MB
-
Ev_Output-10.mat
802.06 KB
-
Ev_Output-2.mat
630.17 KB
-
Ev_Output-3.mat
662.99 KB
-
Ev_Output-4.mat
750.24 KB
-
Ev_Output-5.mat
383.09 KB
-
Ev_Output-6.mat
545.44 KB
-
Ev_Output-7.mat
261.56 KB
-
Ev_Output-8.mat
1.39 MB
-
Ev_Output-9.mat
982.89 KB
-
Ev_Output.mat
404.76 KB
-
msa_numerical-10.txt
38.57 MB
-
msa_numerical-2.txt
34.84 MB
-
msa_numerical-3.txt
10.02 MB
-
msa_numerical-4.txt
51.13 MB
-
msa_numerical-5.txt
14.52 MB
-
msa_numerical-6.txt
47.02 MB
-
msa_numerical-7.txt
2.40 MB
-
msa_numerical-8.txt
43.63 MB
-
msa_numerical-9.txt
29.12 MB
-
msa_numerical.txt
18.16 MB
-
Parameters_orig-10.mat
37.26 MB
-
Parameters_orig-2.mat
34.03 MB
-
Parameters_orig-3.mat
7.30 MB
-
Parameters_orig-4.mat
71.37 MB
-
Parameters_orig-5.mat
46.52 MB
-
Parameters_orig-6.mat
38.23 MB
-
Parameters_orig-7.mat
17.07 MB
-
Parameters_orig-8.mat
23.35 MB
-
Parameters_orig-9.mat
20.17 MB
-
Parameters_orig.mat
92.49 MB
-
parameters_ref-10.txt
137.12 MB
-
parameters_ref-2.txt
144.43 MB
-
parameters_ref-3.txt
375.88 MB
-
parameters_ref-4.txt
262.79 MB
-
parameters_ref-5.txt
231.23 MB
-
parameters_ref-6.txt
60.12 MB
-
parameters_ref-7.txt
62.09 MB
-
parameters_ref-8.txt
236.11 MB
-
parameters_ref-9.txt
60.97 MB
-
parameters_ref.txt
19.86 MB
Abstract
We introduce a model of amino acid sequence evolution that accounts for the statistical behavior of real sequences induced by epistatic interactions. The model is referred to as Sequence Evolution with Epistatic Contributions (SEEC). Known statistical properties such as overdispersion, heterotachy and Gamma-distributed rates across sites are shown to be emergent properties of this model while remaining consistent with neutral evolution theory, thereby unifying observations from previously disjoint evolutionary models of sequences. The relationship between site restriction and heterotachy is characterized by tracking the effective alphabet dynamics of sites. We also observe an evolutionary Stokes shift in the fitness of sequences that have undergone evolution under our simulation. This dataset includes all the data used as input to and output from the model in connection with the publication "Epistatic contributions promote the unification of incompatible models of neutral molecular evolution".
Methods
We introduce a model called Sequence Evolution with Epistatic Contributions (SEEC). The model dynamics are based on parameters derived from multiple sequence alignments analyzed with the Direct Coupling Analysis (DCA) methodology. The dataset contains formatted alignments of protein families compiled from Pfam and processed in our manuscript. These alignments are the input to the global inference method DCA. The dataset also includes the parameters that were computed for, and are required to run, the evolutionary model developed in the paper.
Usage notes
The dataset includes Alignments and Matlab files (.mat) with relevant variables used in the evolutionary model described in the article.
Family alignments and e_ij, h_i parameters (Parameters_orig.mat)
alignment number | pfam name | pfam number |
1 | 7tm_1 | PF00001 |
2 | ATP-synt_ab | PF00006 |
3 | Globin | PF00042 |
4 | Response_reg | PF00072 |
5 | ATP-synt_A | PF00119 |
6 | Sigma54_activat | PF00158 |
7 | SBP_bac_3 | PF00497 |
8 | HisKA | PF00512 |
10 | HATPase_c | PF02518 |
11 | Peripla_BP_3 | PF13377 |
Pairwise coupling and local field parameters computed from msa_numerical.txt with the bmDCA method (https://github.com/matteofigliuzzi/bmDCA). The file is in MAT format, which can be read by MATLAB.
This file stores the following MATLAB variables:
Native: Amino-acid sequence used as first member of the evolutionary simulation. Translated directly from the first sequence in the original fasta alignment using bmDCA_preprocessing.sh from the bmDCA library referenced.
align: Amino-acid sequences from the fasta alignment used to train the DCA couplings, translated using bmDCA_preprocessing.sh from the bmDCA library referenced.
e: Coupling matrix (qL x qL) formed from L x L submatrices of size q x q with the pairwise couplings for each pair of sites (the L submatrices on the diagonal are irrelevant). Inferred using bmDCA_v2.1.sh from the referenced bmDCA library.
h: Local fields matrix (q x L) with the single-site contributions to the Potts Hamiltonian. Inferred using bmDCA_v2.1.sh from the referenced bmDCA library.
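As a rough illustration of how e and h enter a Potts Hamiltonian, the following Python sketch evaluates the statistical energy of a single numerically encoded sequence. It is not the Generalhamiltonian.m routine from the SEEC library. The function name potts_energy, the sign convention (energy = minus the sum of fields and couplings), and the assumed block layout of e (the q x q block for sites i and j starting at row i*q and column j*q, 0-based) are assumptions made for the example and should be checked against the MATLAB code before use.

```python
def potts_energy(seq, e, h, q):
    """Statistical energy of one sequence under a Potts model (sketch).

    seq: list of L integer labels in 1..q
    e:   (q*L) x (q*L) nested list of couplings; ASSUMED layout: the q x q
         block for sites (i, j) starts at row i*q and column j*q (0-based)
    h:   q x L nested list of local fields
    Sign convention (ASSUMED): E = -sum_i h_i(a_i) - sum_{i<j} e_ij(a_i, a_j)
    """
    L = len(seq)
    energy = 0.0
    for i in range(L):
        a = seq[i] - 1                 # convert 1-based label to 0-based index
        energy -= h[a][i]              # single-site contribution
        for j in range(i + 1, L):      # each pair counted once; diagonal blocks skipped
            b = seq[j] - 1
            energy -= e[i * q + a][j * q + b]
    return energy


# Toy example with q = 2 states and L = 2 sites (values are made up).
q_toy = 2
h_toy = [[1.0, 2.0],   # fields for state 1 at sites 1 and 2
         [3.0, 4.0]]   # fields for state 2 at sites 1 and 2
e_toy = [[0.0] * 4 for _ in range(4)]
e_toy[0][3] = 0.5      # coupling between (site 1, state 1) and (site 2, state 2)
toy_energy = potts_energy([1, 2], e_toy, h_toy, q_toy)
```

With real data, q would be 21 and L the alignment length, and e and h would first have to be read from Parameters_orig.mat.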
Basic Outputs from Evolutionary Model (Ev_Output.mat)
Most basic outputs generated by the SEEC model (https://github.com/AlbertodelaPaz/SEEC). The file is in MAT format, which can be read by MATLAB.
This file stores the following MATLAB variables:
EvTraj: Amino-acid sequences generated by the evolutionary simulation. Each row represents a different generation from the model. Translation follows the conventions given by bmDCA_preprocessing.sh, with labels from 1 to 21. Generated using Probevolution.m in the SEEC library with parameters stored in Parameters_orig.mat.
R: Statistical dispersion index calculated every 50 steps of the evolutionary trajectory. Defined as the variance of the time between non-synonymous substitutions divided by the average time between substitutions. Generated using Probevolution.m in the SEEC library with parameters stored in Parameters_orig.mat.
familyH: Statistical energy associated with each member of the original family alignment according to the Potts Hamiltonian under the parameters e and h. Calculated using Generalhamiltonian.m from the SEEC library with parameters stored in Parameters_orig.mat.
trajectoryH: Statistical energy associated with each member of the evolutionary trajectory stored in EvTraj according to the Potts Hamiltonian under the parameters e and h. Calculated using Generalhamiltonian.m from the SEEC library with parameters stored in Parameters_orig.mat and the sequences in EvTraj.
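The definition of R given above (variance of the time between substitutions divided by the average time between substitutions) can be sketched in a few lines of Python. This is an illustration only, not the computation performed in Probevolution.m. The helper names substitution_steps and dispersion_index are hypothetical, a substitution is assumed to be any generation whose row differs from the previous row, and the population variance is used (the original code may use the sample variance instead).

```python
from statistics import mean, pvariance


def substitution_steps(trajectory):
    """Indices of generations whose sequence differs from the previous one.

    trajectory: list of rows, each a list of integer labels (as in EvTraj).
    ASSUMPTION: any change between consecutive rows counts as a substitution.
    """
    return [t for t in range(1, len(trajectory))
            if trajectory[t] != trajectory[t - 1]]


def dispersion_index(times):
    """Variance of waiting times between events divided by their mean."""
    waits = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    if len(waits) < 2:
        raise ValueError("need at least three event times")
    return pvariance(waits) / mean(waits)   # population variance (ASSUMED)


# Toy trajectory with made-up labels: changes occur at generations 2, 3 and 6.
toy_traj = [[1, 2], [1, 2], [1, 3], [2, 3], [2, 3], [2, 3], [2, 4]]
toy_steps = substitution_steps(toy_traj)
toy_R = dispersion_index(toy_steps)
```

Perfectly regular substitutions give an index of 0, and larger values indicate more irregular (bursty) waiting times.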