Beyond mutations: accounting for selection and self-organization in the analysis of protein evolution
Data files
Mar 01, 2024 version files 450.17 KB
- cytochrome_b.clustal_num
- cytochrome_b.docx
- cytochrome_c_oxidase_I.clustal_num
- cytochrome_c_oxidase_I.docx
- cytochrome_c_oxidase_III.clustal_num
- cytochrome_c_oxidase_III.docx
- NADH_dehydrogenase_I.clustal_num
- NADH_dehydrogenase_I.docx
- README.md
- S100_calcium_binding_protein_A6.docx
- S100A6.clustal_num
Abstract
Molecular phylogenetic research has relied on the analysis of the coding sequences of genes or of the amino acid sequences of the encoded proteins. Enumerating mismatches, as indicators of mutation, has been central to the pertinent algorithms. However, the constraining forces of selection and self-organization have been unaccounted for in conventional approaches, possibly causing available models to fall short of representing the actual evolutionary history. Specific amino acids possess quantifiable characteristics that enable the conversion from “words” (strings of letters denoting amino acids or bases) to “waves” (strings of quantitative values representing the physico-chemical properties) or to matrices (coordinates representing the positions in a comprehensive property space). The application of such numerical representations to evolutionary analysis takes into account not only mutation but also selection/self-organization as influences that drive speciation, because selective pressures favor certain mutations over others, and this predilection is represented in the characteristics of the incorporated amino acids (it is not borne out solely by the mismatches). Besides being more discriminating sources for tree-generating algorithms than match/mismatch, the number strings can be examined for overall similarity with average mutual information, autocorrelation, and fractal dimension. Bivariate wavelet analysis aids in distinguishing hypermutable from conserved domains of the protein. Further, the matrix depiction is readily subjected to comparisons of distances (Euclidean distance, Frobenius distance), and it allows the generation of heat maps or graphs. These analytical algorithms have been automated in R and are applicable to various processes that are describable in matrix format.
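The “words” to “waves” conversion described above can be illustrated with a minimal sketch. It is written in Python rather than the authors' R, and it is not their implementation. The choice of the Kyte-Doolittle hydropathy index as the property scale, the function names, and the five-residue toy sequences are all assumptions made for illustration.

```python
import math

# Kyte-Doolittle hydropathy index: one widely used per-residue scale, chosen
# here as an assumption; any other physico-chemical scale could be substituted.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_wave(sequence, scale=KYTE_DOOLITTLE):
    """Convert a "word" (amino acid letters) into a "wave" (property values).

    Gap characters and unknown residues map to None.
    """
    return [scale.get(residue) for residue in sequence.upper()]


def mismatch_count(seq_a, seq_b):
    """Conventional match/mismatch score for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def euclidean_distance(wave_a, wave_b):
    """Euclidean distance over positions where both waves have a value."""
    if len(wave_a) != len(wave_b):
        raise ValueError("waves must have equal length")
    squared = [
        (a - b) ** 2
        for a, b in zip(wave_a, wave_b)
        if a is not None and b is not None
    ]
    return math.sqrt(sum(squared))


# Hypothetical toy sequences: one conservative and one radical substitution.
reference = "MKVLA"
conservative = "MKILA"  # V -> I, both hydrophobic
radical = "MKDLA"       # V -> D, hydrophobic -> charged

for label, variant in (("conservative", conservative), ("radical", radical)):
    mismatches = mismatch_count(reference, variant)
    distance = euclidean_distance(to_wave(reference), to_wave(variant))
    print(label, mismatches, round(distance, 2))
```

Both variants differ from the reference by a single mismatch, yet their hydropathy distances differ by more than an order of magnitude (about 0.3 versus about 7.7 on this scale), which is the extra discrimination the numerical representation is meant to provide. If several property scales were stacked as rows of a matrix, the same sum of squared differences taken over all entries would give the Frobenius distance.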
README: Beyond Mutations: Accounting for Selection and Self-Organization in the Analysis of Protein Evolution
https://doi.org/10.5061/dryad.tht76hf63
Description of the data and file structure
Publicly accessible sequences were collected from the NCBI landmark model organisms, and representatives of diverse clades were then added from NCBI Nucleotide.
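As a hedged illustration of how the alignment files might be read, the sketch below assumes that the `.clustal_num` files follow the usual Clustal text layout: a header line, blocks of `name sequence [count]` rows, and consensus rows that start with whitespace. The function name `parse_clustal` and the sample alignment are hypothetical; the actual files should be inspected before relying on this.

```python
def parse_clustal(text):
    """Collect aligned sequences from Clustal-formatted text.

    Returns a dict mapping each sequence name to its full aligned string.
    Trailing residue counts (the "_num" part of the format) are ignored.
    """
    sequences = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank separator between blocks
        if line.startswith("CLUSTAL"):
            continue  # header line (assumed to start with "CLUSTAL")
        if line[0].isspace():
            continue  # consensus row of '*', ':' and '.' symbols
        parts = line.split()
        if len(parts) < 2:
            continue  # not a name/sequence row
        name, chunk = parts[0], parts[1]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


# Hypothetical two-sequence alignment, split across two blocks.
sample = """CLUSTAL O(1.2.4) multiple sequence alignment

human      MKVLA-T 6
mouse      MKILAGT 7
           **:** *

human      PPQ 9
mouse      PPQ 10
           ***
"""

alignment = parse_clustal(sample)
for name, aligned in alignment.items():
    print(name, aligned)
```

The parser depends only on the standard library. For real use, an established alignment reader would likely be more robust to format variants than this sketch.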
Sharing/Access information
Data was derived from the following sources:
- NCBI
Code/Software
NA