
Beyond mutations: accounting for selection and self-organization in the analysis of protein evolution

Cite this dataset

Weber, Georg F.; Wu, Xiaoyong; Rai, Shesh N. (2024). Beyond mutations: accounting for selection and self-organization in the analysis of protein evolution [Dataset]. Dryad. https://doi.org/10.5061/dryad.tht76hf63

Abstract

Molecular phylogenetic research has relied on the analysis of the coding sequences of genes or of the amino acid sequences of the encoded proteins. Enumerating mismatches, as indicators of mutation, has been central to the pertinent algorithms. However, the constraining forces of selection and self-organization have been unaccounted for in conventional approaches, possibly causing available models to fall short of representing the actual evolutionary history. Specific amino acids possess quantifiable characteristics that enable the conversion from “words” (strings of letters denoting amino acids or bases) to “waves” (strings of quantitative values representing the physico-chemical properties) or to matrices (coordinates representing the positions in a comprehensive property space). The application of such numerical representations to evolutionary analysis takes into account not only mutation but also selection/self-organization as influences that drive speciation, because selective pressures favor certain mutations over others, and this predilection is represented in the characteristics of the incorporated amino acids (it is not borne out solely by the mismatches). Besides being more discriminating sources for tree-generating algorithms than match/mismatch, the number strings can be examined for overall similarity with average mutual information, autocorrelation, and fractal dimension. Bivariate wavelet analysis aids in distinguishing hypermutable from conserved domains of the protein. Further, the matrix depiction is readily subjected to comparisons of distances (Euclidean distance, Frobenius distance), and it allows the generation of heat maps or graphs. These analytical algorithms have been automated in R and are applicable to various processes that are describable in matrix format.
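The conversion from “words” to “waves” and matrices described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' R implementation. The choice of the Kyte-Doolittle hydropathy index as the property scale, the simplified side-chain charge used as a second property, the function names, and the toy sequences are all assumptions made for illustration. The sketch also assumes the sequences are already aligned and of equal length.

```python
import math

# Kyte-Doolittle hydropathy index (assumed property scale for this sketch).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge near neutral pH (illustrative assumption;
# residues not listed are treated as uncharged).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def to_wave(seq, scale=HYDROPATHY):
    """Convert a protein 'word' into a 'wave' of property values."""
    return [scale[aa] for aa in seq.upper()]


def to_matrix(seq):
    """Convert a protein 'word' into rows of (hydropathy, charge) coordinates."""
    return [[HYDROPATHY[aa], CHARGE.get(aa, 0.0)] for aa in seq.upper()]


def mismatches(seq_a, seq_b):
    """Count positional mismatches between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))


def euclidean_distance(wave_a, wave_b):
    """Euclidean distance between two equal-length waves."""
    if len(wave_a) != len(wave_b):
        raise ValueError("waves must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(wave_a, wave_b)))


def frobenius_distance(mat_a, mat_b):
    """Frobenius norm of the element-wise difference of two matrices."""
    if len(mat_a) != len(mat_b):
        raise ValueError("matrices must have the same number of rows")
    total = 0.0
    for row_a, row_b in zip(mat_a, mat_b):
        if len(row_a) != len(row_b):
            raise ValueError("matrices must have the same number of columns")
        total += sum((a - b) ** 2 for a, b in zip(row_a, row_b))
    return math.sqrt(total)


if __name__ == "__main__":
    reference = "MKVL"        # hypothetical toy sequences
    conservative = "MKIL"     # V -> I keeps the residue hydrophobic
    radical = "MKDL"          # V -> D introduces a charged residue

    for other in (conservative, radical):
        print(
            other,
            mismatches(reference, other),
            round(euclidean_distance(to_wave(reference), to_wave(other)), 3),
            round(frobenius_distance(to_matrix(reference), to_matrix(other)), 3),
        )
```

Both toy variants differ from the reference by a single mismatch, yet the wave and matrix distances separate the conservative substitution (hydropathy difference of about 0.3) from the radical one (about 7.7). This is the sense in which the numerical representations can be more discriminating than match/mismatch counts.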

README: Beyond Mutations: Accounting for Selection and Self-Organization in the Analysis of Protein Evolution

https://doi.org/10.5061/dryad.tht76hf63

Description of the data and file structure

Publicly accessible sequences were collected for the NCBI landmark model organisms; representatives of diverse clades were then added from NCBI Nucleotide.

Sharing/Access information

Data was derived from the following sources:

  • NCBI

Code/Software

NA

Funding

National Cancer Institute, Award: CA224104

Steven Goldman Memorial

University of Cincinnati, Award: Pivot Award