
Deep evolutionary analysis reveals the design principles of fold A glycosyltransferases

Cite this dataset

Taujale, Rahil et al. (2020). Deep evolutionary analysis reveals the design principles of fold A glycosyltransferases [Dataset]. Dryad. https://doi.org/10.5061/dryad.v15dv41sh

Abstract

Glycosyltransferases (GTs) are prevalent across the tree of life and regulate nearly all aspects of cellular functions. The evolutionary basis for their complex and diverse modes of catalytic functions remains enigmatic. Here, based on deep mining of over half a million GT-A fold sequences, we define a minimal core component shared among functionally diverse enzymes. We find that variations in the common core and emergence of hypervariable loops extending from the core contributed to GT-A diversity. We provide a phylogenetic framework relating diverse GT-A fold families for the first time and show that inverting and retaining mechanisms emerged multiple times independently during evolution. Using evolutionary information encoded in primary sequences, we trained a machine learning classifier to predict donor specificity with nearly 90% accuracy and deployed it for the annotation of understudied GTs. Our studies provide an evolutionary framework for investigating complex relationships connecting GT-A fold sequence, structure, function and regulation.

Methods

The GT-A sequences were collected by a similarity search strategy using multiply aligned manually curated GT-A fold profiles. The sequences were further aligned to the profiles to determine the GT-A domain bounds and insertions.

Usage notes

This dataset includes all putative GT-A fold sequences that belong to one of the 53 GT-A fold families. These were collected by searching the NCBI nr and UniProt proteomes databases. The file hierarchy.tsv contains a table that lists the hierarchy of the families. For each level and family, there are corresponding _nr.fasta, _nr.tsv, _uniprot.fasta and _uniprot.tsv files that contain the sequences from the NCBI nr and UniProt proteomes databases in FASTA and TSV formats, respectively.
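
As a rough illustration of how these files might be handled, the Python sketch below builds the four file names expected for one level or family and reads sequences from a FASTA file using only the standard library. It rests on assumptions not confirmed by this record: that each file name is the level or family name followed directly by the suffix, and that the FASTA files follow the usual ">" header convention. The name "example_family" and the helper functions are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple

# Suffixes listed in the usage notes above.
SUFFIXES = ("_nr.fasta", "_nr.tsv", "_uniprot.fasta", "_uniprot.tsv")


def expected_files(name: str) -> List[str]:
    """Return the four file names expected for one level or family.

    Assumption: the file name is the level/family name plus the suffix.
    """
    return [name + suffix for suffix in SUFFIXES]


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file.

    Sequence lines belonging to one record are joined into a single string.
    """
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # "example_family" is a placeholder, not a name taken from the dataset.
    print(expected_files("example_family"))

    # Toy in-memory FASTA text standing in for a real *_nr.fasta file.
    toy = io.StringIO(">seq1 toy record\nMKV\nLAA\n>seq2\nGGS\n")
    for header, sequence in read_fasta(toy):
        print(header, len(sequence))
```

For a real file, `read_fasta(open(path))` could replace the in-memory example. The layout of hierarchy.tsv and of the per-family .tsv files is not described here, so their columns should be inspected before parsing them.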

Funding

National Institutes of Health, Award: R01 GM130915

National Institutes of Health, Award: T32 GM107004