Dryad

UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural small angle scattering and SESCA circular dichroism (CD) calculations on AlphaFold predicted structures

Cite this dataset

Brookes, Emre; Rocco, Mattia (2023). UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural small angle scattering and SESCA circular dichroism (CD) calculations on AlphaFold predicted structures [Dataset]. Dryad. https://doi.org/10.5061/dryad.jq2bvq89s

Abstract

Recent spectacular advances by AI programs in 3D structure predictions from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting "folding frenzy" has already produced predicted protein structure databases for the entire human and other organisms' proteomes. However, rapidly ascertaining a predicted structure's reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients (D0t(20,w), s0(20,w)) and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure likeliness, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US‑SOMO) suite, a database was implemented calculating from AlphaFold structures the corresponding D0t(20,w), s0(20,w), [η], p(r) vs. r, and other parameters. Circular dichroism spectra were computed using the SESCA program. Some of AlphaFold's drawbacks were mitigated, such as generating whenever possible a protein's mature form. Others, like AlphaFold's direct applicability to single-chain structures only, the absence of prosthetic groups, or flexibility issues, are discussed. Overall, this implementation of the US‑SOMO‑AF database should already aid in rapidly evaluating the consistency in solution of a relevant portion of AlphaFold predicted protein structures.

Methods

Production of this dataset required three major steps: collecting the AlphaFold entries and additional metadata; preparing the structures for hydrodynamic, structural and CD calculations; and computing the hydrodynamic, structural and CD properties.

Briefly, each entry in the entire AlphaFold database was first compared with the corresponding entry in the UniProt database to find the (putative) initiator methionine, signal peptide and transit peptide regions, which were subsequently removed from the AlphaFold PDB files. Additional variants were created when propeptides were found. Potential disulfides were identified (subsequently allowing a better evaluation of the partial specific volume and of M) and written as SSBOND records in the cured PDBs, together with HELIX and SHEET information identified using the DSSP implementation in UCSF Chimera (Pettersen et al., 2004, Journal of Computational Chemistry, 25(13), pp. 1605-1612). Batch-mode US-SOMO was then used to calculate the mass M, the translational diffusion coefficient D0t(20,w), the sedimentation coefficient s0(20,w), the derived Stokes' (or hydrodynamic) radius Rs, the intrinsic viscosity [η], the radius of gyration Rg, and the maximum extensions along the principal X, Y and Z axes of the molecule, and to generate an anhydrous small angle X-ray scattering pair-wise distance distribution function, p(r) vs. r, normalized by the M of the structure. SESCA was subsequently used to generate 170-270 nm circular dichroism (CD) spectra from each cured structure.
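As an illustration of how the hydrodynamic parameters listed above relate to one another, the sketch below applies two standard textbook relations: the Stokes-Einstein equation, Rs = kT / (6πηD), and the Svedberg equation, s = M D (1 - v̄ρ) / (RT). It is not taken from US-SOMO and does not claim to reproduce its internal calculations. The function names, the rounded water constants at 20 °C, and the input values in the usage lines are assumptions chosen for illustration only.

```python
import math

# Physical constants (exact SI values).
K_B = 1.380649e-23       # Boltzmann constant, J/K
N_A = 6.02214076e23      # Avogadro constant, 1/mol

# Standard conditions "20,w": water at 20 degrees C.
# The viscosity and density below are rounded handbook values (assumption).
T_20 = 293.15            # K
ETA_W20 = 1.002e-3       # Pa*s
RHO_W20 = 998.2          # kg/m^3


def stokes_radius_nm(d_cm2_s):
    """Stokes-Einstein: Rs = kT / (6 pi eta D), with D in cm^2/s, Rs in nm."""
    d_m2_s = d_cm2_s * 1e-4                      # cm^2/s -> m^2/s
    rs_m = K_B * T_20 / (6.0 * math.pi * ETA_W20 * d_m2_s)
    return rs_m * 1e9                            # m -> nm


def svedberg_s(mass_g_mol, d_cm2_s, vbar_cm3_g):
    """Svedberg equation: s = M D (1 - vbar rho) / (R T), returned in S."""
    mass_kg_mol = mass_g_mol * 1e-3              # g/mol -> kg/mol
    d_m2_s = d_cm2_s * 1e-4                      # cm^2/s -> m^2/s
    vbar_m3_kg = vbar_cm3_g * 1e-3               # cm^3/g -> m^3/kg
    gas_constant = K_B * N_A                     # J/(mol K)
    s_seconds = (mass_kg_mol * d_m2_s * (1.0 - vbar_m3_kg * RHO_W20)
                 / (gas_constant * T_20))
    return s_seconds * 1e13                      # 1 S = 1e-13 s


# Hypothetical inputs, not taken from the dataset.
example_rs = stokes_radius_nm(1.06e-6)
example_s = svedberg_s(14300.0, 1.06e-6, 0.72)
print(f"Rs = {example_rs:.2f} nm, s = {example_s:.2f} S")
```

Relations of this kind make it possible to cross-check a measured D or s against the tabulated values for a given entry, provided the measurement has been corrected to standard conditions (water at 20 °C).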

Usage notes

This is a tar archive containing the data for every AlphaFold entry. For each entry it includes five files: a csv file containing all hydrodynamic parameters, a pdb file containing the cured structure, an mmCIF file containing the cured structure, a data file containing the circular dichroism spectrum, and a dat file containing the p(r) vs. r distribution.

Use "tar xf somoaf_all_data.tar" to extract the primary archive.
This will result in 1,002,038 individual .txz files, each representing one UniProt accession code and containing 5 files.
When propeptides are identified and removed, the extracted file name will contain a -pp#, where # is a list of the propeptides removed.
For example, to extract the data from an individual txz file, use "tar Jxf xxxx.txz", where xxxx is replaced by the appropriate name containing the accession code.
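
For scripted access, the same extraction step can be sketched in Python with the standard-library tarfile module. This is a minimal example rather than a tool supplied with the dataset: the function name extract_txz and the paths in the usage comment are hypothetical, and the layout inside each .txz file should be checked against the README.md.

```python
import tarfile
from pathlib import Path


def extract_txz(txz_path, dest_dir):
    """Extract one xz-compressed tar archive and return its member names.

    Equivalent in spirit to "tar Jxf xxxx.txz", but unpacking into dest_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:xz" opens the archive for reading with xz (LZMA) decompression.
    with tarfile.open(txz_path, mode="r:xz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Usage (hypothetical paths; replace xxxx with the actual file name):
# members = extract_txz("xxxx.txz", "extracted/xxxx")
# print(members)
```

Returning the member names makes it easy to confirm that the expected five files are present before further processing.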

Further details are in the provided README.md file.

Funding

National Institute of General Medical Sciences, Award: GM120600

National Science Foundation, Award: OAC-1912444