Data from: Functional regulation of aquaporin dynamics by lipid bilayer composition
Data files
Feb 09, 2024 version files 224.08 GB
- 00-starting-coordinates-and-params.zip
- 01-features-and-MSM.zip
- 02-discrete-frames-and-analysis.zip
- 03-complex-traj-stripped-lipid.tar.bz2
- 03-complex-traj-stripped-water.tar.bz2
- 03-LL-traj-stripped-lipid.tar
- 03-PL-traj-stripped-lipid.tar
- 03-PO-traj-stripped-lipid.tar
- 04-revision-analyses.zip
- GITHUB.zip
- msm-env.yml
- plotly.yml
- README.md
- SOURCE-F8_Lipid_Order_LL.tar.bz2
- SOURCE-F8_Lipid_Order_PL.tar.bz2
- SOURCE-F8_Lipid_Order_PO.tar.bz2
- SOURCE-Landscape_Figures.tar.bz2
- SOURCE-Main_Text_no_FEL_ORD.tar.bz2
- SOURCE-SF19-29-30-31-32-33.tar.bz2
- SOURCE-SI_no_FEL_ORD.tar.bz2
Abstract
With the diversity of lipid-protein interactions, any observed membrane protein dynamics or functions directly depend on the lipid bilayer selection. However, the implications of lipid bilayer choice are seldom considered unless characteristic lipid-protein interactions have been previously reported. Using molecular dynamics simulation, we characterize the effects of membrane embedding on plant aquaporin SoPIP2;1, which has no reported high-affinity lipid interactions. The regulatory impacts of a realistic lipid bilayer, and nine different homogeneous bilayers, on varying SoPIP2;1 dynamics were examined. We demonstrate that SoPIP2;1's structure, thermodynamics, kinetics, and water transport are altered as a function of each membrane construct's ensemble properties. Notably, the realistic bilayer provides stabilization of non-functional SoPIP2;1 metastable states. Hydrophobic mismatch and lipid order parameter calculations further explain how lipid ensemble properties manipulate SoPIP2;1 behavior. Our results illustrate the importance of careful bilayer selection when studying membrane proteins. To this end, we advise cautionary measures when performing membrane protein molecular dynamics simulations.
README: Functional regulation of aquaporin dynamics by lipid bilayer composition
https://doi.org/10.5061/dryad.jsxksn0hc
Our data can be separated into a few different categories, whose architecture we further dissect below.
00 - Molecular dynamics simulations must be performed using topology (.parm7/.prmtop) and coordinate (.rst/.ncrst) files. We have provided AMBER-compatible simulation files parameterized using the CHARMM36 force fields.
01 - Our primary method for this manuscript was an adaptive sampling regime, in which the resulting aggregate trajectories were incorporated into master kinetic frameworks using Markov state models (MSMs). An MSM has been made and optimized for each SoPIP2:lipid bilayer construct. To make an MSM, one generally follows this protocol:
-> Featurization - numerical ways to describe the structures obtained in molecular dynamics trajectories in "n" degrees of
freedom using reaction coordinates. Reaction coordinates are chosen as described in our Methods
section to best describe the underlying dynamics of SoPIP2;1 opening/closing transitions provided each
lipid bilayer environment.
-> Discretization - after being described through featurization, each npy file corresponding to a trajectory is discretized into
microstates using a mini-batch k-means clustering algorithm within PyEMMA. A resulting clustering object
file is prepared to reflect the discretization step. A tICA (time-lagged independent component analysis)
decomposition file is prepared, representing dimensionality reduction.
-> MSM construction - Maximum likelihood MSMs are made using PyEMMA, where the input is the clustering object and
the output is the resulting MSM object file.
-> Validation - MSMs need to be validated for their quality so that any thermodynamic or kinetic results can be interpreted
with confidence. To validate the quality of our MSMs, we evaluate convergence of our implied timescale
(ITS) plots; confirm Markovianity using the Chapman-Kolmogorov test; verify that the statistical
reweighting introduced by the MSM is not "overtuned" to the point of distorting our data structures by generating
artificial metastable states; and optimize the VAMP-2 score.
The construction and validation of MSMs is a bit of a "chicken-and-egg" problem, as some of the hyperparameter tuning
requires initial guesses and evaluations of inputs (i.e., Markov lag time, cluster count, tIC dimensionality, and input feature
set selection). We perform a grid search testing many combinations of parameters and use our swath of results to select a
representative MSM that satisfies all Validation criteria.
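The construction and validation steps above can be illustrated with a minimal NumPy sketch. It is not the PyEMMA workflow used for this dataset: it assumes a simple row-normalized (non-reversible) transition matrix estimate rather than a maximum likelihood one, it runs on a synthetic three-state trajectory, and all function names are illustrative.

```python
import numpy as np

def count_matrix(dtrajs, n_states, lag):
    """Count i -> j transitions separated by `lag` frames (lag >= 1, sliding window)."""
    counts = np.zeros((n_states, n_states))
    for dtraj in dtrajs:
        dtraj = np.asarray(dtraj)
        np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1.0)
    return counts

def transition_matrix(counts):
    """Row-normalize the counts; rows with no counts are left as zeros."""
    rows = counts.sum(axis=1, keepdims=True)
    return counts / np.where(rows > 0, rows, 1.0)

def implied_timescales(T, lag, k=1):
    """t_i = -lag / ln|lambda_i| for the k slowest non-stationary processes."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return -lag / np.log(eigvals[1:k + 1])

def ck_error(dtrajs, n_states, lag, k):
    """Chapman-Kolmogorov check: compare T(lag)^k with T(k*lag) estimated directly."""
    T_lag = transition_matrix(count_matrix(dtrajs, n_states, lag))
    T_klag = transition_matrix(count_matrix(dtrajs, n_states, k * lag))
    return float(np.abs(np.linalg.matrix_power(T_lag, k) - T_klag).max())

# Synthetic discrete trajectory drawn from a known 3-state Markov chain.
rng = np.random.default_rng(0)
T_true = np.array([[0.95, 0.05, 0.00],
                   [0.05, 0.90, 0.05],
                   [0.00, 0.05, 0.95]])
dtraj = np.empty(20000, dtype=int)
dtraj[0] = 0
for t in range(1, dtraj.size):
    dtraj[t] = rng.choice(3, p=T_true[dtraj[t - 1]])

T_est = transition_matrix(count_matrix([dtraj], 3, lag=1))
its = implied_timescales(T_est, lag=1, k=2)   # in units of frames
ck = ck_error([dtraj], 3, lag=1, k=5)         # small when the model is Markovian
```

In a real analysis the discrete trajectories would come from the clustering step, and the implied timescales would be recomputed over a range of lag times to look for a plateau.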
02 - Discrete frames were extracted from our metastable states in order to gain ensemble level insights on SoPIP2;1 structure in response to different lipid bilayers. These discrete frames were used for our plugging dihedral analyses (Fig. 4 of our final NCOMMS submission) and hydrophobic mismatch (Fig. 7 of our final NCOMMS submission).
03 - Continuous trajectories that describe metastable states from each tICA landscape are provided, as well as their subsequent analyses. Depending on the analysis, some of the trajectories are stripped of just water molecules and ions (i.e., lipids remain for lipid-related analyses), or are stripped of just lipids and ions (i.e., water molecules remain for water-related analyses). These continuous trajectories were used as replicates for different analyses, including water transport activity (Fig. 5 of our final NCOMMS submission), HOLE (Fig. 6 of our final NCOMMS submission), lipid order parameter (Fig. 8 of our final NCOMMS submission), as well as related Supplementary Analyses. NOTE that due to repository limitations on Dryad, the $LIPID-traj-stripped-water.tar.bz2 data are contained within the SOURCE-F8 directories.
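The lipid order parameter files in this dataset carry an `scc` label, which suggests the carbon-carbon order parameter S_CC. The sketch below assumes the common definition S_CC = <3 cos^2(theta) - 1>/2, where theta is the angle between each bond joining consecutive tail carbons and the membrane normal (taken here to be the z-axis). The function name and array layout are illustrative and are not taken from the authors' scripts.

```python
import numpy as np

def scc_order_parameter(tail_coords, normal=(0.0, 0.0, 1.0)):
    """Average S_CC for one acyl tail.

    tail_coords: array of shape (n_frames, n_carbons, 3) holding the
    positions of consecutive tail carbons (assumed layout).
    """
    coords = np.asarray(tail_coords, dtype=float)
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    # Vectors between consecutive carbons, shape (n_frames, n_carbons - 1, 3).
    bonds = coords[:, 1:, :] - coords[:, :-1, :]
    cos_theta = (bonds @ normal) / np.linalg.norm(bonds, axis=-1)
    # Average over frames and over bonds along the tail.
    return float(np.mean(0.5 * (3.0 * cos_theta**2 - 1.0)))
```

Under this definition a tail aligned with the membrane normal gives 1, a tail lying in the membrane plane gives -0.5, and an isotropically disordered tail averages to about 0.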
04 - During the peer review process, we were asked to perform additional analyses describing the direct interactions between SoPIP2;1 and each of its lipid bilayer embeddings. These analyses include lipid residence time, which was calculated using PyLipID and is included in our Supplementary Information as heatmaps and radial protein-lipid interaction fingerprints. A second analysis compared average lipid contact probabilities between most known aquaporins present in the RCSB PDB databank and our simulated aquaporin, SoPIP2;1. This folder mainly contains PDB files for the chains of individual protomers obtained from the MemProtMD database; Excel spreadsheets for organizing average lipid residence times in a state-dependent manner; Excel spreadsheets for mapping these aquaporin sequences' residues and b-factors to SoPIP2;1; and Jupyter notebooks for plotting the data.
Lastly, the Source Data, which includes the data, scripts, and python environments needed to directly reproduce manuscript text figures, is provided. Inclusion of Source Data is a requirement of Nature Publishing Group editorial policies.
Description of the data and file structure
└── SoPIP2-lipid-GitHub/
├── 00-starting-coordinates-and-params/
│ ├── open-structure/
│ │ ├── .rst/ncrst
│ │ └── .parm7/prmtop
│ └── closed-structure
├── 01-features-and-MSM/
│ ├── features-per-system/
│ │ ├── POPC-features.tar.bz2: calculated features for all frames and all trajectories
│ │ └── ... (other bilayer systems files are named similarly: $lipid-)
│ └── MSM-related-objs/
│ ├── POPC/
│ │ ├── *features.pkl: indices of protein residue pairs
│ │ ├── *tica_obj.pkl: PyEMMA tICA object
│ │ ├── *tica_trajs.pkl: PyEMMA tICs for each traj and frame
│ │ ├── *cluster_obj.pkl: PyEMMA k-means clustering object
│ │ ├── *msm_obj.pkl: PyEMMA MSM object
│ │ ├── *ITS-error.pkl: PyEMMA implied timescale object
│ │ ├── weights.pkl: PyEMMA stationary distribution of MSM
│ │ └── probabilities.npy: msm and raw probability of each cluster
│ └── ...
├── 02-discrete-frames-and-analysis/
│ ├── minima-box (.csv files)
│ ├── trajectories.tar.bz2 (.xtc files with water stripped along with *gro parameters)/
│ │ ├── POPC-frames-stripped-wat.tar.bz2
│ │ └── ...
│ ├── dihedral.tar.bz2/
│ │ ├── POPC-dihedral.tar.bz2 (all *npy containing dihedral data of POPC system)
│ │ └── ...
│ └── mismatch.tar.bz2/
│ ├── prot-bulk-mismatch.tar.bz2 (all *pkl containing prot-bulk data)
│ └── shell-bulk-mismatch.tar.bz2 (all *pkl containing shell-bulk data)
├── 03-continuous-trajs-and-analysis/
│ ├── cont-traj-data.csv (all cont trajs information and calculated results)
│ ├── trajectories-stripped-lipid.tar.bz2/
│ │ ├── POPC-traj-stripped-lipid.tar.bz2/
│ │ │ ├── strippedxtc of 3 trajs (lipids stripped) per macrostate
│ │ │ └── *pdb of coordinate files for analyses
│ │ └── ...
│ ├── trajectories-stripped-water.tar.bz2/
│ │ ├── complex-traj-stripped-water.tar.bz2/
│ │ │ ├── *wat.xtc of 3 trajs (waters stripped) per macrostate
│ │ │ └── *pdb of coordinate files for analyses
│ │ │
│ │ └── For homogeneous bilayers, trajs are in SOURCE-F8_Lipid_Order_.tar.bz
│ ├── water.tar.bz2/
│ │ ├── POPC-wat-transport.tar.bz2/
│ │ │ ├── passagetime: time it takes for each water to transport
│ │ │ ├── transportedres: residue ID of the waters that was transported
│ │ │ ├── transport: 3d array recording whether a water was transported at each frame
│ │ │ └── wat-in-pore: number of water occupying the pore at each frame
│ │ ├── POPC-wat-restime.tar.bz2/
│ │ │ ├── AVG-RES-TIME: average time water continuously spends in slice, per water
│ │ │ ├── STD-RES-TIME: standard deviation of average residence time
│ │ │ ├── FRAMES-PER-WAT: per water, record frames during which water occupy slices
│ │ │ ├── WAT-PER-FRAME: per frame, record waters in slice
│ │ │ ├── ALL-WAT-COUNT: number of water continuously in slice, per water
│ │ │ └── ALL-WAT: record all water that has been in slice
│ │ └── ...
│ └── lipid-order.tar.bz2/
│ ├── POPC-lipid-ord.tar.bz2/
│ │ ├── *protein-x.pkl: protein x positions across traj
│ │ ├── *protein-y.pkl: protein y positions across traj
│ │ ├── *lipid-x.pkl: lipid x positions of each lipid
│ │ ├── *lipid-y.pkl: lipid y positions of each lipid
│ │ ├── *scc_sn1.pkl: order param of each lipid's tail sn1
│ │ ├── *scc_sn2.pkl: order param of each lipid's tail sn2
│ │ ├── *avg_scc.pkl: average order param of each lipid
│ │ └── *std_scc.pkl: standard deviation of order param
│ └── ...
│ ├── prot-bulk-mismatch.tar.bz2 (all *pkl containing prot-bulk data)
│ └── shell-bulk-mismatch.tar.bz2 (all *pkl containing shell-bulk data)
└── 04-revision-analyses/
├── revisions_data/
│ ├── APL (has area-per-lipid code and files)
│ ├── PYLIPID (has representative traj, and calculations for each bilayer)
│ ├── MEMPROTMD (MEMPROTMD AQP-data curated, aligned structures, and excel sheets)
│ └── lipid_interaction_plotting (notebooks and illustrator files for figure generation)
└── revision_code/
├── complex_pylipid (has code for PyLipID calculations on complex bilayer)
├── anh_pylipid.yml (yml for creating PyLipID environment. Has conda and pip commands too)
├── anh_pylipid_no_pip.yml (use this yml first by completing all conda command installs)
├── lipid-residence-time.py (PyLipID script for homogeneous bilayer trajectories)
└── load-membrane-apl-agr.py (script for parsing Membrainy APL calculation agr outputs)
Sharing/Access information
All parameter files and trajectories are organized by protein-structure/lipid and are being uploaded to a Box folder. While Box is not necessarily public by default, we have made our Box folder accessible to anyone who has the link. This Box link is available on our GitHub repository and is reposted here as well:
https://uofi.app.box.com/folder/210261756127?s=uc33gid1jhyuc0oru8tr30to3x9kyj8z
Code/Software
Code is provided in our GitHub repository:
https://github.com/ShuklaGroup/Lipid_composition_on_AQP
In general, code was written in Python and Bash. Figures were made using a combination of packages such as matplotlib, seaborn, and plotly.
Source Data
The msm-env.yml was used on our local clusters to perform and plot these analyses (except for the MemProtMD and PyLipID heatmaps, and the Radial Fingerprint analysis). The plotly.yml was used to recreate all plots on a personal machine. The main needs are matplotlib, seaborn, mdtraj, pyemma, and scikit-learn, to name a few dependencies. The only exception is "SF27 - Radial FP", which uses plotly. The plotly.yml was also used for "SF21-22-23_MemProtMD" and "SF24-25-26_PyLipID". If the plotly.yml has too many other dependencies, a fresh plotly install will do the trick.
Due to upload limitations on Dryad, the Source Data are divided across a few different tar.bz2 files. The notation "F" is used for Main Text Figures, and the notation "SF" is used for Supplementary Figures.
SOURCE-Main_Text_no_FEL_ORD.tar.bz2
Contains F3, F4, F5, F6, and F7
F3 - MFPT
yml provided
run plot-mfpt-240109.ipynb with jupyter notebook
F4 - dihedral
yml provided
cd $LIPID
unpack *tar.bz2
cd into unpacked directory
python plot-dihedral-pi-231001-revision.py
F5 - water_transport
yml provided
Panel A
unpack water_transport_data.tar.bz2
cd into unpacked directory
run plot-water-230406.ipynb with jupyter notebook
Panel B
unpack water_in_pore_data.tar.bz2
cd into unpacked directory
run plot-wat-in-pore.ipynb with jupyter notebook
F6 - HOLE (confirm)
yml provided
run plot-hole-and-wat.ipynb with jupyter notebook
F7 - Hydrophobic mismatch (confirm)
yml provided
unpack hydrophobic_mismatch_data.tar.bz2
cd into unpacked directory
run plot-mismatch-prot-shell-bulk.ipynb with jupyter notebook
SOURCE-Landscape_Figures.tar.bz2
Contains F2, SF15, and SF17
F2 - tica_FEL
yml provided
cd $LIPID
unpack *.tar.bz2
cd into unpacked directory
python plot-tica-rev.py
SF15 - Bootstrap
yml provided
cd $LIPID
python plot-bt-220306-maxE3.py
SF17 - Representative Traj
yml provided
cd $LIPID
unpack *.tar.bz2
cd into unpacked directory
python plot-bt-220306-maxE3.py
SOURCE-SI_no_FEL_ORD.tar.bz2
Contains SF1-SF14, SF18, SF21, SF22-23-24, SF25-26-27, SF28
SF2 to SF4 - Post MSM tICA correlation, ITS, and Raw Counts/Reweighting
yml provided
cd $LIPID
python $SCRIPT.py (within directory)
SF5 to SF14 - CK test
yml provided
python post-msm-ck.py
SF18 - APL
yml provided
use jupyter notebook to compile each independent notebook
SF19-29-30-31-32-33
use jupyter notebook to compile plot-ord-vs-wat.ipynb
SF21
yml provided
cd $LIPID
unpack *.tar.bz2
cd into unpacked directory
python plot-wat-res-time-230509.py
SF22-23-24 - MemProtMD
yml provided
jupyter notebook memprotmd_heatmaps.ipynb
SF25-26-27_PyLipID
yml provided
jupyter notebook pylipid_heatmaps.ipynb
SF28 - Radial FP
yml provided
jupyter notebook radial_bar_chart_plotly.ipynb
SOURCE-SF19-29-30-31-32-33.tar.bz2
Contains a notebook which can generate the above figures.
yml provided
jupyter notebook plot-ord-vs-wat.ipynb
Main Text Figure 8 requires direct availability of both trajectories and data. These directories are quite heavy, and have therefore been separated into different SOURCE-F8_Lipid_Order_*.tar.bz2 directories. Within each directory, do the following:
F8 - Lipid Order versus water transport
yml provided
cd $LIPID
unpack all child directories
have all $LIPID files in the same working directory
python plot-ord-lipyphilic.py
Methods
Molecular dynamics trajectories were generated using AMBER18 software on a local computational cluster. Resulting analysis files were created via Python scripts and/or Java-based software (i.e., Membrainy). For more information about specific codes, please refer to our GitHub repository: https://github.com/ShuklaGroup/Lipid_composition_on_AQP/
This work has been performed by Anh T. P. Nguyen, Austin T. Weigle & Diwakar Shukla at the University of Illinois Urbana-Champaign.
This Dryad dataset has been made available in compliance with the "minimum dataset" requirement per the Data Availability guidelines by Nature Communications.
For inquiries concerning further data availability (i.e., 29 TB of molecular dynamics trajectories), please contact Prof. Diwakar Shukla at diwakar@illinois.edu. Austin T. Weigle (https://orcid.org/0000-0002-1619-2452) prepared this Dryad repository.