Epistasis facilitates functional evolution in an ancient transcription factor
Data files
Jan 05, 2024 version files 6.31 GB
Abstract
A protein’s genetic architecture – the set of causal rules by which its sequence determines its specific functions – also determines the functional impacts of mutations and the protein’s evolutionary potential. Prior research has proposed that proteins’ genetic architecture is very complex, with pervasive epistatic interactions that constrain evolution and make function difficult to predict from sequence. Most of this work has considered only the amino acid states present in two sequences of interest and the direct paths between them, but real proteins evolve in a multidimensional space of 20 possible amino acids per site. Moreover, almost all prior work has assayed the effect of sequence variation on a single protein function, leaving unaddressed the genetic architecture of functional specificity and its impacts on the evolution of new functions. Here we develop a new logistic regression-based method to directly characterize the global causal rules of the genetic architecture of multiple protein functions from 20-state combinatorial deep mutational scanning (DMS) experiments. We apply it to dissect the genetic architecture and evolution of a transcription factor’s specificity for DNA, using data from a combinatorial DMS of an ancient steroid hormone receptor’s capacity to activate transcription from two biologically relevant DNA elements. We show that the genetic architecture of DNA recognition and specificity consists of a dense set of main and pairwise effects that involve virtually every possible amino acid state in the protein-DNA interface, but higher-order epistasis plays only a tiny role. Pairwise interactions enlarge the set of functional sequences and are the primary determinants of specificity for different DNA elements. Epistasis also massively expands the number of opportunities for single-residue mutations to switch specificity from one DNA target to another. By bringing variants with different functions close together in sequence space, pairwise epistasis therefore facilitates rather than constrains the evolution of new functions.
README: Epistasis facilitates functional evolution in an ancient transcription factor
https://doi.org/10.5061/dryad.jsxksn0hk
This directory contains initial and intermediate datasets computed by the analysis pipeline. All .rda files can be loaded into R or RStudio.
Description of the data and file structure
Initial Data Files
AA.SEQ.rda - Sequence code for every genotype. Row names are the sequence, with the RE listed at the beginning (E: ERE, S: SRE). Columns 1-5 give the RE or amino acid state for each site (X1-X4). Remaining columns are indicator variable for the amino acid states at each site (1: present, 0:absent)
DT.11P.CODING.rda - Initial data file from Starr et al. 2017. Contains genotype sequence; normalized counts of each genotype in each sorting bin for each RE for two replicates (RE, replicate, bin); estimate of number of colony forming units for each genotype, sorting bin, RE, and replicate; estimate of mean fluorescence for each replicate and an estimate based on combining replicates; class (null, weak, strong) estimates for each replicate and an estimate based on combining replicates; specificity class (null, promiscuous, ERE-specific, SRE-specific) based on class on ERE and SRE; and estimate of standard error around mean fluorescence estimates. Columns for RE.prediction and RE.full.class were removed and not used. Models were fit to RE.pooled.class data.
DT.JOINT.rda - Processed input data. Each row is a specific genotype on a particular RE. Columns give sequence; amino acid at each site; estimate of cfus; mean fluorescence estimate; standard error of estimate; and class values used for model fitting.
DT.JOINT.APPEND.rda - Processed input data additionally containing model predictions for class, specificity (joint) class, and genetic score (link) of each genotype on each RE for different models. TE (Third order epistasis) first + second + third order model; PE (pairwise epistasis) first + second order model; ME (main effects) first order model.
EFFECT.TABLE.TE.rda - Gives the sequence and genetic score (link) of every sequence for all genotypes on both REs. In addition, provides the individual component of each genetic score for the model with all first, second, and third order interactions. Columns include the intercepts for binding (B0) and specificity (S0), the first order effects (BX1-BX4, SX1-SX4), second order effects (BX1X2-BX3X4, SX1X2-SX3X4), and third order effects (BX1X2X3-BX2X3X4, SX1X2X3-SX2X3X4).
Model Fits
ME.MATRIX.rda; PE.MATRIX.rda; TE.MATRIX.rda - Sparse model matrices for fitting data. Each matrix represents the reference-free approach, using indicator variables for the presence of a particular term (column) in a particular genotype (row). M,P,T refer to the first order, first + second, and first+second+third order models respectively.
ME.L1.MODEL.rda; ME.L2.MODEL.rda; PE.L1.MODEL.rda; PE.L2.MODEL.rda; TE.L1.MODEL.rda; TE.L2.MODEL.rda - Model fits using modified glmnetpo functions. L1 models use lasso penalty while L2 models use ridge regression. Provides degrees of freedom, percent of deviance explained, and lambda value for the full set of lambda values fit. Also provides parameter estimates at each lambda value.
ME.L1.CV.rda; ME.L2.CV.rda; PE.L1.CV.rda; PE.L2.CV.rda; TE.L2.CV.rda - Repeated model fits using cross-validation. Each cross validation is 10-fold and fit to all folds. This is repeated 10 times using different partitions of genotypes into folds, with each replicate a different list element.
ME.COEFS.ADJ.rda; ME.COEFS.EFFECT.ADJ.B0.rda; ME.COEFS.EFFECT.ADJ.S0.rda; PE.COEFS.ADJ.rda; PE.COEFS.EFFECT.ADJ.B0.rda; PE.COEFS.EFFECT.ADJ.S0.rda; TE.COEFS.ADJ.rda; TE.COEFS.EFFECT.ADJ.B0.rda; TE.COEFS.EFFECT.ADJ.S0.rda - Amino acid and interaction coefficient estimates from model fits. Adjusted so that marginals sum to zero and thus conform to the reference-free framework. Intercepts for binding (B0) and specificity (S0). Other coefficients are named with site (X1-X4) and state (A-Y).
ME.THRESH.NULL.rda; ME.THRESH.WEAK.rda; PE.THRESH.NULL.rda; PE.THRESH.WEAK.rda; TE.THRESH.NULL.rda; TE.THRESH.WEAK.rda - Fitted threshold values between null and weak (NULL) and weak and strong (WEAK) categories for different models.
PREDICT.ME.CLASS.rda; PREDICT.PE.CLASS.rda; PREDICT.TE.CLASS.rda; PREDICT.ME.LINK.rda; PREDICT.PE.LINK.rda; PREDICT.TE.LINK.rda; - Predicted classes (CLASS) and genetic scores (LINK) of all genotypes under different models.
TE.COEF.MATRIX.ADJ.rda - Same as TE.MATRIX.rda but replaces indicator variables with the estimated effect from the first + second + third order model.
AA.EFFECTS.START.MG.rda; AA.EFFECTS.START.PG.rda; AA.EFFECTS.START.TG.rda; AA.EFFECTS.STOP.MG.rda; AA.EFFECTS.STOP.PG.rda; AA.EFFECTS.STOPTG.rda - Files for determining the effect of every possible mutation on every possible genetic background. Each row is a mutation at a particular site (X1-X4) from a particular initial state (A-Y) to another particular mutated state (A-Y), resulting in 20 x 19 x 4 = 1520 possible mutation 'types'. Each column is a particular genetic background, i.e. combination of amino acid states at the other three sites, resulting in 20 x 20 x 20 = 16,000 backgrounds. The START files give the phenotype of the genetic background with the initial state at the particular site. The STOP files give the phenotype of the genetic background with the mutated state at the particular site. The effect of a mutation is calculated by subtracting the STOP files from the START files. MG is the first-order only model, PG the model with first- and second-order effects, and TG the model with the first-, second-, and third-order effects.
Network effects
NG.ACT.ADJM.rda; MG.ACT.ADJM.rda; PG.ACT.ADJM.rda; TG.ACT.ADJM.rda; NH.ACT.ADJM.rda; MH.ACT.ADJM.rda; PH.ACT.ADJM.rda; TH.ACT.ADJM.rda; G.INT_ERE.IND_GENO.ADJM.rda - Sparse adjacency matrices. Rows and columns are genotypes. Values are 1 if genotypes are considered adjacent or 0 if not. N,M,P,T refer to different models. The N datasets contains all genotypes, regardless of function. The M datasets contains only non-null genotypes under the first order model. The P datasets contains only non-null genotypes under the first and second order model. The T datasets contains only non-null genotypes under the first, second, and third order model. The G datasets have genotypes connected via the genetic code. To account for disjoint serine codons, a 21st amino acid Z is introduced and all combinations of genotypes with serine are produced with S and Z. These genotypes have the same phenotype, but are connected differently due to the genetic code. The H dataset connects genotypes via hamming distance. The INT_ERE dataset connects only those genotypes that are activators in all networks.
NG.ACT.DIST.rda; MG.ACT.DIST.rda; PG.ACT.DIST.rda; TG.ACT.DIST.rda; NH.ACT.DIST.rda; MH.ACT.DIST.rda; PH.ACT.DIST.rda; TH.ACT.DIST.rda; G.INT_ERE.IND_GENO.DIST.rda - Same as adjacency matrices, but gives shortest path distance from one genotype to another. Genotypes that do not connect are given NA.
NG.ACT.NET.rda; MG.ACT.NET.rda; PG.ACT.NET.rda; TG.ACT.NET.rda; NH.ACT.NET.rda; MH.ACT.NET.rda; PH.ACT.NET.rda; TH.ACT.NET.rda; G.INT_ERE.IND_GENO.NET.rda - Same as adjacency matrices, but structured for igraph.
ME.ONE.STEP.rda; PE.ONE.STEP.rda; TE.ONE.STEP.rda - Set of single mutations that move from an ERE-specific to an SRE-specific genotype. Columns give the genetic score for the initial (SOURCE) and destination (TARGER) genotype on both ERE and SRE. The contribution of first order terms (MAIN) is given for changes on both REs. For PE and TE models, additional columns give the contribution of second or second and third order effects (EPI).
NG.ERE.SRE.DIST.rda; MG.ERE.SRE.DIST.rda; PG.ERE.SRE.DIST.rda; TG.ERE.SRE.DIST.rda; NH.ERE.SRE.DIST.rda; MH.ERE.SRE.DIST.rda; PH.ERE.SRE.DIST.rda; TH.ERE.SRE.DIST.rda - Shortest distance between ERE-specific (rows) and SRE-specific (columns) genotypes. G is for connections given the genetic code, H is for hamming distance.
NG.ERE.SRE.MIN.DIST.rda; MG.ERE.SRE.MIN.DIST.rda; PG.ERE.SRE.MIN.DIST.rda; TG.ERE.SRE.MIN.DIST.rda; NH.ERE.SRE.MIN.DIST.rda; MH.ERE.SRE.MIN.DIST.rda; PH.ERE.SRE.MIN.DIST.rda; TH.ERE.SRE.MIN.DIST.rda - Distance for each ERE-specific genotype to the nearest SRE-specific genotype.
NG.NUM.PATHS.rda; MG.NUM.PATHS.rda; PG.NUM.PATHS.rda; TG.NUM.PATHS.rda; NH.NUM.PATHS.rda; MH.NUM.PATHS.rda; PH.NUM.PATHS.rda; TH.NUM.PATHS.rda - Number of unique shortest paths between ERE-specific (rows) and SRE-specific (columns) genotypes. A path is unique the set of genotypes traversed is unique.
NG.ERE.DISTINCT.rda; MG.ERE.DISTINCT.rda; PG.ERE.DISTINCT.rda; TG.ERE.DISTINCT.rda; NH.ERE.DISTINCT.rda; MH.ERE.DISTINCT.rda; PH.ERE.DISTINCT.rda; TH.ERE.DISTINCT.rda - Number of distinct shortest paths between ERE-specific (rows) and SRE-specific (columns) genotypes. This metric accounts for overlap of paths using many of the same genotypes. It thus approximates the number of distinct routes that have no overlapping genotypes.
ALL.ACT.gexf; ME.TE.ACT.gexf - Graph of connected activators in any model (ALL) or only those in either the ME or TE models. Meant for visualization of connected genotypes using gephi. Contains attributes to determine which model each genotype is an activator in. The former allows one to compare across all models while the later allows one to look at differences with and without epistasis.
Evolutionary simulations
MG.ALL_ERE.ALL_GENO.SIMULATE.rda; PG.ALL_ERE.ALL_GENO.SIMULATE.rda; TG.ALL_ERE.ALL_GENO.SIMULATE.rda; MH.ALL_ERE.ALL_GENO.SIMULATE.rda; PH.ALL_ERE.ALL_GENO.SIMULATE.rda; TH.ALL_ERE.ALL_GENO.SIMULATE.rda; - Results of evolutionary simulations for each ERE-specific genotypes in each model. Simulations have no selection for specificity. Each file has five components stored in a list. The first component gives the number of steps for each starting genotype (columns) for each replicate simulation (row). The second component lists the ending SRE-specific genotype for each starting genotype (columns) for each replicate (row). The third component gives the complete path taken. The fourth component lists neighboring genotype for each step in each path. The fifth component lists the specificity class of each neighboring genotype for each step in each path.
MG.INT_ERE.ALL_GENO.SIMULATE.rda; PG.INT_ERE.ALL_GENO.SIMULATE.rda; TG.INT_ERE.ALL_GENO.SIMULATE.rda; MH.INT_ERE.ALL_GENO.SIMULATE.rda; PH.INT_ERE.ALL_GENO.SIMULATE.rda; TH.INT_ERE.ALL_GENO.SIMULATE.rda - Same as above, but additional simulations for the set of ERE-specific genotypes found in all three models.
MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N3.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N5.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N10.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N17.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N31.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N56.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N100.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N177.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N316.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N562.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1000.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1778.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N3162.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N5263.rda; MG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N10000.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N3.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N5.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N10.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N17.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N31.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N56.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N100.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N177.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N316.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N562.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1000.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1778.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N3162.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N5263.rda; PG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N10000.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N3.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N5.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N10.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N17.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N31.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N56.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N100.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N177.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N316.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N562.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1000.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N1778.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N3162.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N5263.rda; TG.ALL_ERE.ALL_GENO.SEL.SIMULATE.N10000.rda - Same as above, but simulations are under selection for increased specificity. Strength of selection is controlled by population size (N).
PG_MG.INT_ERE.IND_GENO.STEPS.p.rda; TG_MG.INT_ERE.IND_GENO.STEPS.p.rda; TG_PG.INT_ERE.IND_GENO.STEPS.p.rda; PH_MH.INT_ERE.IND_GENO.STEPS.p.rda; TH_MH.INT_ERE.IND_GENO.STEPS.p.rda; TH_PH.INT_ERE.IND_GENO.STEPS.p.rda - p-values from permutation tests of differences in path length for simulations for ERE-specific genotypes found in all three models.
MG.ALL_ERE.ALL_GENO.NEIGHBOR.CLASS.STEP.rda; PG.ALL_ERE.ALL_GENO.NEIGHBOR.CLASS.STEP.rda; TG.ALL_ERE.ALL_GENO.NEIGHBOR.CLASS.STEP.rda; MG.INT_ERE.ALL_GENO.NEIGHBOR.CLASS.STEP.rda; PG.INT_ERE.ALL_GENO.NEIGHBOR.CLASS.STEP.rda; TG.INT_ERE.ALL_GENO.NEIGHBOR.CLASS.STEP.rda - Analogous to the fifth component of the simulations.
Code/Software
Code for running analyses can be found at https://github.com/JoeThorntonLab/DBD.GeneticArchitecture