BTK autoinhibition analyzed by high-throughput swaps of SH2 domains
Abstract
BTK, a Tec-family tyrosine kinase, resembles the Src and Abl kinases in that an SH2-SH3 module regulates the activity of the kinase domain, principally through an inhibitory interaction between the SH3 and kinase domains. In Src-family kinases, phosphorylation of a C-terminal tail latches the SH2 domain onto the kinase domain, stabilizing the inhibitory conformation of the SH3 domain; in Abl, interaction between the kinase domain and a myristoyl group on the N-terminal segment provides a similar latching function. The structure of autoinhibited BTK resembles that of the Src and Abl kinases, but BTK lacks an obvious SH2-kinase latch. To assess the role of the SH2 domain in autoinhibition of BTK, we generated hundreds of chimeric BTK molecules in which the native SH2 domain is replaced by other SH2 domains. We measured the fitness of these chimeric proteins using a high-throughput assay in T and B cells. Surprisingly, many SH2 domains increased fitness when substituted into BTK. Analysis of one set of chimeric proteins indicates that the increase in fitness stems from the ability of the substituted SH2 domains to disrupt BTK autoinhibition while maintaining phosphotyrosine targeting. Thus, although BTK lacks a specialized latch, distributed interactions between the SH2 and kinase domains stabilize the autoinhibitory conformation of BTK. While phosphotyrosine recognition can be conferred on BTK by evolutionarily distant SH2 domains, autoinhibition requires specific interactions with the kinase domain that arose through evolutionary refinement of the regulatory mechanism, and is less easily mimicked by heterologous SH2 domains.
https://doi.org/10.5061/dryad.rxwdbrvjx
Description of the data and file structure
These datasets comprise RNA-seq data that quantify fitness scores for BTK or variants. One set of data (SH2) swaps 328 SH2 domains into the BTK protein. Another set (Helix I) swaps the C-terminal helix in the BTK kinase domain for 118 Tec-kinase C-terminal helices from jawed vertebrates. The third set (Abundance) measures the protein abundance of each SH2 chimera using fluorescently tagged variants of the proteins and cell sorting. Each replicate has an input (the denominator in the fitness calculation) and a sort sample (the CD69-selected reads, or the numerator in the calculation). In addition, the SH2 and Helix I experiments were performed in two human lymphocyte cell lines: Jurkat T cells and Ramos B cells that had been knocked out for ITK and BTK, respectively. The abundance measurements were only performed using the Jurkat cells.
In the accompanying data folder, there is an Excel spreadsheet ('DatasetMetadataFinal.xlsx') that describes the location and type of each raw and processed file. Each dataset has two fastq read files (for reads 1 and 2), one count file with tabulated read counts from the fastq files, one processed file with normalized fitness scores, and the nucleotide and protein sequences associated with each variant in each library.
Normalized data were generated with custom R and Python scripts available on GitHub (https://github.com/timeisen/MutagenesisPlotCode). Software to generate the SH2 domain swaps using ancestral sequence reconstruction is available on GitHub (https://github.com/timeisen/SH2s).
Files and variables
File: data.zip
Description: All data associated with this manuscript. The location and type of each file are described in 'DatasetMetadataFinal.xlsx'.
Within the data folder, there are five directories:
raw_fastq/
nucleotide_sequences/
count_data/
processed_data/
protein_sequences/
raw_fastq directory:
In the raw_fastq directory, there are three sub-directories that contain the raw fastq files for each of the three types of experiments published in this manuscript (abundance measurements, helix I fitness values, and SH2 fitness values).
Within these folders, there are fastq files with the following naming scheme, also described in the metadata Excel spreadsheet:
experiment _ cell type (Jurkat or Ramos cells) _ dataset (input, I, or eluate, E) _ replicate (A, B, C, or D) _ read (1 or 2)
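The naming scheme above can be split back into its fields programmatically. The following is a minimal sketch only: it assumes the five fields are joined by single underscores and that files end in .fastq, .fq, or .gz, neither of which is stated in this README. The function name parse_fastq_name and the example file name are hypothetical; please check the metadata spreadsheet for the real names.

```python
from typing import NamedTuple


class FastqName(NamedTuple):
    experiment: str
    cell_type: str   # "Jurkat" or "Ramos"
    dataset: str     # "I" (input) or "E" (eluate)
    replicate: str   # "A", "B", "C", or "D"
    read: int        # 1 or 2


def parse_fastq_name(filename: str) -> FastqName:
    """Split a file name into the five fields of the naming scheme.

    Assumption: fields are separated by underscores and the extension is
    one of .fastq, .fq, optionally followed by .gz.
    """
    stem = filename
    for ext in (".gz", ".fastq", ".fq"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
    # Split from the right so that an underscore inside the experiment
    # name (if any) does not shift the remaining four fields.
    parts = stem.rsplit("_", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    experiment, cell_type, dataset, replicate, read = parts
    if dataset not in ("I", "E"):
        raise ValueError(f"dataset must be I or E, got {dataset!r}")
    if read not in ("1", "2"):
        raise ValueError(f"read must be 1 or 2, got {read!r}")
    return FastqName(experiment, cell_type, dataset, replicate, int(read))


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(parse_fastq_name("SH2_Jurkat_I_A_1.fastq.gz"))
```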
Each fastq file contains read information. The read information consists of a repeating 4-line unit:
Line 1: an "@" sign followed by the unique read identifier and sample barcode
Line 2: the nucleotide sequence of the read
Line 3: a "+" sign
Line 4: the per-base quality scores
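For readers who want to inspect the reads directly, the 4-line unit described above can be read with a few lines of Python. This is an illustrative sketch using only the standard library, not the code used for this study (the reads were processed with Kallisto); the function name read_fastq is hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality string) for each 4-line record."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        if header.strip() == "":  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading "@" from the identifier line.
        yield header[1:].rstrip("\r\n"), sequence, quality


if __name__ == "__main__":
    # Two made-up records, for illustration only.
    demo = io.StringIO("@read1 1:N:0:ACGT\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
    for identifier, sequence, quality in read_fastq(demo):
        print(identifier, sequence, quality)
```

For gzip-compressed files, the same function can be given a handle opened with gzip.open(path, "rt").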
nucleotide_sequences directory:
The raw fastq files are aligned to custom-built indices with Kallisto (Bray et al., 2016). These indices are built from the fasta files in the nucleotide_sequences directory. This directory contains two files, one for each type of library analyzed for this manuscript (Helix I and SH2). Please note that the abundance measurements were performed for the SH2 libraries only.
Each fasta file contains a repeating unit of two datatypes:
Line 1: a ">" sign followed by the name of the sequence.
Lines 2-: the nucleotide letters associated with the named sequence in line 1. Please note that these letters can span multiple lines, and a new entry is designated only with another ">" sign.
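Because sequences can span several lines, a reader has to accumulate lines until the next ">" sign. The sketch below shows one way to do this with the Python standard library; it is an illustration, not part of the study's code, and the function name read_fasta is hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence name: sequence}, joining multi-line sequences."""
    sequences: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous entry, if there was one.
            if name is not None:
                sequences[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data found before any '>' line")
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    demo = [">variant_a", "ACG", "T", ">variant_b", "GG"]
    print(read_fasta(demo))
```

An open file handle can be passed directly, since iterating over a text file yields its lines.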
count_data directory:
The data in this directory consist of TSV files, which are the standard output from Kallisto for read-assignment data. Each file has five fields:
(1) "target_id" is the name of the nucleotide sequence that is used for the alignment. These names correspond to the sequences in the fasta files.
(2) "length" is the length of the variant nucleotide sequence.
(3) "eff_length" is the effective length of the sequence: the gene length minus the insert size. Please see the Kallisto study for a complete definition.
(4) "est_counts" is the number of mapped reads associated with each variant sequence.
(5) "tpm" (transcripts per million) is a measure of the relative abundance of each sequence that is commonly used in RNA-seq analysis. It normalizes the est_counts to the total number of reads and the length of each mapped sequence.
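The five fields above can be loaded with the Python standard library. The sketch below also recomputes tpm from est_counts and eff_length using the usual transcripts-per-million definition (each count divided by its effective length, then scaled so the values sum to one million); this is offered as a consistency check under that assumption, and the Kallisto documentation remains the authoritative definition. The function names are hypothetical.

```python
import csv
from typing import Dict, List, TextIO, Union

Row = Dict[str, Union[str, int, float]]


def read_kallisto_tsv(handle: TextIO) -> List[Row]:
    """Read a tab-separated count file with the five fields described above."""
    rows: List[Row] = []
    for record in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "target_id": record["target_id"],
            "length": int(record["length"]),
            "eff_length": float(record["eff_length"]),
            "est_counts": float(record["est_counts"]),
            "tpm": float(record["tpm"]),
        })
    return rows


def recompute_tpm(rows: List[Row]) -> List[float]:
    """Recompute tpm as (est_counts / eff_length), scaled to sum to 1e6."""
    rates = [row["est_counts"] / row["eff_length"] for row in rows]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [rate / total * 1e6 for rate in rates]


if __name__ == "__main__":
    import io
    # Made-up values, for illustration only.
    demo = io.StringIO(
        "target_id\tlength\teff_length\test_counts\ttpm\n"
        "variant_a\t300\t100\t10\t250000\n"
        "variant_b\t300\t100\t30\t750000\n"
    )
    table = read_kallisto_tsv(demo)
    print(recompute_tpm(table))
```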
processed_data directory:
This directory contains three files for the three high-throughput experiments analyzed for this study. Each file contains the processed fitness scores calculated from the individual count-data files.
NormalizedDataLookupTablesSH2Abundance.txt contains five fields:
(1) "basename" is the name of the sequence that is used to calculate the fitness scores.
(2) "resi_pos" is the amino-acid residue position in the original protein that corresponds to the SH2 domain.
(3) "control_type" is a flag (K, X, or WT) that designates whether the sequence contains a stop codon (X), a lysine substitution for an arginine at position 307 (K), or no substitution (WT).
(4) "Jurkat_string" is the fitness score in the Jurkat experiment +/- standard error.
(5) "Abundance_string" is the abundance score in the Jurkat cells +/- standard error.
NormalizedDataLookupTablesSH2.txt contains the same five fields as the previous file, except that field 5 contains the fitness scores in Ramos cells.
NormalizedDataLookupTableEpistasis.txt contains the fitness scores for the Helix I library and consists of 6 fields:
(1) "basename" is the name of the sequence that is used to calculate the fitness scores.
(2) "species" is the name of the animal species that contains the BTK or other Tec kinase helix I sequence.
(3) "Ramos_Fitness_Node_BMXC" is the fitness score for the Helix I sequence in the BMX-H SH2 background in Ramos cells. The +/- is standard error.
(4) "Ramos_Fitness_BTK" is the fitness score for the Helix I sequence in the BTK SH2 background in Ramos cells. The +/- is standard error.
(5) "Jurkat_Fitness_Node_BMXC" is the fitness score for the Helix I sequence in the BMX-H SH2 background in Jurkat cells. The +/- is standard error.
(6) "Jurkat_Fitness_BTK" is the fitness score for the Helix I sequence in the BTK SH2 background in Jurkat cells. The +/- is standard error.
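Several fields in the processed files store a score together with its standard error in a single text value. The sketch below splits such a value into two numbers. It assumes the separator is written either as "+/-" or as "±", with optional spaces around it; the exact formatting is not specified in this README, so the pattern may need adjusting after inspecting the files. The function name parse_score is hypothetical.

```python
import re
from typing import Optional, Tuple

_NUMBER = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
# Assumed layout: "<score> +/- <standard error>" (or with a "±" sign).
_SCORE_PATTERN = re.compile(
    rf"^\s*({_NUMBER})\s*(?:\+/-|±)\s*({_NUMBER})\s*$"
)


def parse_score(text: str) -> Optional[Tuple[float, float]]:
    """Return (score, standard error), or None if the text does not match."""
    match = _SCORE_PATTERN.match(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    # Made-up values, for illustration only.
    for value in ("-0.42 +/- 0.05", "1.2 ± 0.3", "NA"):
        print(repr(value), parse_score(value))
```

Returning None for unmatched text (for example a missing value) leaves the decision about how to treat such entries to the caller.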
protein_sequences directory:
This directory contains the protein sequences for the SH2 domains and I helices used in this study. Each sequence is in the same fasta format described above. These files contain amino-acid residue sequences.
Code/software
Code to process the read-count files and generate normalized fitness scores is available on GitHub (https://github.com/timeisen/MutagenesisPlotCode). Software to generate the SH2 domain swaps using ancestral sequence reconstruction is available on GitHub (https://github.com/timeisen/SH2s).
Quantification of fitness from sequencing data was performed as in Eisen et al., Sci Signal. 2024. Briefly, fastq files from MiSeq runs were aligned to the fasta files containing the full sequences of each variant using Kallisto (Bray, Nat. Biotech. 2016) to generate read counts for each variant. A read cutoff of 50 reads was applied to the input libraries such that any variant not passing this threshold was discarded. Next, the unnormalized scores were calculated by dividing the number of reads in the sorted dataset by the number of reads in the input dataset and taking the log10. These unnormalized scores were normalized by subtracting the mean of the wild-type fitness scores. The SH2-domain library included 22 synonymous wild-type sequences that were generated by randomly choosing 5 codons and replacing them with randomly chosen synonymous counterparts. Sequences that introduced additional BsaI restriction sites were avoided. In the helix I library, 44 synonymous wild-type sequences were included. Fitness scores were calculated by subtracting the mean of these synonymous sequences. Code to generate saturation-mutagenesis sequences and to analyze RNA-seq libraries was written using R and Python and is available on GitHub (https://github.com/timeisen/MutagenesisPlotCode).
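The calculation described above (input-read cutoff, log10 ratio of sorted to input reads, subtraction of the mean wild-type score) can be sketched as follows. This is a simplified illustration and not the code in the MutagenesisPlotCode repository. Two details are assumptions: variants with exactly 50 input reads are kept, and variants with zero sorted reads are skipped because their log10 ratio is undefined. The function name and the variant names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Iterable


def fitness_scores(
    input_counts: Dict[str, float],
    sorted_counts: Dict[str, float],
    wild_type_ids: Iterable[str],
    min_input_reads: float = 50,
) -> Dict[str, float]:
    """Return log10(sorted / input) per variant, minus the wild-type mean."""
    raw: Dict[str, float] = {}
    for variant, n_input in input_counts.items():
        if n_input < min_input_reads:
            continue  # variant fails the input-read cutoff
        n_sorted = sorted_counts.get(variant, 0)
        if n_sorted <= 0:
            continue  # assumption: skip variants whose log10 is undefined
        raw[variant] = math.log10(n_sorted / n_input)

    wild_type = [raw[v] for v in wild_type_ids if v in raw]
    if not wild_type:
        raise ValueError("no wild-type sequence passed the filters")
    wild_type_mean = mean(wild_type)
    return {variant: score - wild_type_mean for variant, score in raw.items()}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    input_reads = {"wt_syn1": 100, "wt_syn2": 100, "chimera_a": 100, "rare": 10}
    sorted_reads = {"wt_syn1": 100, "wt_syn2": 100, "chimera_a": 1000, "rare": 5}
    print(fitness_scores(input_reads, sorted_reads, ["wt_syn1", "wt_syn2"]))
```

Because the wild-type mean is subtracted in log space, any constant factor between the sorted and input libraries (such as a difference in sequencing depth) cancels out of the final scores.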
