B cell receptor parent-child pairs for studying somatic hypermutation
Data files
Dec 17, 2024 version files (22.15 MB total)
- for_dryad.zip (22.14 MB)
- README.md (7.74 KB)
Abstract
Somatic hypermutation (SHM) is the diversity-generating process in antibody affinity maturation. Probabilistic models of SHM are needed for analyzing rare mutations, understanding the selective forces guiding affinity maturation, and understanding the underlying biochemical process. High-throughput data offer the potential to develop and fit SHM models on relevant data sets. Here we develop several out-of-frame and synonymous-mutation datasets using the strategy of
Spisak, N., Walczak, A. M., & Mora, T. (2020). Learning the heterogeneous hypermutation landscape of immunoglobulins from high-throughput repertoire data. Nucleic Acids Research, 48(19), 10702–10712. https://doi.org/10.1093/nar/gkaa825
for inferring parent-child pairs of sequences.
We apply this to data from the following studies:
Briney, B., Inderbitzin, A., Joyce, C., & Burton, D. R. (2019). Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature, 566, 393–397. https://doi.org/10.1038/s41586-019-0879-y
Jaffe, D. B., Shahi, P., Adams, B. A., Chrisman, A. M., Finnegan, P. M., Raman, N., Royall, A. E., Tsai, F., Vollbrecht, T., Reyes, D. S., Hepler, N. L., & McDonnell, W. J. (2022). Functional antibodies exhibit light chain coherence. Nature, 611(7935), 352–357. https://doi.org/10.1038/s41586-022-05371-z
Tang, C., Krantsevich, A., & MacCarthy, T. (2022). Deep learning model of somatic hypermutation reveals importance of sequence context beyond hotspot targeting. iScience, 25(1), 103668. https://doi.org/10.1016/j.isci.2021.103668
Vergani, S., Korsunsky, I., Mazzarello, A. N., Ferrer, G., Chiorazzi, N., & Bagnara, D. (2017). Novel Method for High-Throughput Full-Length IGHV-D-J Sequencing of the Immune Repertoire from Bulk B-Cells with Single-Cell Resolution. Frontiers in Immunology, 8, 1157. https://doi.org/10.3389/fimmu.2017.01157
README: B cell receptor parent-child pairs for studying somatic hypermutation
https://doi.org/10.5061/dryad.np5hqc044
Description of the data and file structure
This dataset contains B-cell receptor (BCR) sequence data from the following studies, reprocessed as described in https://www.biorxiv.org/content/10.1101/2024.11.26.625407v1:
Briney, B., Inderbitzin, A., Joyce, C., & Burton, D. R. (2019). Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature, 566, 393–397. https://doi.org/10.1038/s41586-019-0879-y
Jaffe, D. B., Shahi, P., Adams, B. A., Chrisman, A. M., Finnegan, P. M., Raman, N., Royall, A. E., Tsai, F., Vollbrecht, T., Reyes, D. S., Hepler, N. L., & McDonnell, W. J. (2022). Functional antibodies exhibit light chain coherence. Nature, 611(7935), 352–357. https://doi.org/10.1038/s41586-022-05371-z
Spisak, N., Walczak, A. M., & Mora, T. (2020). Learning the heterogeneous hypermutation landscape of immunoglobulins from high-throughput repertoire data. Nucleic Acids Research, 48(19), 10702–10712. https://doi.org/10.1093/nar/gkaa825
Tang, C., Krantsevich, A., & MacCarthy, T. (2022). Deep learning model of somatic hypermutation reveals importance of sequence context beyond hotspot targeting. iScience, 25(1), 103668. https://doi.org/10.1016/j.isci.2021.103668
Vergani, S., Korsunsky, I., Mazzarello, A. N., Ferrer, G., Chiorazzi, N., & Bagnara, D. (2017). Novel Method for High-Throughput Full-Length IGHV-D-J Sequencing of the Immune Repertoire from Bulk B-Cells with Single-Cell Resolution. Frontiers in Immunology, 8, 1157. https://doi.org/10.3389/fimmu.2017.01157
BCR sequences are clustered into clonal families and germline sequences are inferred.
For each clonal family, phylogenetic tree inference and reconstruction of ancestral sequences are performed.
The parent and child sequences of each branch of a tree form a "parent-child pair" (PCP), which is used for training or evaluating models.
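As a minimal illustration of how a single PCP might be consumed, the sketch below lists the substitutions between a parent and a child sequence. It assumes the two sequences are aligned and of equal length, and that positions where either sequence carries an N should be skipped. The function name `pcp_substitutions` and the example sequences are hypothetical and are not part of the dataset or the pipeline.

```python
def pcp_substitutions(parent, child):
    """Return (position, parent_base, child_base) for each substitution.

    Assumes parent and child are aligned, equal-length nucleotide strings.
    Positions where either sequence has an N are skipped (an assumption).
    Positions are 0-based.
    """
    if len(parent) != len(child):
        raise ValueError("parent and child must have the same length")
    return [
        (i, p, c)
        for i, (p, c) in enumerate(zip(parent.upper(), child.upper()))
        if p != c and p != "N" and c != "N"
    ]


# Made-up sequences for illustration only.
example_parent = "ACGTNACGT"
example_child = "ACCTAACGA"
example_subs = pcp_substitutions(example_parent, example_child)
print(example_subs)
```

Returning positions alongside the bases keeps the result usable for both mutation counts (`len(example_subs)`) and per-site analyses.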
Files and variables
File: for_dryad.zip
Description: Zip file containing all of the data sets.
The roles of these data sets can be found in the shm_data.py module of thrifty-experiments-1. Below is a summary of each data set:
data set name | description |
---|---|
shmoof | PCPs of out-of-frame BCR sequences from Spisak et al., who processed data from Briney et al. |
val_curatedShmoofNotbigNoN | A subset of individuals from the shmoof data; additionally, sites with undetermined nucleotides at the 5' and 3' ends are truncated from the sequences. |
tangshm | PCPs of out-of-frame BCR sequences from Tang et al. |
v1wyatt | PCPs of productive BCR sequences from Jaffe et al. |
syn10x | PCPs from v1wyatt where sites that are not 4-fold degenerate are masked with the N symbol. Therefore, the unmasked sites only experience synonymous mutations. |
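The masking described for syn10x can be sketched roughly as follows. This is not the procedure used to build the data set, only an illustration under stated assumptions: the reading frame starts at index 0, parent and child are aligned and of equal length, and a third codon position is kept only when parent and child share the same 4-fold degenerate codon prefix in the standard genetic code. The function name `mask_non_fourfold` and the example sequences are hypothetical.

```python
# First two codon bases for which any third base encodes the same amino acid
# in the standard genetic code: Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def mask_non_fourfold(parent, child):
    """Replace every site that is not 4-fold degenerate with N in both sequences.

    Assumptions (not taken from the dataset documentation): the reading frame
    starts at index 0, the sequences are aligned and of equal length, and a
    third codon position is kept only when parent and child share the same
    4-fold degenerate codon prefix.
    """
    parent, child = parent.upper(), child.upper()
    if len(parent) != len(child):
        raise ValueError("parent and child must have the same length")
    masked_parent = ["N"] * len(parent)
    masked_child = ["N"] * len(child)
    # Visit each complete codon; trailing partial codons stay masked.
    for start in range(0, len(parent) - 2, 3):
        prefix = parent[start:start + 2]
        if prefix in FOURFOLD_PREFIXES and child[start:start + 2] == prefix:
            third = start + 2
            masked_parent[third] = parent[third]
            masked_child[third] = child[third]
    return "".join(masked_parent), "".join(masked_child)


# Made-up in-frame sequences: codons GCT/ATG/CTA versus GCC/ATG/TTA.
masked_parent, masked_child = mask_non_fourfold("GCTATGCTA", "GCCATGTTA")
print(masked_parent, masked_child)
```

In this example only the third position of the first codon survives: the second codon (ATG) is not 4-fold degenerate, and the third codon's prefix differs between parent and child.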
Data files are CSV tables where each row corresponds to a PCP.
The meanings of the columns are as follows:
column name | description |
---|---|
sample_id | sample label, where a sample corresponds to an individual |
family | clonal family label within a sample |
parent_name * | label of the parent sequence |
parent | parent nucleotide sequence |
child_name * | label of the child sequence |
child | child nucleotide sequence |
branch_length | branch length computed by IQ-TREE |
depth * | number of edges between the child sequence and the naive sequence in the inferred tree |
distance * | sum of branch lengths between the child sequence and the naive sequence in the inferred tree |
v_gene | inferred germline V gene |
cdr1_codon_start * | position of the first nucleotide of the first codon in CDR1 |
cdr1_codon_end * | position of the first nucleotide of the last codon in CDR1 |
cdr2_codon_start * | position of the first nucleotide of the first codon in CDR2 |
cdr2_codon_end * | position of the first nucleotide of the last codon in CDR2 |
cdr3_codon_start * | position of the first nucleotide of the first codon in CDR3 (i.e. after the conserved Cys); may not make sense in out-of-frame context |
cdr3_codon_end * | position of the first nucleotide of the last codon in CDR3 (i.e. before the conserved Trp); may not make sense in out-of-frame context |
parent_is_naive * | True/False: whether the parent is the naive sequence |
child_is_leaf | True/False: whether the child is a leaf node (i.e. observed sequence) |
(*) Column is not present in the shmoof and val_curatedShmoofNotbigNoN data files.
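The column layout above could be read with Python's standard `csv` module along the following lines. This is a sketch rather than the pipeline's loader: the helper names `read_pcps` and `cdr_slice` are hypothetical, the demo table is made up, and the CDR positions are assumed to be 0-based, which should be checked against the data before use.

```python
import csv
import io

# Columns marked (*) above; absent from shmoof and val_curatedShmoofNotbigNoN.
OPTIONAL_COLUMNS = {
    "parent_name", "child_name", "depth", "distance",
    "cdr1_codon_start", "cdr1_codon_end",
    "cdr2_codon_start", "cdr2_codon_end",
    "cdr3_codon_start", "cdr3_codon_end",
    "parent_is_naive",
}


def read_pcps(handle):
    """Read PCP rows from an open CSV handle and report missing (*) columns."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    missing = OPTIONAL_COLUMNS - set(reader.fieldnames or [])
    return rows, missing


def cdr_slice(seq, codon_start, codon_end):
    """Extract a CDR given the first nucleotide of its first and last codons.

    Assumes 0-based positions (an assumption; verify against the data).
    The last codon spans codon_end .. codon_end + 2, hence the + 3.
    """
    return seq[int(codon_start):int(codon_end) + 3]


# Tiny made-up table with only the unstarred columns, mimicking the shmoof layout.
demo_csv = io.StringIO(
    "sample_id,family,parent,child,branch_length,v_gene,child_is_leaf\n"
    "s1,7,ACGTACGT,ACGTACGA,0.125,IGHV1-2*02,True\n"
)
demo_rows, demo_missing = read_pcps(demo_csv)
print(len(demo_rows), sorted(demo_missing))
```

Checking for the starred columns up front lets downstream code skip CDR or depth-based analyses for the data files that do not carry them. Note that `csv.DictReader` returns every value as a string, so numeric and True/False fields need explicit conversion.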
Code/software
These data are meant to be used with our thrifty-experiments-1 pipeline. Software dependencies and other requirements are described there.
Methods
The methods are described in the Materials and Methods section of https://www.biorxiv.org/content/10.1101/2024.11.26.625407v1