B cell receptor parent-child pairs for studying somatic hypermutation

Matsen IV, Frederick; Sung, Kevin; Johnson, Mackenzie

Published Dec 17, 2024; Updated Jun 02, 2025 on Dryad. https://doi.org/10.5061/dryad.np5hqc044

Data files

Dec 17, 2024 version files (22.15 MB)

for_dryad.zip (22.14 MB)
README.md (7.74 KB)

Jun 02, 2025 version files (75.53 MB)

for_dryad.v2.zip (75.52 MB)
README.md (8.20 KB)

Abstract

Somatic hypermutation (SHM) is the diversity-generating process in antibody affinity maturation. Probabilistic models of SHM are needed for analyzing rare mutations, understanding the selective forces guiding affinity maturation, and understanding the underlying biochemical process. High-throughput data offer the potential to develop and fit SHM models on relevant data sets. Here we develop several out-of-frame and synonymous-mutation data sets by inferring parent-child pairs of sequences with the strategy of:

Spisak, N., Walczak, A. M., & Mora, T. (2020). Learning the heterogeneous hypermutation landscape of immunoglobulins from high-throughput repertoire data. Nucleic Acids Research, 48(19), 10702–10712. https://doi.org/10.1093/nar/gkaa825

We apply this to data from the following studies:

Briney, B., Inderbitzin, A., Joyce, C., & Burton, D. R. (2019). Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature. https://doi.org/10.1038/s41586-019-0879-y

Jaffe, D. B., Shahi, P., Adams, B. A., Chrisman, A. M., Finnegan, P. M., Raman, N., Royall, A. E., Tsai, F., Vollbrecht, T., Reyes, D. S., Hepler, N. L., & McDonnell, W. J. (2022). Functional antibodies exhibit light chain coherence. Nature, 611(7935), 352–357. https://doi.org/10.1038/s41586-022-05371-z

Tang, C., Krantsevich, A., & MacCarthy, T. (2022). Deep learning model of somatic hypermutation reveals importance of sequence context beyond hotspot targeting. iScience, 25(1), 103668. https://doi.org/10.1016/j.isci.2021.103668

Vergani, S., Korsunsky, I., Mazzarello, A. N., Ferrer, G., Chiorazzi, N., & Bagnara, D. (2017). Novel Method for High-Throughput Full-Length IGHV-D-J Sequencing of the Immune Repertoire from Bulk B-Cells with Single-Cell Resolution. Frontiers in Immunology, 8, 1157. https://doi.org/10.3389/fimmu.2017.01157


Description of the data and file structure

These are reprocessed B-cell receptor (BCR) sequence data, as described in https://www.biorxiv.org/content/10.1101/2024.11.26.625407v1, from the following studies:

Briney, B., Inderbitzin, A., Joyce, C., & Burton, D. R. (2019). Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566, 393–397. https://doi.org/10.1038/s41586-019-0879-y

Jaffe, D.B., Shahi, P., Adams, B.A. et al. (2022). Functional antibodies exhibit light chain coherence. Nature 611, 352–357. https://doi.org/10.1038/s41586-022-05371-z

BCR sequences are clustered into clonal families, and germline sequences are inferred.
For each clonal family, phylogenetic tree inference and reconstruction of ancestral sequences are performed.

The parent and child sequences of each branch of a tree form a "parent-child pair" (PCP), which is used for training or evaluating models.
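As a minimal sketch of how PCPs arise from a tree, the snippet below enumerates one pair per branch of a toy tree. The tree encoding (a child-to-parent dictionary), the function name `parent_child_pairs`, and the sequences are all hypothetical and chosen for illustration; they are not taken from the actual pipeline.

```python
from typing import Dict, List, Optional, Tuple


def parent_child_pairs(
    parent_of: Dict[str, Optional[str]],
    sequences: Dict[str, str],
) -> List[Tuple[str, str, str, str]]:
    """Return (parent_name, parent_seq, child_name, child_seq) for every branch.

    `parent_of` maps each node name to its parent's name (None for the root).
    This encoding is an assumption made for the example.
    """
    pairs = []
    for child_name, parent_name in parent_of.items():
        if parent_name is None:
            # The root (the naive sequence) has no parent, so it yields no pair.
            continue
        pairs.append(
            (parent_name, sequences[parent_name], child_name, sequences[child_name])
        )
    return pairs


# Hypothetical toy tree: naive -> anc1 -> {leaf1, leaf2}
toy_parent_of = {"naive": None, "anc1": "naive", "leaf1": "anc1", "leaf2": "anc1"}
toy_sequences = {
    "naive": "ACGTACGT",
    "anc1": "ACGTACGA",
    "leaf1": "ACGTTCGA",
    "leaf2": "ACCTACGA",
}
toy_pcps = parent_child_pairs(toy_parent_of, toy_sequences)
```

A tree with four nodes has three branches, so the toy example yields three PCPs: one whose parent is the naive sequence and two whose children are leaves.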

Files and variables

File: for_dryad.zip

Description: Zip file containing all of the data sets.

The roles of these data sets can be found in the shm_data.py module of thrifty-experiments-1. Below is a summary of each data set:

| data set name | description |
| --- | --- |
| `shmoof` | PCPs of out-of-frame BCR sequences from Spisak et al., who processed Briney et al. |
| `val_curatedShmoofNotbigNoN` | A subset of individuals from the `shmoof` data; additionally, sites with undetermined nucleotides at the 5' and 3' ends are truncated from the sequences. |
| `tangshm` | PCPs of out-of-frame BCR sequences from Tang et al. |
| `syntang` | PCPs of productive BCR sequences from Tang et al. where sites that are not 4-fold degenerate are masked with the `N` symbol. Therefore, the unmasked sites only experience synonymous mutations. |
| `syn10x` | PCPs of productive BCR sequences from Jaffe et al. where sites that are not 4-fold degenerate are masked with the `N` symbol. Therefore, the unmasked sites only experience synonymous mutations. |
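To illustrate the masking idea behind `syntang` and `syn10x`, here is a minimal sketch that keeps only the third position of codons belonging to the eight 4-fold degenerate families of the standard genetic code and replaces every other site with `N`. The function name `mask_non_fourfold` and the assumption that the sequence starts in frame at position 0 are ours; the exact rule used to build the data sets (for example, how codons that differ between parent and child are handled) is not stated here.

```python
# First two bases of the codon families in which any third base encodes the
# same amino acid (standard genetic code): Ala, Arg, Gly, Leu, Pro, Ser, Thr, Val.
FOURFOLD_PREFIXES = {"GC", "CG", "GG", "CT", "CC", "TC", "AC", "GT"}


def mask_non_fourfold(seq: str) -> str:
    """Mask every site that is not a 4-fold degenerate third codon position.

    Assumes `seq` is in frame starting at index 0 (an assumption of this sketch).
    """
    masked = []
    for i in range(0, len(seq), 3):
        codon = seq[i : i + 3]
        if len(codon) == 3 and codon[:2] in FOURFOLD_PREFIXES:
            # Keep only the third base; the first two positions are not 4-fold.
            masked.append("NN" + codon[2])
        else:
            # Non-4-fold codons and trailing partial codons are fully masked.
            masked.append("N" * len(codon))
    return "".join(masked)


masked_example = mask_non_fourfold("GCTATGGGA")  # codons: GCT (Ala), ATG (Met), GGA (Gly)
```

In this example the third positions of `GCT` and `GGA` survive, while all of `ATG` is masked, because any substitution in `ATG` changes the encoded amino acid.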

Data files are CSV tables where each row corresponds to a PCP.
The meaning of the columns is as follows:

| column name | description |
| --- | --- |
| `sample_id` | sample label, where a sample corresponds to an individual |
| `family` | clonal family label within a sample |
| `parent_name`* | label of the parent sequence |
| `parent` | parent nucleotide sequence |
| `child_name`* | label of the child sequence |
| `child` | child nucleotide sequence |
| `branch_length` | branch length computed in IQ-TREE |
| `depth`* | number of edges between the naive sequence and the child sequence in the inferred tree |
| `distance`* | sum of branch lengths along the path from the naive sequence to the child sequence in the inferred tree |
| `v_gene` | inferred germline V gene |
| `cdr1_codon_start`* | position of the first nucleotide of the first codon in CDR1 |
| `cdr1_codon_end`* | position of the first nucleotide of the last codon in CDR1 |
| `cdr2_codon_start`* | position of the first nucleotide of the first codon in CDR2 |
| `cdr2_codon_end`* | position of the first nucleotide of the last codon in CDR2 |
| `cdr3_codon_start`* | position of the first nucleotide of the first codon in CDR3 (i.e., after the conserved Cys); may not make sense in out-of-frame context |
| `cdr3_codon_end`* | position of the first nucleotide of the last codon in CDR3 (i.e., before the conserved Trp); may not make sense in out-of-frame context |
| `parent_is_naive`* | True/False whether the parent is the naive sequence |
| `child_is_leaf` | True/False whether the child is a leaf node (i.e., observed sequence) |

(*) This column is absent from the `shmoof` and `val_curatedShmoofNotbigNoN` data files.
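As a rough sketch of how the tables might be consumed, the snippet below parses PCP rows with Python's standard `csv` module and counts substitutions per pair, skipping sites masked with `N`. The rows are made up for illustration and use only columns present in every data set; the helper `count_substitutions` is hypothetical, and the sketch assumes comma-delimited files in which `parent` and `child` have equal length.

```python
import csv
import io


def count_substitutions(parent: str, child: str) -> int:
    """Count sites where parent and child differ, ignoring sites masked with N."""
    if len(parent) != len(child):
        # Equal length is an assumption of this sketch, so fail loudly otherwise.
        raise ValueError("parent and child must have the same length")
    return sum(
        1
        for p, c in zip(parent, child)
        if p != c and p != "N" and c != "N"
    )


# Made-up rows for illustration; real files would be opened with open(path, newline="").
example_csv = """sample_id,family,parent,child,branch_length,v_gene,child_is_leaf
s1,0,ACGTACGT,ACGTACGA,0.125,IGHV1-2*02,True
s1,0,NNTNNNNNA,NNCNNNNNA,0.111,IGHV1-2*02,False
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
mutation_counts = [count_substitutions(r["parent"], r["child"]) for r in rows]
```

`csv.DictReader` returns every field as a string, so values such as `branch_length` and `child_is_leaf` need explicit conversion before numeric or boolean use.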

Code/software

These data are intended for use with our thrifty-experiments-1 pipeline. Software dependencies and related details are described there.

Version changes

2-June-2025: Added syntang dataset described above.