Data from: A scalable model for simulating multi-round antibody evolution and benchmarking of clonal tree reconstruction methods
Data files
Jun 11, 2023 version files 188.68 MB
Abstract
Affinity maturation (AM) of B cells through somatic hypermutations (SHMs) enables the immune system to evolve to recognize diverse pathogens. The accumulation of SHMs leads to the formation of clonal lineages of antibody-secreting B cells that have evolved from a common naïve B cell. Advances in high-throughput sequencing have enabled deep scans of B cell receptor repertoires, paving the way for reconstructing clonal trees. However, it is not clear if clonal trees, which capture microevolutionary time scales, can be reconstructed using traditional phylogenetic reconstruction methods with adequate accuracy. In fact, several clonal tree reconstruction methods have been developed to fix supposed shortcomings of phylogenetic methods. Nevertheless, no consensus has been reached regarding the relative accuracy of these methods, partially because evaluation is challenging. Benchmarking the performance of existing methods and developing better methods would both benefit from realistic models of clonal lineage evolution specifically designed for emulating B cell evolution. In this paper, we propose a model for simulating B cell clonal lineage evolution and use this model to benchmark several existing clonal tree reconstruction methods. Our model, designed to be extensible, has several features: by evolving the clonal tree and sequences simultaneously, it allows modeling selective pressure due to changes in affinity binding; it enables scalable simulations of large numbers of cells; it enables several rounds of infection by an evolving pathogen; and it models the building of memory. In addition, we suggest a set of metrics for comparing clonal trees and measuring their properties.
Our results show that while maximum likelihood phylogenetic reconstruction methods can fail to capture key features of clonal tree expansion if applied naively, a simple post-processing of their results, in which short branches are contracted, leads to inferences that are better than those of alternative methods.
Methods
The data was created using the simulation method DimSIM. As described in the paper, the analyses include two sets of simulations, one based on real target antibodies (SARS-CoV2) and the other based on flu.
SARS-CoV2 simulations had 3-5 rounds with 50 replicates. For targets, we first selected all heavy chain sequences of human antibodies with IGHV1-58 and IGHJ3 from the Coronavirus Antibody Database that neutralize some variants of SARS-CoV2 and have 16 amino acids in their CDR3. For each upload date, we chose the antibody that neutralizes the most variants of SARS-CoV2, resulting in 14 sequences. We then randomly chose targets among them. The infection start date was set to be the upload date. Each round of infection was set to last 50 days. At the end of the simulations, we sampled ς = 50, 100, 200, 500 antibody-coding nucleotide sequences from the last round of infection and built the clonal tree.
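The target selection and final sampling steps described above can be sketched as follows. This is a minimal illustration, not the code used to produce the dataset: the record fields (`upload_date`, `neutralized`), the function names, and the fixed seed are all hypothetical, and ties within a date are broken by input order, which the source does not specify.

```python
import random
from collections import defaultdict


def pick_best_per_date(antibodies):
    """For each upload date, keep the antibody that neutralizes the most variants.

    `antibodies` is a list of dicts with the hypothetical keys
    "upload_date" (sortable string) and "neutralized" (list of variant names).
    Ties are resolved in favor of the first record seen for that date.
    """
    by_date = defaultdict(list)
    for ab in antibodies:
        by_date[ab["upload_date"]].append(ab)
    # Iterate over dates in sorted order so the output is deterministic.
    return [
        max(by_date[date], key=lambda ab: len(ab["neutralized"]))
        for date in sorted(by_date)
    ]


def sample_sequences(sequences, size, seed=0):
    """Draw `size` sequences without replacement (e.g. size = 50, 100, 200, 500).

    Raises ValueError if `size` exceeds the number of available sequences.
    """
    rng = random.Random(seed)  # assumed seed, only for reproducibility here
    return rng.sample(sequences, size)
```

Sampling without replacement is an assumption made for this sketch; the paper should be consulted for the exact sampling scheme.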
For flu simulations, we performed several simulations with r = 56 rounds of flu, using sequences of the hemagglutinin (HA) protein. We used the NCBI Influenza Virus Resource, which includes HA sequences from influenza B virus. We selected 59 HA sequences belonging to flu infections in Hong Kong, among which 56 had the same length (584 aa). These were used to find the targets using an evolutionary approach described in the paper. We set up four experiments, varying one or two parameters in each experiment and setting the remaining ones to default values. Details of these conditions are given in Tables 1 and 3 of the paper.
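The length filter applied to the HA sequences (59 selected, 56 sharing the length 584 aa) can be illustrated with a short sketch. It assumes that the retained sequences are those with the most common length; the function name and the toy input are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def keep_modal_length(sequences):
    """Return (modal_length, sequences having that length).

    Assumes the most frequent length is the one to keep, as in the
    56-of-59 HA sequences of length 584 aa described above.
    """
    counts = Counter(len(seq) for seq in sequences)
    modal_length, _ = counts.most_common(1)[0]
    kept = [seq for seq in sequences if len(seq) == modal_length]
    return modal_length, kept
```

Applied to the 59 Hong Kong HA sequences, a filter of this kind would be expected to return 584 and the 56 sequences that match the r = 56 simulated rounds.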