Data from: Deep mutational scanning of HBV reveals a mechanism for cis preferential reverse transcription

Yu, Yingpu1; Kass, Maximilian A.1; Zhang, Mengyin1; Youssef, Noor2; Freije, Catherine A.1; Brock, Kelly P.2; Aguado, Lauren C.1; Seifert, Leon L.1; Venkittu, Sanjana1; Hong, Xupeng1; Shlomai, Amir1; de Jong, Ype1; Marks, Debbie S.2; Rice, Charles M.1; Schneider, William M.1

Published Feb 29, 2024 on Dryad. https://doi.org/10.5061/dryad.x3ffbg7qx

Data files

Feb 29, 2024 version files (39.83 MB total):

- R1.zip (39.82 MB)
- README.md (8.14 KB)

Abstract

Hepatitis B virus (HBV) is a small double-stranded DNA virus that chronically infects 296 million people. Over half of its compact genome encodes protein in two overlapping reading frames, and during evolution, multiple selective pressures can act on shared nucleotides. This study combines an RNA-based HBV cell culture system with deep mutational scanning to uncouple cis- and trans-acting sequence requirements in the HBV genome. The results support a leaky ribosome scanning model for polymerase translation, provide a fitness map of the HBV polymerase at single nucleotide resolution, and identify conserved prolines adjacent to the HBV polymerase termination codon that stall ribosomes. Further experiments indicated that stalled ribosomes tether the nascent polymerase to its template RNA, ensuring cis-preferential RNA packaging and reverse transcription of the HBV genome.


The purpose of this dataset is to provide the analysis software, the raw pre-processed experimental data files, and the processed result files used in the paper “Deep mutational scanning of HBV reveals a mechanism for cis preferential reverse transcription”.

For detailed information on the experimental design, please refer to the paper. Briefly, mutants of the hepatitis B virus (HBV) were generated (input population) and transfected into cell cultures. In cell culture, HBV mutants were either depleted or enriched based on the effects of their mutations (output population). Afterwards, both input and output populations were sequenced to quantify the enrichment or depletion of each HBV mutant. From these sequencing results, so-called codoncounts files were generated using the barcoded-subamplicon sequencing software from Jesse Bloom’s dms_tools2 software package. Simply put, these codoncounts files state how often a given mutant was present in the input or output population.

The codoncounts files represent the main raw pre-processed files in this Dryad dataset (/data/codoncounts/). Mutants of HBV are split into groups corresponding to certain regions of the HBV genome that were mutated separately. For example, one group is called “TP” (in e.g. “TP_plasmid1_codoncounts.csv”), meaning that mutations were generated in the terminal protein domain (TP) of HBV in that experiment. The term “plasmid” in “TP_plasmid1_codoncounts.csv” denotes an input population data file, while “cell” in “TP_cell1_codoncounts.csv” indicates an output population data file. The number 1 refers to replicate one. There are usually three replicates, though the number can vary.
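The naming convention above can be unpacked programmatically. The following is a minimal sketch, assuming every file follows the pattern `<region>_<plasmid|cell><replicate>_codoncounts.csv` shown in the examples; the function and class names are hypothetical and are not part of the dataset.

```python
import re
from typing import NamedTuple


class CodonCountsName(NamedTuple):
    region: str       # mutated region of the HBV genome, e.g. "TP"
    population: str   # "input" (plasmid) or "output" (cell)
    replicate: int    # replicate number


# Assumed pattern, inferred from names such as "TP_plasmid1_codoncounts.csv".
_PATTERN = re.compile(
    r"^(?P<region>.+)_(?P<kind>plasmid|cell)(?P<rep>\d+)_codoncounts\.csv$"
)


def parse_codoncounts_name(filename: str) -> CodonCountsName:
    """Split a codoncounts file name into region, population, and replicate."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected codoncounts file name: {filename!r}")
    # "plasmid" marks the input population, "cell" the output population.
    population = "input" if match["kind"] == "plasmid" else "output"
    return CodonCountsName(match["region"], population, int(match["rep"]))


print(parse_codoncounts_name("TP_plasmid1_codoncounts.csv"))
print(parse_codoncounts_name("TP_cell1_codoncounts.csv"))
```

File names that do not match the assumed pattern raise a `ValueError` rather than being silently misclassified.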

A Jupyter Notebook, JNote.ipynb, is the primary analysis software file. This notebook takes the codoncounts files and calculates the enrichment score of each mutant. Using those results, JNote.ipynb generates the majority of figures in the paper, including the heatmaps in Figures 2B and 3A. JNote.ipynb also generates corresponding data tables with the plotted numeric values for each figure in CSV file format (/Fig/CSV/).
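To illustrate what an enrichment score can look like, here is a minimal sketch that compares a mutant's frequency in the output population with its frequency in the input population on a log2 scale. This is an assumption made for illustration only: the exact formula, normalization, and any pseudocount used in JNote.ipynb are not described here, and the function name and `pseudocount` parameter are hypothetical.

```python
import math


def log2_enrichment(input_count: int, input_total: int,
                    output_count: int, output_total: int,
                    pseudocount: float = 0.5) -> float:
    """Illustrative log2 ratio of output frequency to input frequency.

    The pseudocount (an assumption, not taken from the notebook) keeps the
    ratio finite when a mutant has zero reads in one population.
    """
    input_freq = (input_count + pseudocount) / (input_total + pseudocount)
    output_freq = (output_count + pseudocount) / (output_total + pseudocount)
    return math.log2(output_freq / input_freq)


# A mutant with equal frequency before and after selection scores 0;
# positive values indicate enrichment, negative values depletion.
print(log2_enrichment(100, 1000, 100, 1000))
print(log2_enrichment(100, 1000, 10, 1000))
```

For the actual calculation used in the paper, refer to the notebook itself and the methods section.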

Additionally, a second Jupyter Notebook, proline_sliding_window.ipynb, performs a proline sliding window analysis related to Figure S5C in the paper. Please refer to the methods section of the paper for additional documentation. Furthermore, the file hbv_pol_model_parameter_details.txt contains configurations for the natural sequence analysis related to Figure S3B in the paper.

Description of the data and file structure

After unzipping, the dataset R1 contains:

JNote.ipynb - Main analysis software file. Takes codoncounts files (/data/codoncounts) as input to generate most figures and data tables (/Fig/CSV) in the paper.
/data/codoncounts - directory containing all codoncounts files. The first term denotes the region of the HBV genome that was mutated. The terms “plasmid” or “cell” denote the input or output populations, respectively. The number refers to the replicate. For example, “Core_cell1_codoncounts.csv” is one of the three replicate mutant libraries of the Core region in the HBV genome after cell culture selection. Column “site” corresponds to the amino acid site position; column “wildtype” to the wildtype codon; columns “AAA” to “TTT” to each possible codon at that site, with the sequencing read count given as the row value.
/data/other - directory containing additional support files necessary for JNote.ipynb
/Fig/CSV – directory containing the corresponding data tables of the figures in the paper with the plotted numeric values in CSV format. The figure numbering follows the numbering in the paper. Please refer to the figure legends in the paper for additional information.
- Fig_2B.csv – Deep mutational scanning of the Core region: heatmap
  
  X-axis -> column “site” -> amino acid site position
  
  Y-axis -> column “variable” -> variant mutation codon
  
  Color scale -> column “logmfactor” -> log2 enrichment factor
- Fig_2C.csv – Deep mutational scanning of the Core region: group analysis I
  
  X-axis -> column “group” -> certain variants grouped
  
  Y-axis -> column “logmfactor” -> log2 enrichment factor
- Fig_2D.csv – Deep mutational scanning of the Core region: group analysis II
  
  X-axis -> column “group” -> certain variants grouped
  
  Y-axis -> column “logmfactor” -> log2 enrichment factor
- Fig_3A.csv – Deep mutational scanning of Polymerase: heatmap
  
  X-axis -> column “site” -> amino acid site position
  
  Y-axis -> column “variable” -> variant mutation codon
  
  Color scale -> column “logmfactor” -> log2 enrichment factor
- Fig_3D.csv – Deep mutational scanning of Polymerase: group analysis
  
  X-axis -> column “group” -> certain variants grouped
  
  Y-axis -> column “norm_abs_logmfactor” -> selective pressure
- Fig_S1B.csv – Nucleotide mutation type bias analysis
  
  X-axis -> column “type” -> single nucleotide mutation type
  
  Y-axis -> column “frac” -> percentage of mutants
- Fig_S1C.csv – Deep mutational scanning of the Core region: input plasmid counts heatmap
  
  X-axis -> column “site” -> amino acid site position
  
  Y-axis -> column “variable” -> variant mutation codon
  
  Color scale -> column “pre_frac” * 1E6 -> log10 plasmid counts per 1E6 reads
- Fig_S2B.csv – Deep mutational scanning of the Core region: Kozak analysis I
  
  X-axis -> column “kozakscore” -> Kozak score of C1
  
  Y-axis -> column “logmfactor” -> log2 enrichment factor of variants affecting C1 Kozak
- Fig_S2C.csv – Deep mutational scanning of the Core region: Kozak analysis II
  
  X-axis -> column “kozakscore” -> Kozak score of J ORF start
  
  Y-axis -> column “logmfactor” -> log2 enrichment factor of variants affecting J ORF start Kozak
- Fig_S2D.csv – Deep mutational scanning of the Core region: Kozak analysis III
  
  X-axis -> column “group” -> Group C/T or A/G
  
  Y-axis -> column “logmfactor” -> log2 enrichment factor
- Fig_S3A.csv – Deep mutational scanning of Polymerase: input plasmid counts heatmap
  
  X-axis -> column “site” -> amino acid site position
  
  Y-axis -> column “variable” -> variant mutation codon
  
  Color scale -> column “pre_frac” * 1E6 -> log10 plasmid counts per 1E6 reads
- Fig_S3B.csv – Fitness of polymerase variants obtained from natural sequences: heatmap
  
  X-axis -> column “pos” -> amino acid site position
  
  Y-axis -> column “mutant” -> variant mutation codon
  
  Color scale -> column “prediction_independent” -> log2 factor natural enrichment
PDB_files_Pol.zip - contains three-dimensional models of HBV polymerase (Pol) generated using AlphaFold2, related to Figures 3B and 3C in the paper.
hbv_pol_model_parameter_details.txt - parameters for the analysis of natural sequences in Figure S3B.
proline_sliding_window.ipynb - second Jupyter Notebook for a proline sliding window analysis for Figure S5C in the paper.
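
The codoncounts column layout described above (“site”, “wildtype”, then one read-count column per codon) can be read with the Python standard library alone. The snippet below is a minimal sketch using a made-up two-site excerpt with only three codon columns; the real files have 64 codon columns (“AAA” to “TTT”), and the function name and example values are hypothetical.

```python
import csv
import io

# Hypothetical excerpt: two sites and three of the 64 codon columns.
EXAMPLE = """site,wildtype,AAA,ATG,GAC
1,ATG,2,950,1
2,GAC,0,3,875
"""


def read_codoncounts(handle):
    """Return {site: {"wildtype": codon, "counts": {codon: reads}}}."""
    table = {}
    for row in csv.DictReader(handle):
        site = int(row.pop("site"))
        wildtype = row.pop("wildtype")
        # Every remaining column is a codon whose value is a read count.
        counts = {codon: int(reads) for codon, reads in row.items()}
        table[site] = {"wildtype": wildtype, "counts": counts}
    return table


counts = read_codoncounts(io.StringIO(EXAMPLE))

# Reads at site 1 that carry any codon other than the wildtype codon.
site1 = counts[1]
mutant_reads = sum(
    reads for codon, reads in site1["counts"].items()
    if codon != site1["wildtype"]
)
print(site1["wildtype"], mutant_reads)
```

To read an actual file, pass an open file handle (for example `open("data/codoncounts/Core_cell1_codoncounts.csv", newline="")`) in place of the `io.StringIO` object.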

Sharing/Access information

Link to other publicly accessible locations of the data:

https://github.com/HBV-DMS/R1

Code/Software

The main Jupyter Notebook, JNote.ipynb, can be run to re-analyze the experimental data starting from the codoncounts files and to regenerate the majority of figures in the paper. The simplest way to run it is in Google Colab: open https://colab.research.google.com/github/HBV-DMS/R1/blob/main/JNote.ipynb and follow the instructions in the notebook.