Data from: Footprint of the host restriction factors APOBEC3 on the genome of human viruses
Gillet, Nicolas; Poulain, Florian (2020), Data from: Footprint of the host restriction factors APOBEC3 on the genome of human viruses, Dryad, Dataset, https://doi.org/10.5061/dryad.n8pk0p2sd
APOBEC3 enzymes are innate immune effectors that introduce mutations into viral genomes. These enzymes are cytidine deaminases that convert cytosine into uracil. They preferentially mutate cytidines preceded by a thymidine, making the 5’TC motif their favored target. Viruses have evolved different strategies to evade APOBEC3 restriction. Certain viruses encode proteins that actively antagonize the APOBEC3s; others passively withstand the APOBEC3 selection pressure because their genomes are depleted of APOBEC3-targeted motifs. Hence, the APOBEC3s have left an evolutionary footprint on the genomes of certain viruses.
The aim of our study is to identify the viruses whose genomes have been shaped by the APOBEC3s. We analyzed the genomes of 33,400 human viruses for depletion of APOBEC3-favored motifs. We demonstrate that the APOBEC3 selection pressure impacts at least 22% of all currently annotated human viral species. The papillomaviridae and polyomaviridae are the most intensively footprinted families, showing evidence of a selection pressure acting genome-wide and on both strands. Members of the parvoviridae family are differentially targeted in terms of both the magnitude and the localization of the footprint. Interestingly, a massive APOBEC3 footprint is present on both strands of the B19 erythroparvovirus, making this viral genome one of the sequences most thoroughly depleted of APOBEC3-favored motifs. We also identified the endemic coronaviridae as significantly footprinted. Interestingly, no such footprint was detected on the zoonotic MERS-CoV, SARS-CoV-1 and SARS-CoV-2 coronaviruses. In addition to viruses that are footprinted genome-wide, certain viruses are footprinted only on very short sections of their genome. That is the case for the gamma-herpesviridae and adenoviridae, where the footprint is localized on the lytic origins of replication. A mild footprint can also be detected on the negative strand of the reverse-transcribing viruses HIV-1, HIV-2, HTLV-1 and HBV.
Together, our data illustrate the extent of the APOBEC3 selection pressure on human viruses and identify new putatively APOBEC3-targeted viruses.
The APOBEC3 (A3) evolutionary footprint left on viral genomes is defined as the under-representation of A3-targeted motifs. Because most A3 proteins favor deamination of cytosine to uracil in a 5’TC dinucleotide context, we looked for under-representation of the 5’TC motif. We distinguish three K-mers containing the TC motif: one with the C in the first position of the codon (NNTCNN), one with the C in the second position of the codon (TCN) and one with the C in the third position of the codon (NTC). A3-induced deamination of a cytosine in a viral genome produces a uracil that can be fixed as a thymidine after genome replication. This transition has different consequences depending on the codon position of the mutated C. The C to T mutation will generally be non-synonymous if the C is at the first or second position of the codon. However, if the mutated C occupies the third position of the codon, the C to T mutation will always be synonymous. Therefore, A3-driven natural selection should deplete NTC codons more intensively than TCN or NNTCNN motifs (in which the C to U mutation generally alters the encoded amino acid). A3 editing can also target the template strand, where a C to T mutation translates into a G to A transition on the coding strand. Again, this transition has different consequences depending on the codon position of the mutated G. The G to A mutation will be non-synonymous if the G is at the first or second position of the codon (Fig 1A, GAN and NGA K-mers). However, if the mutated G occupies the third position of the codon (NNGANN K-mer), the mutation will be synonymous most of the time. Because synonymous mutations are presumably more likely to be retained than non-synonymous ones, we define the A3 footprint as the depletion of NTC or NNGANN K-mers.
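The codon-position argument above can be checked with a short script. This is only an illustrative sketch, not part of the published pipeline: it assumes the standard genetic code (NCBI translation table 1), and the function name synonymous_fraction is hypothetical. For a given substitution and codon position, it reports the fraction of sense codons whose amino acid is unchanged.

```python
from itertools import product

BASES = "TCAG"
# Assumption: standard genetic code (NCBI translation table 1), codons in TCAG order.
# '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(old, new, position):
    """Fraction of sense codons carrying `old` at `position` (0-based)
    for which the old -> new substitution keeps the same amino acid."""
    total = same = 0
    for codon, aa in CODON_TABLE.items():
        if codon[position] != old or aa == "*":
            continue  # skip codons without the base, and stop codons
        mutated = codon[:position] + new + codon[position + 1:]
        total += 1
        same += CODON_TABLE[mutated] == aa
    return same / total


if __name__ == "__main__":
    for old, new in (("C", "T"), ("G", "A")):
        for position in range(3):
            frac = synonymous_fraction(old, new, position)
            print(f"{old}->{new} at codon position {position + 1}: "
                  f"{frac:.3f} synonymous")
```

Under this assumed code table, C to T at the third codon position is synonymous for every codon, and G to A at the third position is synonymous for 13 of the 15 sense codons ending in G (the exceptions being ATG and TGG). At the first position, C to T is silent only for the leucine codons CTA and CTG, and at the second position it is never silent.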
We downloaded complete viral genomes from the “NCBI Virus” database (as released in April 2020) and computed observed vs expected K-mer ratios. Briefly, for each virus a synthetic coding genome was generated by concatenating its coding sequences, and the occurrences of a given K-mer were counted in it (n_obs(K-mer)). Each synthetic coding genome was then randomly shuffled a thousand times. The expected count (n_exp(K-mer)) is the average number of occurrences of the K-mer over the thousand iterations. A negative K-mer ratio indicates depletion of that K-mer. The observed vs expected ratio of the NTC K-mer is compared to those of the NNTCNN and TCN K-mers. Similarly, the observed vs expected ratio of the NNGANN K-mer is compared to those of the GAN and NGA K-mers.
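The observed vs expected computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the shuffle is assumed to permute individual nucleotides, the ratio is assumed to be log2(n_obs / n_exp) (chosen only because it is negative for depleted K-mers; the exact formula is not given here), and the names KMER_PHASE, count_kmer and kmer_ratio are hypothetical. Each K-mer is treated as a TC or GA dinucleotide starting at a fixed codon phase of the concatenated coding sequence.

```python
import math
import random

# Each K-mer is a dinucleotide starting at a given codon phase
# (index of its first base modulo 3) in the concatenated coding sequence.
KMER_PHASE = {
    "TCN": ("TC", 0), "NTC": ("TC", 1), "NNTCNN": ("TC", 2),
    "GAN": ("GA", 0), "NGA": ("GA", 1), "NNGANN": ("GA", 2),
}


def count_kmer(seq, kmer):
    """Count in-frame occurrences of `kmer` in a coding sequence."""
    dinuc, phase = KMER_PHASE[kmer]
    return sum(1 for i in range(phase, len(seq) - 1, 3)
               if seq[i:i + 2] == dinuc)


def kmer_ratio(seq, kmer, n_shuffles=1000, seed=0):
    """Return (n_obs, n_exp, ratio) for `kmer`.

    Assumptions: nucleotide-level shuffling, ratio = log2(n_obs / n_exp).
    """
    rng = random.Random(seed)
    n_obs = count_kmer(seq, kmer)
    bases = list(seq)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)  # in-place permutation of the nucleotides
        total += count_kmer("".join(bases), kmer)
    n_exp = total / n_shuffles
    if n_exp == 0:
        ratio = float("nan")   # K-mer cannot occur; ratio undefined
    elif n_obs == 0:
        ratio = float("-inf")  # complete depletion
    else:
        ratio = math.log2(n_obs / n_exp)
    return n_obs, n_exp, ratio


if __name__ == "__main__":
    toy_genome = "ATC" * 50  # toy sequence, every codon is an NTC codon
    for kmer in ("NTC", "TCN", "NNTCNN"):
        print(kmer, kmer_ratio(toy_genome, kmer, n_shuffles=200))
```

In the toy sequence every codon ends in TC, so NTC is strongly over-represented while TCN and NNTCNN are absent. A real analysis would instead pass the concatenated coding sequences of one viral genome and compare the NTC ratio with the TCN and NNTCNN ratios, and the NNGANN ratio with the GAN and NGA ratios.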
S1 Table: Genomic K-mer ratios for human viruses.
Observed/expected K-mer ratios for each genomic human viral sequence.
S2 Table: Genic K-mer ratios for human viruses.
Observed/expected K-mer ratios for each genic human viral sequence.
S3 Table: Genomic K-mer ratios for non-human viruses.
Observed/expected K-mer ratios for each genomic and genic non-human viral sequence.
Fonds De La Recherche Scientifique - FNRS, Award: 31270116
Fonds De La Recherche Scientifique - FNRS, Award: 34972507
Fonds De La Recherche Scientifique - FNRS, Award: 31454280