Understanding shared variation in SARS-CoV-2 genomes

Cite this dataset

Wyman, Stacia (2022). Understanding shared variation in SARS-CoV-2 genomes [Dataset]. Dryad.


The project is a collaborative effort of investigators from the University of California, Berkeley’s Innovative Genomics Institute (IGI) and School of Public Health (SPH); Kaiser Permanente Northern California (KPNC); and the California Department of Public Health (CDPH), with administrative and programmatic support provided by Heluna Health. Over the project period, the collaborating investigators will analyze approximately 35,000 genomes of SARS-CoV-2 specimens obtained from KPNC members and sequenced by the CDPH through its COVIDNet activities. By combining results from the genomic analysis of low-frequency alleles with clinical and epidemiologic data available in patient records, including demographic variables, COVID-19 vaccination status (dates of vaccination, number of doses, and manufacturer), COVID-19 disease severity, and underlying medical conditions, we will assess which shared genomic variations are associated with a greater risk of symptomatic infection and severe clinical outcomes, with COVID-19 vaccine effectiveness, and with transmission of SARS-CoV-2 in the household. The project and its results can serve as a model for community-based monitoring of the evolution and spread of SARS-CoV-2, and for using such data to inform decisions about the formulation and use of COVID-19 vaccines, including booster doses and next-generation vaccines.


Sample collection

Our samples come from Kaiser Permanente Northern California (KPNC) patients who tested positive for SARS-CoV-2 from June 1, 2021, through the present. The RNA is sent to the California Department of Public Health (CDPH) laboratory to be sequenced by COVIDNet, a consortium of primarily UC system labs helping CDPH with the overflow and backlog of samples. Once the genomes have been sequenced, the lineage information and the unique deidentified PAUI number are returned to Kaiser, where this information is recorded. Metadata for this list of PAUIs is sent weekly to UC Berkeley. The KPNC sequencing data is returned to us through a third party that processes all CDPH genomes; it is stored on a server at UC Berkeley and matched with the metadata using the PAUIs.

Sequence analysis

The raw sequencing data is processed through a SARS-CoV-2 analysis pipeline that has been modified for this work as follows. Adapter removal and trimming are performed using bbduk. The reads are then aligned to the Wuhan reference genome using minimap2, followed by primer trimming using iVar. We next create a pileup file using samtools and use it as input to create a consensus file. The consensus is generated with iVar using a minimum depth of 10 reads and majority rule for base calling. We then use iVar to call variants from the pileup file, with the frequency threshold for calling a mutation set to 0.01. A mutation is therefore called at any locus where at least one percent of the reads are non-reference. This very low threshold allows us to capture essentially all of the variation seen in the sequencing data, including low-frequency alleles. The list of variants is then annotated with the gene and amino acid change (if any), whether the mutation is considered defining for any SARS-CoV-2 variant, and whether it is seen in only one variant.
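The two thresholds described above can be illustrated with a minimal Python sketch. It is not the pipeline itself, which uses iVar on a samtools pileup. It only shows how a minimum depth of 10 and a frequency threshold of 0.01 would act on the base counts at a single position. The function names, the dictionary of base counts, and the tie-breaking behavior are assumptions made for illustration.

```python
MIN_DEPTH = 10    # minimum read depth for a consensus call (value from the text)
MIN_FREQ = 0.01   # minimum read fraction for a variant call (value from the text)


def consensus_base(counts, min_depth=MIN_DEPTH):
    """Return the most frequent base at one position, or 'N' if coverage is too low.

    `counts` maps a base to the number of reads supporting it (hypothetical input
    format). Ties are broken by dictionary order in this sketch; iVar may differ.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"
    return max(counts, key=counts.get)


def call_variants(ref_base, counts, min_freq=MIN_FREQ):
    """Return (base, frequency) for each non-reference base at or above min_freq."""
    depth = sum(counts.values())
    if depth == 0:
        return []
    calls = []
    for base, n in sorted(counts.items()):
        freq = n / depth
        if base != ref_base and freq >= min_freq:
            calls.append((base, freq))
    return calls


# Illustrative counts only (not taken from the dataset): 1000 reads at one position.
example = {"A": 985, "G": 12, "T": 3}
print(consensus_base(example))       # consensus base for this position
print(call_variants("A", example))   # non-reference bases passing the 0.01 threshold
```

In this made-up example, the G allele (12 of 1000 reads) passes the one percent threshold while the T allele (3 of 1000 reads) does not, and the consensus remains the majority base.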

This dataset includes the FASTA consensus sequence and the mutation calls for each genome.

Usage notes

The dataset consists of FASTA files and tab-delimited files. The FASTA files can be opened with any text editor, and the mutation files with Excel or any other spreadsheet program.
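For programmatic access, both file types can also be read with the Python standard library. The following is a minimal sketch: the function names and the file names in the usage comments are hypothetical, and the code makes no assumption about the column names in the mutation files, taking them from each file's header row instead.

```python
import csv


def parse_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {sequence name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def parse_mutation_table(lines):
    """Parse tab-delimited lines into a list of dicts keyed by the header row."""
    return list(csv.DictReader(lines, delimiter="\t"))


# Usage (file names are placeholders, not actual names from the dataset):
#   with open("genome.fasta") as fh:
#       sequences = parse_fasta(fh)
#   with open("genome_mutations.tsv", newline="") as fh:
#       mutations = parse_mutation_table(fh)
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.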


Rockefeller Foundation, Award: 0889.0101