Clinical surveillance identifies SARS-CoV-2 outbreaks and emergence of novel variants in real-time
Data files
Oct 30, 2025 version files (291.98 MB total)
README.md: 2.94 KB
Surveillance_data_v2.csv: 291.97 MB
Abstract
Monitoring community health and tracking SARS-CoV-2 evolution were critical priorities throughout the COVID-19 pandemic. However, widespread shortages of personal protective equipment, the necessity for social distancing, and the redeployment of healthcare personnel to clinical duties presented significant barriers to traditional sample collection. In this study, we evaluated the feasibility of using self-collected saliva specimens for the qualitative detection of SARS-CoV-2 infection. Following confirmation of reliable viral detection in saliva, we established a large-scale surveillance program in Arizona, USA, to enable clinical diagnosis and genomic sequencing from self-collected samples. Between April 2020 and December 2023, we tested approximately 1.4 million saliva samples using RT-PCR, identifying 94,330 SARS-CoV-2 infections. Whole genome sequencing was performed on 69,595 samples, yielding 54,040 high-quality consensus genomes. This surveillance approach enabled real-time monitoring of infection trends, outbreak detection within specific populations, and the identification of novel viral lineages over the course of the pandemic. The co-location of clinical testing and sequencing capabilities within the same facility significantly reduced turnaround time from the identification of positive cases to the generation of sequencing data. Our findings support the use of self-collected saliva as a scalable, cost-effective, and practical strategy for infectious disease surveillance in future pandemics.
Access this dataset at Dryad: DOI: 10.5061/dryad.z08kprrsh
This dataset includes RT-PCR testing data and sequencing data for the ABCTL SARS-CoV-2 testing and sequencing project performed at ASU.
Data and file structure:
Surveillance_data_v2.csv: RT-PCR testing data and NGS sequencing data
In all fields, cells where data was missing, unavailable, or inapplicable have been filled with 'Blank'. This value should not be interpreted as a participant response, metadata characteristic, test result, or cohort descriptor. We recommend replacing it with NA/NaN/blank before analysis.
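As a minimal sketch of that recommendation, the snippet below reads rows with Python's standard `csv` module and maps every 'Blank' cell to `None`. The helper name `read_rows` and the sample values are hypothetical and serve only as illustration; the column names come from the list in this README.

```python
import csv
import io

MISSING_TOKEN = "Blank"  # placeholder value described in this README


def read_rows(handle):
    """Yield each CSV row as a dict, with 'Blank' cells mapped to None."""
    for row in csv.DictReader(handle):
        yield {col: (None if val == MISSING_TOKEN else val)
               for col, val in row.items()}


# Tiny in-memory sample; the values are invented for illustration only.
sample = io.StringIO(
    "sample_id,N gene,S gene\n"
    "tube_001,24.1,Blank\n"
    "tube_002,Blank,Blank\n"
)
rows = list(read_rows(sample))
```

For the real file, the same function can be applied to `open("Surveillance_data_v2.csv", newline="")`. Readers who prefer pandas can get an equivalent effect with `pandas.read_csv(path, na_values=["Blank"])`.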
For RT-PCR values: during the study period, primary qPCR testing migrated from the TaqPath COVID-19 Combo Kit (hereafter TaqPath 1.0) to the TaqPath COVID-19 Fast PCR Combo Kit 2.0 (hereafter TaqPath 2.0). Ct values prefixed with "Seq_" were obtained with TaqPath 1.0 assays run on samples before sequencing. After the transition to TaqPath 2.0, samples testing positive were run on TaqPath 1.0 to ease future comparisons and to test for S gene target failure.
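One way to screen TaqPath 1.0 Ct values for S gene target failure is sketched below. It rests on several assumptions that are not stated in this README: that target failure means the S target has no Ct while ORF1ab and N are both detected, that an undetected target appears as 'Blank' or another non-numeric value, and that a Ct cutoff of 30 is a reasonable guard against weak samples. The function names `to_ct` and `is_sgtf` and the `max_ct` parameter are hypothetical.

```python
def to_ct(value):
    """Convert a raw cell to a float Ct, or None when missing or non-numeric."""
    if value is None or value == "Blank":
        return None
    try:
        return float(value)
    except ValueError:
        return None


def is_sgtf(orf1ab, n, s, max_ct=30.0):
    """Flag a sample as S gene target failure (assumed definition).

    ORF1ab and N must both have a Ct at or below max_ct (an assumed
    cutoff), and the S target must have no Ct at all.
    """
    orf1ab_ct, n_ct, s_ct = to_ct(orf1ab), to_ct(n), to_ct(s)
    if orf1ab_ct is None or n_ct is None:
        return False
    return s_ct is None and orf1ab_ct <= max_ct and n_ct <= max_ct


# Invented values for illustration only.
flag_dropout = is_sgtf("21.3", "22.0", "Blank")
flag_all_targets = is_sgtf("21.3", "22.0", "21.8")
flag_weak = is_sgtf("34.0", "22.0", "Blank")
```

The arguments could be taken from the Seq_ORF1ab_ct, Seq_N_ct, and Seq_S_ct columns, or from the ORF1ab, N gene, and S gene columns for samples tested on TaqPath 1.0. The cutoff and the encoding of undetected targets should be checked against the data before use.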
Clarification of database columns:
sample_id: randomized tube identifier
participant_id: randomized patient identifier
SEX: participant's self-identified sex
RACE: participant's self-identified race
ethnicity: participant's self-identified ethnicity bin
pt_age: participant's age bin
test_invitation: whether a sample was included in the invited testing comparison
collection_week: surveillance week in which the sample was collected
pcr_turnaround_hrs: hours between sample registration and results returned
qpcr_plateid: plate identifier for RT-PCR testing
N gene: TaqPath 1.0 and 2.0 Ct values for N gene locus
ORF1ab: TaqPath 1.0 Ct values for ORF1ab locus
S gene: TaqPath 1.0 Ct values for S gene locus
MS2: TaqPath 1.0 internal control
ORF1a: TaqPath 2.0 Ct values for ORF1a locus
ORF1b: TaqPath 2.0 Ct values for ORF1b locus
RNase P: TaqPath 2.0 internal control
RESULT: RT-PCR assay result
Sequencing_week: surveillance week in which sample sequencing was performed
Seq_plate: plate identifier for WGS
GISAID_turnaround_days: days elapsed from sample registration to GISAID upload
GISAID Accession: GISAID sample accession identifier
Sequence name: GISAID sequence name
GISAID_lineage: PANGO lineage assigned by GISAID
Major_lineage: major PANGO lineage of sample genome
Genome ambiguity: percentage of ambiguous nucleotides in consensus genome
Genome_coverage: genome coverage obtained by WGS
Seq_ORF1ab_ct: TaqPath 1.0 Ct values at ORF1ab locus for samples used in sequencing
Seq_N_ct: TaqPath 1.0 Ct values at N gene locus for samples used in sequencing
Seq_S_ct: TaqPath 1.0 Ct values at S gene locus for samples used in sequencing
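To illustrate how the sequencing columns above might be combined, the following sketch tallies sequenced samples per surveillance week and major lineage using only the standard library. The function name `lineage_counts_by_week`, the week format, and the lineage labels in the sample are assumptions made for illustration and may not match the values in the file.

```python
import csv
import io
from collections import Counter


def lineage_counts_by_week(handle):
    """Count rows per (Sequencing_week, Major_lineage), skipping 'Blank' cells."""
    counts = Counter()
    for row in csv.DictReader(handle):
        week = row["Sequencing_week"]
        lineage = row["Major_lineage"]
        if "Blank" in (week, lineage):
            continue  # sample was not sequenced or has no lineage call
        counts[(week, lineage)] += 1
    return counts


# Tiny in-memory sample; the values are invented for illustration only.
sample = io.StringIO(
    "sample_id,Sequencing_week,Major_lineage\n"
    "tube_001,2021-W50,Delta\n"
    "tube_002,2021-W50,Delta\n"
    "tube_003,2021-W50,Omicron\n"
    "tube_004,Blank,Blank\n"
)
counts = lineage_counts_by_week(sample)
```

Because the full file is roughly 292 MB, streaming rows through `csv.DictReader` in this way avoids loading the whole table into memory at once.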
Sharing/Access information:
Contact authors for additional information.
