A method for characterizing disease emergence curves from paired pathogen detection and serology data
Abstract
Wildlife disease surveillance programs and research studies track infection and identify risk factors for wild populations, humans, and agriculture. Often, several types of samples are collected from individuals to provide more complete information about an animal's infection history. Methods that jointly analyze multiple data streams to study disease emergence and drivers of infection via epidemiological process models remain underdeveloped. Joint-analysis methods can more thoroughly analyze all available data, more precisely quantifying epidemic processes, outbreak status, and risks. We contribute a paired data modeling approach that analyzes multiple samples from individuals. We use "characterization maps" to link paired data to epidemiological processes through a hierarchical statistical observation model. Our approach can provide both Bayesian and frequentist estimates of epidemiological parameters and states. Our approach can also incorporate test sensitivity and specificity, and we propose model-fit diagnostics. We motivate our approach through the need to use paired pathogen and antibody detection tests to estimate parameters and infection trajectories for the widely applicable susceptible, infectious, recovered (SIR) model. We contribute general formulas to link characterization maps to arbitrary process models and datasets and an extended SIR model that better accommodates paired data. We find via simulation that paired data can more efficiently estimate SIR parameters than unpaired data, requiring samples from 5-10 times fewer individuals. We use our method to study SARS-CoV-2 in wild White-tailed deer (Odocoileus virginianus) from three counties in the United States. Estimates for average infectious times corroborate captive animal studies. The estimated average cumulative proportion of infected deer across the three counties is 73%, and the basic reproductive number (R0) is 1.88. 
Wildlife disease surveillance programs and research studies can use our methods to jointly analyze paired data to estimate epidemiological process parameters and track outbreaks. Paired data analyses can improve precision and accuracy when sampling is limited. Our methods use general statistical theory to let applications extend beyond the SIR model we consider, and to more complicated examples of paired data. The methods can also be embedded in larger hierarchical models to provide landscape-scale risk assessment and identify drivers of infection.
README: A method for characterizing disease emergence curves from paired pathogen detection and serology data
Reproducibility strategy
The targets workflow manager for R organizes the analysis. A thorough tutorial and a quick overview are available to learn targets. The targets package can make it easier to create and store project artifacts, such as pre-processed datasets, fitted models, diagnostic and predictive output, and tables and figures. However, the tutorial describes ideal workflows that do not necessarily scale well to very large projects with many computationally expensive steps. So, the repository's use of the targets package will occasionally deviate from the tutorial's demonstration workflows.
The targets package creates and manipulates artifacts through a series of user-defined "target" objects. Each target is a code chunk wrapped in a call to the function targets::tar_target(); see Chapter 6 of the package's tutorial for examples. A benefit of the targets package is that each target's code chunk is executed in a clean working environment, which can reduce the risk of errors and memory issues caused by variable name conflicts and workspace clutter from temporary objects. The targets package will also (almost always) automatically identify when a target's code chunk refers to one or more other targets by name, load the outputs of those targets, and make those outputs available within the current code chunk's execution environment.
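As a rough illustration of the pattern described above, the sketch below defines two targets in which the second refers to the first by name. The helper functions and target names are hypothetical and do not come from this repository; only targets::tar_target() is the package's own function.

```r
# Hypothetical helpers for illustration only; not part of this repository.
make_toy_tests <- function(n) {
  data.frame(id = seq_len(n), positive = rep(c(TRUE, FALSE), length.out = n))
}

summarize_tests <- function(tests) {
  # Proportion of positive results
  mean(tests$positive)
}

# The guard lets this sketch be sourced even if targets is not installed.
if (requireNamespace("targets", quietly = TRUE)) {
  pipeline <- list(
    # First target: builds a toy dataset
    targets::tar_target(name = toy_tests, command = make_toy_tests(10)),
    # Second target: referring to `toy_tests` by name makes this target
    # depend on the one above
    targets::tar_target(name = toy_prevalence, command = summarize_tests(toy_tests))
  )
}
```

In an actual _targets.R script, the list of targets would be the final expression in the file rather than being assigned to a variable.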
The workflow here typically saves outputs to an automatically generated output/ project subdirectory.
For targets that save output to disk, returning the file or directory name wrapped in a list with a timestamp makes it easier to 1) tell when the target was last built (as an alternative to using targets::tar_timestamp()), and 2) trigger downstream targets to rebuild even if the path or filename returned by the triggering target is unchanged. The latter case is important because the triggering target may only return a path or filename, and the contents on disk may have changed even if the path or filename has not.
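A minimal sketch of this return-value pattern follows, assuming the output is a single .rds file. The function name save_with_timestamp() and the file name are hypothetical; the repository's own targets may structure their return values differently.

```r
# Hypothetical helper: save an object to disk and return its path together
# with a build timestamp, so the returned value differs on every rebuild.
save_with_timestamp <- function(object, path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(object, path)
  list(path = path, timestamp = Sys.time())
}

# Example use outside of a pipeline, writing to a temporary directory
out <- save_with_timestamp(
  object = data.frame(x = 1:3),
  path = file.path(tempdir(), "output", "example.rds")
)

# A downstream step would read the file back from the returned path
restored <- readRDS(out$path)
```

Because the timestamp is part of the returned list, the target's stored value changes whenever the target is rebuilt, even when the path stays the same.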
Data
Data in the subdirectory data/ report positive/negative test results for swab and serology samples taken from hunter-harvested deer in each of the three counties. Swabs were tested for SARS-CoV-2 using real-time reverse-transcriptase polymerase chain reaction (rRT-PCR), and serology samples were tested for SARS-CoV-2 antibodies using a surrogate virus neutralization test (sVNT). Each pair of samples (i.e., each row) was collected from the same deer. The county names and exact data collection dates are masked. The first sample collected from each county is assigned HarvestDate=0. The collection dates for subsequent samples represent the number of days past the first sample collection date in the county.
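To make the date masking concrete, the sketch below computes a days-since-first-sample offset per county and cross-tabulates the paired results. All values are made up for illustration. HarvestDate is the only column name taken from this README; county, date, swab, and serology are hypothetical names, and the actual files may be laid out differently.

```r
# Toy records with made-up values (not the study data)
toy <- data.frame(
  county   = c("A", "A", "A", "B", "B"),
  date     = as.Date(c("2000-01-01", "2000-01-04", "2000-01-10",
                       "2000-02-05", "2000-02-06")),
  swab     = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  serology = c(FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Days since the first sample collected in the same county
first_date <- ave(as.numeric(toy$date), toy$county, FUN = min)
toy$HarvestDate <- as.numeric(toy$date) - first_date

# Each row is one deer, so the paired outcomes can be cross-tabulated directly
paired_counts <- table(swab = toy$swab, serology = toy$serology)
```

The cross-tabulation keeps the pairing intact: each cell counts deer with a particular combination of swab and serology results, rather than summarizing the two tests separately.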
Steps to reproduce the simulation
Code and data files are written in the R programming language.
- Run make_simulation.R to prepare simulation data
- Run the SLURM job Rscripts/sim_fit/dothescience.job on an HPC system to fit the model to simulated data
- Copy the output/sim_fit folder to your local machine, for post-processing
Steps to reproduce analysis and figures
Run make.R to prepare data for analysis and run main analyses. Figures and tables will be saved to an automatically generated output/ directory.