Improving distribution models of sparsely-documented disease vectors by incorporating information on related species via joint modeling
Data files
Apr 10, 2024 version files 1.73 MB
- county_pops.csv (1.51 KB)
- distance_to_pathogen_fl.xlsx (15.68 KB)
- FL_case_data.xlsx (30.72 KB)
- fl_percent_insured.csv (5.70 KB)
- README.md (1.81 KB)
- vector_data.csv (1.67 MB)
Abstract
A necessary component of understanding vector-borne disease risk is the accurate characterization of the distributions of their vectors. Species distribution models have been successfully applied to data-rich species but may produce inaccurate results for sparsely-documented vectors. In light of global change, vectors that are currently not well-documented could become increasingly important, requiring tools to predict their distributions. One way to achieve this could be to leverage data on related species to inform the distribution of a sparsely-documented vector based on the assumption that the environmental niches of related species are not independent. Relatedly, there is a natural dependence of the spatial distribution of a disease on the spatial distribution of its vector. Here, we propose to exploit these correlations by fitting a hierarchical model jointly to data on multiple vector species and their associated human diseases to improve distribution models of sparsely-documented species. To demonstrate this approach, we evaluated the ability of twelve models (which differed in their pooling of data from multiple vector species and inclusion of disease data) to improve distribution estimates of sparsely-documented vectors. We assessed our models on two simulated data sets, which allowed us to generalize our results and examine their mechanisms. We found that when the focal species is sparsely documented, incorporating data on related vector species reduces uncertainty and improves accuracy by reducing overfitting. When data on vector species are already incorporated, disease data only marginally improve model performance. However, when data on other vectors are not available, disease data can improve model accuracy and reduce overfitting and uncertainty. We then assessed the approach on empirical data on ticks and tick-borne diseases in Florida and found that incorporating data on other vector species improved model performance. This study illustrates the value of exploiting correlated data via joint modeling to improve distribution models of data-limited species.
https://doi.org/10.5061/dryad.jwstqjqhq
Description of the data and file structure
- vector_data.csv: presence/absence locations of vectors with corresponding scaled environmental covariates (please see methods for sources for covariate data).
- FL_case_data.xlsx: human disease data. Sheets 1-3 contain annual county-level count data for each disease. Sheets 4-6 contain county-level presence data for each disease. The order of counties and years reported for each disease is given in the “meta” sheet.
- distance_to_pathogen_fl.xlsx: the number of cases reported in pets within each county over the study period, and a binary classification of whether the pathogen is circulating in that county. d_circ is the distance to the closest county in which the disease is circulating (based on the county midpoint).
- county_pops.csv: county population sizes
- fl_percent_insured.csv: county insurance coverage
Sharing/Access information
Data and code for analysis are available on GitHub (link in Related Works).
Data were derived from the following sources:
- http://vectormap.si.edu
- https://www.inaturalist.org
- https://www.floridahealth.gov/diseases-and-conditions/tick-and-insect-borne-diseases/tick-surveillance.html
- https://capcvet.org/maps#/
- www.countyhealthrankings.org
- www.worldpop.org
Code/Software
R, RStan
Vector data
Vector presence data were obtained from VectorMap and iNaturalist. Only iNaturalist data considered “research grade” were included, and we removed duplicates. To obtain absence data, we referenced VectorMap publications and assumed that a species was absent at a sampling location if it was included within the study but not reported at that location. To avoid conflating low sampling effort with low vector presence, we based pseudo-absence locations on presence locations of chiggers, fleas, and mites from both databases and the Global Biodiversity Information Facility. We used a 1:1 ratio of presence to absence points, which produces the most accurate predicted distribution for regression techniques (Barbet-Massin et al., 2012).
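The 1:1 presence-to-absence ratio described above can be illustrated with a minimal Python sketch. It is not the code used in the study (the analysis was run in R); the function name `sample_pseudo_absences`, the coordinates, and the uniform random draw from background sites are all assumptions made for illustration.

```python
import random


def sample_pseudo_absences(presences, background, ratio=1.0, seed=0):
    """Draw pseudo-absences from background sites at a presence:absence ratio.

    `background` stands in for locations where other arthropods (for example
    chiggers, fleas, or mites) were recorded, so sampling effort is known to
    be non-zero there. Uniform sampling is an assumption of this sketch.
    """
    presence_set = set(presences)
    # Keep unique background sites where the focal species was not recorded.
    candidates = [pt for pt in dict.fromkeys(background) if pt not in presence_set]
    n_absences = round(len(presences) * ratio)
    if n_absences > len(candidates):
        raise ValueError("not enough background sites for the requested ratio")
    rng = random.Random(seed)
    return rng.sample(candidates, n_absences)


# Hypothetical (latitude, longitude) pairs, not taken from vector_data.csv.
presences = [(29.65, -82.32), (30.44, -84.28), (27.95, -82.46)]
background = [
    (29.65, -82.32),  # overlaps a presence, so it is excluded
    (28.54, -81.38),
    (26.12, -80.14),
    (30.33, -81.66),
    (25.76, -80.19),
]
absences = sample_pseudo_absences(presences, background)
```

With `ratio=1.0` the function returns as many pseudo-absence points as there are presence points, and never a site where the focal species was recorded.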
We artificially subsampled one species in our empirical data (A. maculatum) to mimic sparse documentation, including 20% of its available presence-absence data in our training set and withholding the rest for testing. This artificial sparse sampling provided a robust testing data set with which to evaluate model performance. To ensure spatial independence between our training and testing data, data were split using the blockCV package (Valavi et al., 2018) in R Version 2023.03.0+386 (R Core Team, 2023). To test the limitations of incorporating disease data, we selected as our focal species a vector that does not transmit any of the diseases within our model. Empirical sample sizes are given in Supp Table 2.
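The idea behind a spatially blocked split can be sketched in a few lines of Python. This is a simplified stand-in, not the blockCV algorithm: it assigns points to square latitude/longitude blocks and moves whole blocks into the training set until roughly 20% of the points are included. The block size, the function names, and the random coordinates are assumptions for illustration.

```python
import math
import random


def block_id(lat, lon, block_deg):
    """Index of the square grid cell (in degrees) containing a point."""
    return (math.floor(lat / block_deg), math.floor(lon / block_deg))


def spatial_block_split(points, block_deg=0.5, train_frac=0.2, seed=0):
    """Split points into train/test so that no grid block is shared."""
    blocks = {}
    for pt in points:
        blocks.setdefault(block_id(pt[0], pt[1], block_deg), []).append(pt)
    ids = sorted(blocks)
    random.Random(seed).shuffle(ids)
    target = train_frac * len(points)
    train, test = [], []
    for b in ids:
        # Whole blocks go to training until about train_frac of points is reached.
        if len(train) < target:
            train.extend(blocks[b])
        else:
            test.extend(blocks[b])
    return train, test


# Hypothetical sampling locations scattered over a Florida-sized bounding box.
_rng = random.Random(1)
points = [(_rng.uniform(25.0, 31.0), _rng.uniform(-87.0, -80.0)) for _ in range(200)]
train, test = spatial_block_split(points)
```

Because whole blocks are assigned to one side of the split, no test point falls in a grid cell that also contains a training point; the training fraction is therefore only approximately 20%.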
Human disease data
We obtained annual incidence data on three human diseases (anaplasmosis, ehrlichiosis, Lyme disease) from 2011 to 2019 for each county from the Florida Department of Health. We translated this into human disease presence data in a given county in a given year based on whether the annual incidence there was greater than zero.
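As a small illustration of this conversion, the following Python sketch maps annual county-level counts to presence (1) or absence (0). The dictionary layout and the case counts are hypothetical and do not reflect the sheet structure of FL_case_data.xlsx.

```python
def incidence_to_presence(counts):
    """Map {(county, year): annual case count} to {(county, year): 0 or 1}."""
    # A disease is treated as present wherever the annual count exceeds zero.
    return {key: int(n > 0) for key, n in counts.items()}


# Hypothetical counts for illustration only.
counts = {("Alachua", 2011): 0, ("Alachua", 2012): 3, ("Baker", 2011): 1}
presence = incidence_to_presence(counts)
```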
Covariate data
We modeled vector distributions as a function of environmental covariates that have been linked to tick presence: land cover (Randolph, 2000), 30-year average maximum temperature (Ogden et al., 2020), 30-year average precipitation (Ogden et al., 2020), regional Palmer hydrological drought index (Jones and Kitron, 2000), normalized difference vegetation index (Randolph, 2000), and distance to the nearest waterbody (Kahl and Alidousti, 1997). We obtained land cover data from the Global Land Cover Characteristics Database (Loveland et al., 2000), 30-year average climate data from WorldClim (Fick and Hijmans, 2017), Palmer Hydrological Drought Index data from NOAA (Bushra and Rohli, 2017), and Normalized Difference Vegetation Index data from USGS Landsat (Vermote et al., 2016). Finally, we obtained waterbody data from the World Wildlife Fund’s Global Lakes and Wetlands Database (McGwire and Fisher, 2001).
Pathogen circulation was based on Companion Animal Parasite Council data, which report seroprevalence in canines receiving veterinary treatment. To avoid considering imported cases as indicative of endemicity, we considered a threshold of five annual cases to signal transmission.
Finally, to account for under-reporting (Madison-Antenucci et al., 2020), we modeled reporting probability as a function of health insurance coverage and population size. Insurance data were obtained from County Health Rankings (www.countyhealthrankings.org), and population data were obtained from WorldPop (www.worldpop.org).
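The circulation threshold and the distance-to-circulation covariate (d_circ in distance_to_pathogen_fl.xlsx) can be sketched as follows. Several details are assumptions of this sketch rather than statements from the source: the threshold is read as "at least five annual cases", distances are great-circle (haversine) distances between county midpoints, and a county where the pathogen circulates is assigned a distance of zero. County labels, coordinates, and case numbers are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def circulation_flags(annual_cases, threshold=5):
    """Flag counties as circulating; '>= threshold' is an assumed reading."""
    return {county: n >= threshold for county, n in annual_cases.items()}


def distance_to_circulation(midpoints, circulating):
    """Distance from each county midpoint to the nearest circulating county."""
    sources = [midpoints[c] for c, flag in circulating.items() if flag]
    if not sources:
        raise ValueError("no county meets the circulation threshold")
    # A circulating county is its own nearest source, giving a distance of 0.
    return {c: min(haversine_km(pt, s) for s in sources)
            for c, pt in midpoints.items()}


# Hypothetical counties, midpoints, and annual canine case counts.
midpoints = {"A": (30.0, -82.0), "B": (30.0, -83.0), "C": (28.0, -82.0)}
annual_cases = {"A": 12, "B": 2, "C": 0}
circulating = circulation_flags(annual_cases)
d_circ = distance_to_circulation(midpoints, circulating)
```

In this toy example only county A meets the threshold, so the distances for B and C are measured to A's midpoint.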
Simulated data
Our first simulation generates data for three well-documented species (A. americanum, A. maculatum, D. variabilis) and a single sparsely-documented species (I. scapularis). “Well-documented” is defined as 500 samples and “sparsely-documented” is defined as 30 samples (Supp Figure 2). Our second simulation treats all four species as well-documented (Supp Table 3).
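To make the two simulation designs concrete, here is a hedged Python sketch. Only the species names and the sample sizes (500 and 30) come from the description above. The generative process is an assumption chosen to mimic related environmental niches: species-level coefficients are drawn around a shared mean, covariates are standard normal, and presence is drawn through a logistic link. The number of covariates and the spread `niche_sd` are hypothetical, and the study's actual simulation may differ.

```python
import math
import random

# Sample sizes stated in the text: 500 = well-documented, 30 = sparse.
SAMPLE_SIZES_SIM1 = {
    "A. americanum": 500,
    "A. maculatum": 500,
    "D. variabilis": 500,
    "I. scapularis": 30,
}
SAMPLE_SIZES_SIM2 = {species: 500 for species in SAMPLE_SIZES_SIM1}


def simulate(sample_sizes, n_cov=3, niche_sd=0.3, seed=0):
    """Simulate presence/absence data for species with correlated niches."""
    rng = random.Random(seed)
    # Assumed shared niche: mean covariate effects common to all species.
    shared = [rng.gauss(0, 1) for _ in range(n_cov)]
    data = {}
    for species, n in sample_sizes.items():
        # Species effects deviate slightly from the shared mean.
        beta = [m + rng.gauss(0, niche_sd) for m in shared]
        rows = []
        for _ in range(n):
            x = [rng.gauss(0, 1) for _ in range(n_cov)]
            p = 1 / (1 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
            rows.append((x, int(rng.random() < p)))
        data[species] = rows
    return data


sim1 = simulate(SAMPLE_SIZES_SIM1)
sim2 = simulate(SAMPLE_SIZES_SIM2)
```

Each entry of `sim1` and `sim2` is a list of `(covariates, presence)` pairs, so the sparse species in the first design contributes 30 rows while every other species contributes 500.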