Comparative ecological analysis and predictive modeling of tick-borne pathogens
Data files
Apr 29, 2024 version files 4.63 MB
Abstract
Tick-borne diseases constitute the predominant vector-borne health threat in North America. Recent observations have noted a significant expansion in the range of the black-legged tick (Ixodes scapularis Say, Acari: Ixodidae), alongside a rise in the incidence of diseases caused by its vectored pathogens: Borrelia burgdorferi (Spirochaetales: Spirochaetaceae), Babesia microti (Piroplasmida: Babesiidae), and Anaplasma phagocytophilum (Rickettsiales: Anaplasmataceae), the causative agents of Lyme disease, babesiosis, and anaplasmosis, respectively. Prior research identified environmental features that influence the ecological dynamics of I. scapularis and B. burgdorferi that can be used to predict the distribution and abundance of these organisms, and thus Lyme disease risk. In contrast, there is a paucity of research into the environmental determinants of B. microti and A. phagocytophilum. Here we use over a decade of surveillance data to model the impact of environmental features on the infection prevalence of these increasingly common human pathogens in ticks across New York State (NYS). Our findings reveal a consistent northward and westward expansion of B. microti in NYS from 2009 to 2019, while the range of A. phagocytophilum varied at fine spatial scales. We constructed biogeographic models using data from over 1000 site-year visits and encompassing more than 250 environmental variables to accurately forecast infection prevalence for each pathogen to future years that were not included in model training. Several environmental features were identified to have divergent effects on the pathogens, revealing potential ecological differences governing their distribution and abundance. These validated biogeographic models are immediately useful for disease prevention efforts.
README: Comparative ecological analysis and predictive modeling of tick-borne pathogens
https://doi.org/10.5061/dryad.0zpc8675c
We have submitted our raw tick collection and environmental data (DIN_All.csv), code for training and evaluating biogeographic models (Pathogen_Scripts.py), and supplementary data describing features used by trained models (Pathogen_Sup_Table.csv). Required packages can be found in pathogen.yml.
Description of contained files
DIN_All.csv
Contained are tick collection data including
- Nymph - Number of nymphal Ixodes scapularis ticks collected,
- BB_Pos - Number of ticks infected with Borrelia burgdorferi,
- BM_Pos - Number of ticks infected with Babesia microti, and
- AP_Pos - Number of ticks infected with Anaplasma phagocytophilum.
- Infection prevalence for each pathogen is the number of nymphs positive for that pathogen divided by the total number of nymphs tested.
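As a minimal sketch of the prevalence calculation described above, the snippet below divides each positive count by the number of nymphs in the same row. It assumes the column names listed above appear verbatim as CSV headers and that Nymph can stand in for the number of nymphs tested. The sample rows are invented for illustration and are not taken from DIN_All.csv.

```python
import csv
import io

# Pathogen count columns as named in this README.
PATHOGEN_COLUMNS = ("BB_Pos", "BM_Pos", "AP_Pos")


def infection_prevalence(row, pathogen_column, total_column="Nymph"):
    """Return positives / total for one row, or None if it is undefined."""
    positives = (row.get(pathogen_column) or "").strip()
    total = (row.get(total_column) or "").strip()
    if not positives or not total:
        return None  # blank cell: data point unavailable
    total_count = float(total)
    if total_count == 0:
        return None  # no nymphs, so prevalence is undefined
    return float(positives) / total_count


# Hypothetical rows for illustration only (not real surveillance data).
SAMPLE_CSV = """Nymph,BB_Pos,BM_Pos,AP_Pos
50,10,5,2
40,8,,4
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
prevalence = [
    {column: infection_prevalence(row, column) for column in PATHOGEN_COLUMNS}
    for row in rows
]
```

To apply the same function to the deposited data, the in-memory sample can be swapped for a file handle such as open("DIN_All.csv", newline="").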
Also contained are geographic, climatic, and ecological features that were used to model populations of pathogens. Some important features include LAT - Latitude (degrees), LONG - Longitude (degrees), year, month, and week of collection, DeerHarvest - an estimate of deer population size (units: deer harvested at county level), and elevation (meters). Climatic data include monthly temperature (degrees Celsius), precipitation (mm), and vapor pressure (hPa). A complete description of the features and the methodology used to curate the data set can be found in Tran et al. (2021).
Note: Blank cells in the dataset represent instances where a data point was unavailable for a given site-year. Analyses in Pathogen_Scripts.py use a modeling framework that can accommodate missing data (CatBoost), so the dataset can be used as is.
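One way to carry blank cells through to a model is to parse them as NaN, which CatBoost treats as missing for numeric features. The sketch below illustrates that parsing step only. The header names LAT and elevation are assumed to match the file, and the values are invented.

```python
import csv
import io
import math


def parse_feature(cell):
    """Convert a CSV cell to float, mapping blank cells to NaN."""
    text = (cell or "").strip()
    return float(text) if text else math.nan


# Hypothetical rows for illustration; the second row has a blank elevation.
SAMPLE_CSV = """LAT,elevation
42.5,120
43.1,
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
feature_names = list(reader.fieldnames)
features = [[parse_feature(row[name]) for name in feature_names] for row in reader]

# Count missing values per column to see how sparse each feature is.
missing_counts = {
    name: sum(math.isnan(values[i]) for values in features)
    for i, name in enumerate(feature_names)
}
```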
Pathogen_Scripts.py
Contained is the Python script that uses data from DIN_All.csv to train gradient-boosted models that learn associations between environmental variables and pathogen populations. scikit-learn is used to generate random forests for feature selection and to assess model accuracy. CatBoost is used with default hyperparameters to generate the final regression models from the selected features.
To run the script, first install the dependencies listed in pathogen.yml. This can be done using Anaconda 3.
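The workflow described above (random-forest feature selection, then a CatBoost regressor with default hyperparameters, evaluated on years held out of training) might be sketched as follows. This is not the code in Pathogen_Scripts.py. The helper names, the cutoff k, the "year" key, and the median fill used for the forest step are all assumptions made for illustration.

```python
def split_by_year(rows, first_test_year):
    """Temporal hold-out: train on earlier years, test on later ones."""
    train = [r for r in rows if r["year"] < first_test_year]
    test = [r for r in rows if r["year"] >= first_test_year]
    return train, test


def top_k_features(names, importances, k):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def fit_two_stage(X_train, y_train, names, k=20):
    """Random-forest feature selection followed by a default CatBoost fit.

    X_train is a 2-D NumPy array that may contain NaN; y_train is 1-D.
    k is a hypothetical cutoff, not a value taken from Pathogen_Scripts.py.
    """
    # Imported lazily so the helpers above work without these packages.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from catboost import CatBoostRegressor

    # Assumption: median-fill NaN for the forest step only, because older
    # scikit-learn forests reject missing values.
    medians = np.nanmedian(X_train, axis=0)
    X_filled = np.where(np.isnan(X_train), medians, X_train)

    forest = RandomForestRegressor(random_state=0).fit(X_filled, y_train)
    selected = top_k_features(names, forest.feature_importances_, k)
    columns = [names.index(name) for name in selected]

    # CatBoost receives the unfilled data and handles NaN itself.
    model = CatBoostRegressor(verbose=0).fit(X_train[:, columns], y_train)
    return model, selected
```

Splitting by year rather than at random keeps the evaluation aligned with the stated goal of forecasting to future years that were not seen during training.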
pathogen.yml
Contained are the versions of packages used for analyses. The correct Python version to run scripts is included. Packages can be installed using Anaconda 3.
Pathogen_Sup_Table.csv
Contained are all features and feature importances used by the models. Features were selected using scikit-learn's random forest (see Pathogen_Scripts.py for details of the analysis). A complete description of included features can be found in Tran et al. (2021). Feature importances were calculated automatically during the CatBoost training process (also contained in Pathogen_Scripts.py).
References
Tran T, Prusinski MA, White JL, Falco RC, Vinci V, Gall WK, Tober K, Oliver J, Sporn LA, Meehan L, Banker E, Backenson PB, Jensen ST, Brisson D (2021). Spatio-temporal variation in environmental features predicts the distribution and abundance of Ixodes scapularis. International Journal for Parasitology 51, 311–320.