
A machine learning approach to integrating genetic and ecological data in tsetse flies (Glossina pallidipes) for spatially explicit vector control planning

Cite this dataset

Bishop, Anusha et al. (2021). A machine learning approach to integrating genetic and ecological data in tsetse flies (Glossina pallidipes) for spatially explicit vector control planning [Dataset]. Dryad.


Introduction - Control of vector populations is an effective strategy for addressing vector-borne disease transmission. Effective vector control requires knowledge of habitat use and connectivity. Our goal was to improve this knowledge for the tsetse species Glossina pallidipes, a vector of animal African trypanosomiasis, which is a wasting disease in livestock and represents a serious socioeconomic burden across sub-Saharan Africa.

Methods and Results - We used random forest regression to: (i) build and integrate models of G. pallidipes habitat suitability and genetic connectivity across Kenya and northern Tanzania, and (ii) provide novel vector control recommendations. Inputs for the models included field-survey records from 349 trap locations, genetic data from 11 microsatellite loci from 659 flies and 29 sampling sites, and remotely sensed environmental data. The suitability and connectivity models explained approximately 80% and 67% of the variance in the occurrence and genetic data, respectively, and exhibited high accuracy based on cross-validation. The bivariate map showed that suitability and connectivity vary independently across the landscape, and was used to inform vector control recommendations. Post-hoc analyses show spatial variation in the correlations between the most important environmental predictors from our models and each response variable (i.e. suitability and connectivity), as well as heterogeneity in the expected future climatic change of these predictors.

Discussion - The bivariate map suggests vector control is most likely to be successful in the Lake Victoria basin, and supports the previous recommendation that most of eastern Kenya should be managed as a single unit. We further recommend that future monitoring efforts focus on tracking potential changes in vector presence and dispersal around the Serengeti and the Lake Victoria basin, based on projected local climatic shifts. The strong performance of the spatial models suggests potential for our integrative methodology to be used to understand future impacts of climate change in this and other vector systems.


A description of the methods used to collect and process this dataset is available in the corresponding paper (Bishop et al., 2021).

Usage notes

The Bishop2021_HabitatSuitability_Data.csv file contains the data used in the habitat suitability model (i.e. information about the trap locations). Abbreviations: TrapNo (trap number), Lat (latitude), Long (longitude), StartDate (date traps were set out), EndDate (date flies were collected from traps), NumberDays (number of days between StartDate and EndDate).
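As a quick sanity check on the trap data, the relationship between NumberDays, StartDate and EndDate can be verified after download. The following is a minimal sketch using only the Python standard library. It assumes that StartDate and EndDate are actual column names and that dates are stored in ISO format (YYYY-MM-DD); neither is confirmed by this page, so adjust `date_format` to match the file. The helper name `check_number_days` is hypothetical.

```python
import csv
import io
from datetime import datetime


def check_number_days(csv_text, date_format="%Y-%m-%d"):
    """Return the TrapNo of every row whose NumberDays differs from
    EndDate - StartDate.

    The date format is an assumption; change it to match the file.
    """
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = datetime.strptime(row["StartDate"], date_format)
        end = datetime.strptime(row["EndDate"], date_format)
        if (end - start).days != int(row["NumberDays"]):
            mismatches.append(row["TrapNo"])
    return mismatches
```

To run it on the real file, read the file contents into a string first, for example `check_number_days(open("Bishop2021_HabitatSuitability_Data.csv").read())`. An empty list means every row is internally consistent.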

The Bishop2021_GenConModel_AllData.csv file contains the data used in the genetic connectivity model. All columns starting with "BIO" are the median values of each bioclimatic variable along straight paths between sites. The "kernel" column contains the median values along straight paths between sites from the kernel density layer. The "pixvals" column contains the geographic distance between sites in units of pixels (1 km resolution). The "Distance" column contains the Cavalli-Sforza and Edwards’ chord (CSE) genetic distances between sites. See methods of the paper (Bishop et al., 2021) for more detail.
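For readers unfamiliar with the "Distance" column, the sketch below shows one common formulation of the Cavalli-Sforza and Edwards chord distance, D = (2 / (pi * r)) * sum over loci of sqrt(2 * (1 - sum over alleles of sqrt(x * y))), where r is the number of loci and x, y are the allele frequencies in the two sites. Several scalings of this distance appear in the literature, and this page does not state which software or scaling produced the values in the file, so treat this as an illustration rather than a reproduction of the authors' calculation. The function name and the input layout (one dict of allele frequencies per locus) are assumptions made for the example.

```python
import math


def cse_chord_distance(freqs_a, freqs_b):
    """Chord distance between two sites.

    freqs_a, freqs_b: lists with one dict per locus, mapping
    allele -> frequency in that site (frequencies sum to 1 per locus).
    Uses the 2 / (pi * r) scaling; other scalings exist.
    """
    if not freqs_a or len(freqs_a) != len(freqs_b):
        raise ValueError("both sites need the same non-zero number of loci")
    total = 0.0
    for locus_a, locus_b in zip(freqs_a, freqs_b):
        # Only alleles present in both sites contribute to the overlap term.
        shared = set(locus_a) & set(locus_b)
        overlap = sum(math.sqrt(locus_a[a] * locus_b[a]) for a in shared)
        # Guard against tiny negative values caused by rounding.
        total += math.sqrt(2.0 * max(0.0, 1.0 - overlap))
    return 2.0 * total / (math.pi * len(freqs_a))
```

Under this scaling, two sites with identical allele frequencies have a distance of 0, and two sites sharing no alleles at any locus reach the maximum of 2 * sqrt(2) / pi.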

The Gpd_KenTza_11loci_659indv_genepop.txt file contains the microsatellite genotypes for the 659 individuals used in this study in GenePop format, and the Gpd_KenTza_11loci_659indv_sample_info.csv file provides information about these individuals.
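The genotype file can be read without specialised software. The following is a minimal sketch of a GenePop reader, assuming the conventional layout: a title line, then locus names (one per line or comma-separated), then one "Pop" line before each group of individuals, with each individual written as an identifier, a comma, and one genotype code per locus. Each code is split in half to give the two alleles, and all-zero codes are conventionally missing data. The function name `parse_genepop` is hypothetical, and the exact layout of this particular file has not been checked here.

```python
def parse_genepop(text):
    """Parse GenePop-formatted text.

    Returns (loci, pops), where loci is a list of locus names and pops
    is a list of populations; each population is a list of
    (individual_id, [(allele1, allele2), ...]) tuples. Alleles are kept
    as strings so that missing-data codes such as "00" stay visible.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    loci, pops = [], []
    for ln in lines[1:]:  # lines[0] is the free-text title
        if ln.lower() == "pop":
            pops.append([])  # start a new population block
        elif not pops:
            # Before the first "Pop": locus names, possibly comma-separated.
            loci.extend(n.strip() for n in ln.split(",") if n.strip())
        else:
            ident, _, genos = ln.partition(",")
            alleles = []
            for code in genos.split():
                half = len(code) // 2
                alleles.append((code[:half], code[half:]))
            pops[-1].append((ident.strip(), alleles))
    return loci, pops
```

Given the description above, a correct parse of the file should yield 11 locus names and 659 individuals in total. Whether the "Pop" blocks correspond one-to-one with the 29 sampling sites is an assumption that should be checked against the sample information file.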


Foundation for the National Institutes of Health, Award: U01 AI115648

Foundation for the National Institutes of Health Fogarty Global Infectious Diseases Training Grant, Award: D43TW007391