A deep learning and digital archaeology approach for mosquito repellent discovery
Data files
Jun 16, 2025 version files (1.89 MB total)
- 1947-King-USDA_Dataset.csv (306.15 KB)
- 1954-King_dataset.csv (394.57 KB)
- 1967_USDA_datasetcsv.csv (500.41 KB)
- Fig5B.csv (57.94 KB)
- Fig5B.ipynb (170.86 KB)
- Fig5C.csv (10.13 KB)
- Fig5C.ipynb (90.68 KB)
- Fig6B.csv (1.47 KB)
- Fig6B.ipynb (74.80 KB)
- README.md (5.12 KB)
- README.txt (5 KB)
- Subplot_C_DEET_Curves.py (4.24 KB)
- Subplot_C_DEET_Heatmaps.py (11.71 KB)
- USDA_compounds.csv (256.78 KB)
Abstract
Insect-borne diseases kill >0.5 million people annually. Currently available repellents for personal or household protection are limited in their efficacy, applicability, and safety profile. Here, we describe a machine-learning-driven high-throughput method for the discovery of novel repellent molecules. To achieve this, we digitized a large, historic dataset containing ~19,000 mosquito repellency measurements. We then trained a graph neural network (GNN) to map molecular structure to repellency. We applied this model to select 317 candidate molecules to test in parallelizable behavioral assays, quantifying repellency in multiple insect vectors of disease-causing pathogens and in follow-up trials with human volunteers. The GNN approach outperformed a chemoinformatic model and produced a hit rate that increased with training data size, suggesting that both model innovation and novel data collection were integral to predictive accuracy. We identified >10 molecules with repellency similar to or greater than the most widely used repellents. We analyzed the neural responses from the mosquito antennal (olfactory) lobe (AL) to selected repellents and found strong responses to many of the tested compounds, including those predicted to be strong repellents. Results from the AL recordings also demonstrated a correlation between the evoked responses to strong repellents and our GNN representation. This approach enables computational screening of billions of possible molecules to identify empirically tractable numbers of candidate repellents, leading to accelerated progress towards solving a global health challenge.
Dataset DOI: https://doi.org/10.5061/dryad.73n5tb38b
This dataset supports the analyses presented in the study titled “A deep learning and digital archaeology approach for mosquito repellent discovery”. The data include experimental repellency assays conducted on ticks and mosquitoes, digitized historical datasets, and scripts for data analysis and figure generation.
Description of the data and file structure
The dataset consists of experimental data files and associated Jupyter notebooks and Python scripts used to analyze repellency data and generate the figures presented in the publication. The files are structured as follows:
- Fig5B.csv: Contains 18 columns documenting repellency testing against Anopheles stephensi. Empty fields indicate unavailable data. Below is a detailed account of the data in these columns.
1 - Molecule name: Unique alphanumeric identifier for each test compound.
2 - SMILES: Simplified Molecular Input Line Entry System. This column contains an ASCII string representing the chemical structure of each of the molecules tested.
3 - Collections: Provides details about batches and sets from which compounds were sourced, and information about the trials in which they were used.
Columns 4 to 18 contain raw data associated with “hand to cage” behavioral assays on Anopheles stephensi, as follows:
4 - Experiment number
5 - Duration of recording in minutes
6 - Concentration of the test compound (µg/cm²)
7 - Evaporation time (min)
8 - Compound volume (mL)
9 - Surface area (cm²)
10 - Material
11 - Volunteer name
12 - Volunteer alias
13 - Test Date
14 - Repellency (%)
15 - Average repellency (%)
16 - Average repellency (%) Standard Deviation (±)
17 - Average repellency (%) Count
18 - Concentration percentage (%)
- Fig5B.ipynb: Jupyter notebook for visualizing repellency performance using bar plots and strip plots, color-coded by the number of unique volunteers. Requires Python 3.8.17 with matplotlib.pyplot, pandas, numpy, and seaborn.
- Fig5C.csv: Contains binary repellency outcome data for Aedes aegypti and ticks. Columns 15–24 represent repellency response variables. The column labeled ED50 contains the effective dose at which the compound tested achieves 50% repellency. ED50_binary is a binary variable that indicates whether the ED50 value is above (1) or below (0) a threshold of 1.2.
- Fig5C.ipynb: Jupyter notebook generating joint plots comparing average repellency across the two taxa.
- Fig6B.csv: Contains repellency values (column 5) for three taxa: Anopheles (column 2), Aedes (column 3), and ticks (column 4), associated with specific chemical compounds (column 1). An empty cell means that the compound was not tested under that particular condition.
- Fig6B.ipynb: Notebook to generate comparative plots across taxa. Uses pandas, matplotlib.pyplot, and seaborn.
- USDA_compounds.csv: Digitized list of chemical compounds used to train the neural network. Repellency values for these compounds under different conditions in Aedes aegypti and Anopheles stephensi were collected from published documents (King, 1947, 1954; USDA, 1967).
- 1947-King-USDA_Dataset.csv: Moderately curated data table with repellency values on skin and clothing against yellow fever mosquitoes. Missing values are marked as N/A.
- 1954-King_dataset.csv: Moderately curated dataset including repellency on skin and clothing against yellow fever and malaria mosquitoes. Missing values are represented as ‘-’. The non-numeric value “4A” denotes maximum repellency, as defined by the original source.
- 1967_USDA_datasetcsv.csv: Digitized and moderately curated repellency dataset against Aedes aegypti in multiple conditions (e.g., skin, clothing, olfactometer). Values that could not be recovered by OCR are labeled as N/A, and cells that were blank in the original are marked as ‘-’.
- Subplot_C_DEET_Curves.py: Script for processing calcium imaging data from the antennal lobe of Ae. aegypti. Outputs response plots. Depends on numpy, matplotlib.pyplot, os, glob, and csv.
- Subplot_C_DEET_Heatmaps.py: Script for generating heatmap visualizations of average neural activity during chemical stimulation. Uses os, glob, csv, numpy, pandas, sklearn.decomposition.PCA, and matplotlib.pyplot.
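The missing-value markers and the ED50_binary definition described above can be handled along the following lines. This is a minimal sketch, not the authors' code: the column names (`compound`, `skin_rating`, `cloth_rating`), the sample rows, and the helper names `load_ratings` and `ed50_to_binary` are hypothetical, and treating an ED50 exactly equal to 1.2 as 0 is an assumption, since the description only covers values above or below the threshold.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical excerpt: column names and values are illustrative only,
# not copied from the deposited files.
SAMPLE = """compound,skin_rating,cloth_rating
A,4A,3
B,-,2
C,N/A,4A
D,1,-
"""


def load_ratings(source, missing=("N/A", "-"), max_marker="4A"):
    """Read a digitized table, mapping the missing markers to NaN.

    For every rating column, add a boolean flag for the maximum-repellency
    marker and a numeric copy in which non-numeric entries become NaN.
    """
    df = pd.read_csv(
        source,
        dtype=str,               # keep raw text so "4A" is not mangled
        keep_default_na=False,   # only the markers listed below count as missing
        na_values=list(missing),
    )
    rating_cols = [c for c in df.columns if c != "compound"]
    for col in rating_cols:
        df[col + "_is_max"] = df[col].eq(max_marker).fillna(False).astype(bool)
        df[col + "_num"] = pd.to_numeric(df[col], errors="coerce")
    return df


def ed50_to_binary(ed50, threshold=1.2):
    """Return 1.0 where ED50 > threshold, 0.0 otherwise, NaN where ED50 is missing."""
    ed50 = pd.to_numeric(ed50, errors="coerce")
    flags = (ed50 > threshold).astype(float)
    return flags.where(ed50.notna())


ratings = load_ratings(io.StringIO(SAMPLE))
binary = ed50_to_binary(pd.Series([0.5, 2.0, np.nan]))
```

To apply the same reader to one of the deposited files, pass its path in place of the in-memory sample and adjust the column handling to the actual header row.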
Sharing/Access Information
Data were derived from the following sources:
- King (1947, 1954)
- USDA (1967)
Digitized and curated versions of these datasets are included in this submission.
Code/Software
The dataset includes Jupyter notebooks and Python scripts. All scripts were run using Python 3.8.17 and rely on the following third-party packages and standard-library modules (os, glob, and csv ship with Python):
- pandas
- numpy
- matplotlib.pyplot
- seaborn
- os
- glob
- csv
- sklearn.decomposition.PCA (for calcium imaging heatmaps)
The notebooks are named by figure (e.g., Fig5B.ipynb, Fig6B.ipynb) and correspond to specific plots in the manuscript. Each notebook is self-contained and includes comments guiding the user through the data visualization process.
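As a rough illustration of the kind of per-compound summary that feeds a bar plot of repellency (mean, standard deviation, trial count, and number of unique volunteers), the sketch below aggregates a small made-up table with pandas. The column names and values are hypothetical stand-ins rather than the actual headers of Fig5B.csv, and the use of the sample standard deviation (the pandas default, ddof=1) is an assumption about how the deposited averages were computed.

```python
import pandas as pd

# Hypothetical rows; column names stand in for the molecule name,
# volunteer alias, and repellency (%) columns of the real table.
trials = pd.DataFrame(
    {
        "molecule": ["M1", "M1", "M1", "M2", "M2"],
        "volunteer": ["v1", "v2", "v2", "v1", "v1"],
        "repellency_pct": [80.0, 90.0, 100.0, 40.0, 60.0],
    }
)

# One row per molecule: mean, sample SD, number of trials,
# and number of distinct volunteers (usable for color-coding).
summary = (
    trials.groupby("molecule")
    .agg(
        mean_repellency=("repellency_pct", "mean"),
        sd_repellency=("repellency_pct", "std"),
        n_trials=("repellency_pct", "count"),
        n_volunteers=("volunteer", "nunique"),
    )
    .reset_index()
)

print(summary)
```

The resulting table can be passed to a plotting call such as seaborn's `barplot`, with `n_volunteers` mapped to color.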