Evaluation of BugBox, a software platform for AI-assisted bioinventories of arthropods
Data files
Oct 30, 2025 version files (3.73 MB total):
- BBE_Dryad_Full_Evaluation.xlsx (3.58 MB)
- BBE_dryad_version_training_stats.xlsx (138.17 KB)
- README.md (9.85 KB)
Abstract
Artificial intelligence (AI) technology has the potential to revolutionize entomology and biodiversity research, allowing entomologists to address biodiversity questions on a larger scale than ever before. A new software program, called BugBox, has been developed to facilitate large-scale arthropod bioinventories. BugBox uses an AI algorithm to rapidly classify arthropods from specimen photographs and calculates per-sample diversity indices from its classifications. We evaluated the performance of the AI algorithm over three consecutive training cycles by comparing the AI’s classifications to classifications by an expert human taxonomist. BugBox demonstrated substantial improvement in all test metrics over the three cycles as it was allowed to incorporate the human expert’s corrections into each new model version (e.g., raw accuracy improved from 44% to 78% over the three consecutive model versions). We also used both AI and human data to separately test the hypothesis that regenerative agricultural practices increase arthropod biodiversity in a bioinventory from central North American rangelands. AI classifications were strongly correlated with human identifications, and the AI drew the same conclusion as the human data when comparing diversity indices (Hill numbers): both found evidence that regenerative practices increased arthropod diversity. These results demonstrate that, while the AI was less accurate than the human, it is still able to provide useful surrogate data at scale very rapidly. It can also improve over time under the guidance of human expertise. This technology has profound implications for the scalability of entomological science.
Dataset DOI: 10.5061/dryad.7sqv9s51k
Principal Investigator Contact Information
Name: Kelton Welch
Institution: Ecdysis Foundation
Email: kelton.welch@ecdysis.bio
Dataset Overview
This dataset includes two spreadsheets with complete data used in analyses for Welch et al. (in review). This manuscript details the use of artificial intelligence (AI) technology to evaluate biodiversity in agricultural habitats in various regions of the United States and Canada. The research implements a work pipeline in which biodiversity surveys are conducted using the AI for identifications in conjunction with continual human reviews and model improvements over successive versions of the model.
The file "BBE_Dryad_Full_Evaluation.xlsx" contains data collected from agricultural habitats and classified by both the AI agent and a human expert (KDW).
The file "BBE_dryad_version_training_stats.xlsx" contains summaries of model-performance metrics from all versions of the machine-learning model included in the study.
Experiment Design and Data Structure
Data for this dataset represent six different experiments in six different cropping systems and regions. Data for all six experiments were collected using the same standardized methods (vegetation sweeps, quadrats and beat sheets), as described in Welch et al. (in review).
All arthropods collected in these samples were photographed, and the machine-learning model was asked to identify the arthropods according to its most recent training. Then, a human expert (KDW) reviewed the AI's identifications and corrected errors, recording the AI's accuracy on this "production data".
On a monthly basis, our local server was programmed to gather all photos that had been reviewed by the human expert, and retrain the AI model. Thus, "production data" from prior experiments became "training data" for future experiments. Each time a new training was run, performance metrics were recorded, a new model version was automatically released, and AI identifications from that moment forward were performed by the new model version.
Data Scope
Data for this research are a subset of the data from the broader 1000 Farms Initiative, in which broad-spectrum data have been collected on over 1300 regenerative and non-regenerative food-production operations across North America since 2022. Data included in this dataset were collected in 2022 and 2023 from a total of 136 different agricultural locations (fields, orchards and pastures) in 11 different U.S. States and one Canadian province, and organized into six datasets: NW Cherries, WA Apples, NW Wheat, MT Grains, MI Dairies and Rangelands.
Files and variables
File: BBE_dryad_version_training_stats.xlsx
Description: This file contains lists of categories (morphotaxa) that each version of the AI was trained to recognize, as well as the number of images, and the test metrics (precision, recall and F1) achieved for each category.
There are five tabs: four with complete morphotaxon lists, one for each model version evaluated in the study (1.8, 1.9, 1.11 and 1.17), and a fifth tab with training results summarized for all model versions 1.8 through 1.22.
Variables
Each tab has a row for each category (i.e., taxon) and a column for each variable.
- Category_name: This is a unique identifier associated with each morphotaxon or outgroup category. Morphotaxon names are given as the Family name and a three-digit number.
- Category_type: Categories the AI learns are divided into three types: Adult Morphotaxon, Immature Bin and Outbin.
- Adult Morphotaxon: These categories are for adult specimens that have been separated (but not necessarily identified) to roughly Species level. These categories are used to calculate biodiversity indices.
- Immature Bin: These categories are used for non-adult specimens. Immatures are pooled at the Family level, and are excluded from biodiversity index calculations.
- Outbin: These categories are used for specimens belonging to taxonomic outgroups, such as non-arthropods and arthropods deemed unidentifiable. These categories are excluded from biodiversity index calculations.
- Class, Order, Family, Subfamily and Genus_Species: These columns report all taxonomic information currently available (as of the date of publication) for each morphotaxon and outgroup category.
- Total_Images: The total number of images used for training, validation and testing.
- Training: The number of images used for training (80% of the total images).
- Validation: The number of images used for model validation (10% of the total images).
- Test: The number of images used for model testing (10% of the total images).
- TP: True positives (the number of images correctly assigned to this category during testing).
- FP: False positives (the number of images incorrectly assigned to this category during testing).
- TN: True negatives (the number of images correctly not assigned to this category during testing).
- FN: False negatives (the number of images incorrectly not assigned to this category during testing).
- Accuracy: The proportion of test images placed in the correct category. This metric can only be calculated for the full dataset, not for individual morphotaxa, so it is present only on the "version_train_results" tab.
- Precision: A test metric sensitive to false positives: the proportion of images assigned to this category that truly belong to it, TP / (TP + FP).
- Recall: A test metric sensitive to false negatives: the proportion of this category's images that were correctly assigned to it, TP / (TP + FN).
- F1: A composite test metric, the harmonic mean of Precision and Recall.
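As a concrete illustration (not code from the study), the per-category metrics above can be recomputed from the TP, FP and FN columns; the function name here is hypothetical:

```python
def classification_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from test-set confusion counts.

    Precision = TP / (TP + FP); Recall = TP / (TP + FN);
    F1 = harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Note that TN does not enter any of these per-category metrics; it is used only when computing whole-dataset accuracy (total correct assignments divided by total test images).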
File: BBE_Dryad_Full_Evaluation.xlsx
Description: This file organizes all biodiversity data analyzed in Welch et al. (in review).
In the file, data are organized into two tabs: the "Evaluation_Datasets" tab includes all the data from five different datasets (i.e., experiments), which were used during the preliminary "model evaluation" phase. The "Rangelands" tab includes data from the final experiment in rangelands of the Great Plains, which were used to compare biodiversity in rangelands managed regeneratively and conventionally.
In both tabs, there is a row for each sample (4 samples per sample site), and a column for each morphotaxon of arthropods collected. The first 27 columns contain metadata and sample-level statistics for each sample.
In both tabs, the first 7 rows are used to record metadata for the Morphotaxa used in the study. Here are the Row Headers:
- Morphotaxon: The identifier used for each Morphotaxon, consisting of the Family name and a three-digit number.
- "Order", "Family", and "Genus & Species": Taxonomic information for each morphotaxon.
- "Known to AI v1.x?": This row is marked "known" if model 1.x was trained on this morphotaxon, or "unknown" if model 1.x was not trained on this morphotaxon.
The first 27 columns record metadata and biodiversity index values for each sample. The next set of columns (979 in "Evaluation_Datasets" and 582 in "Rangelands") record count data for each individual Morphotaxon. Another set of columns beyond the Morphotaxa records count data for immature arthropods (which are not included in biodiversity index calculations).
Variables
- Site: A 4-digit code assigned to each sample site.
- Dataset: A name assigned to each of the 6 datasets (NW Cherries, WA Apples, NW Wheat, MT Grains, MI Dairies and Rangelands)
- Country: USA or Canada
- State_Province: Idaho, Kansas, Michigan, Montana, Nebraska, North Dakota, Oklahoma, Oregon, Saskatchewan, South Dakota, Texas or Washington
- Lat: Latitude
- Long: Longitude
- Crop_Habitat: The crop habitat sampled (alfalfa, apples, barley, cherries, corn, cover crop mix, hay, kernza, rangeland, soybean or wheat)
- Treatment: Regenerative or Conventional (only applicable for the Rangelands dataset)
- Sample_Type: Beat tray, Quadrat or Vegetation sweep
- Transect: One sample was taken along each of four 50m transects at each site
- Collection_Date: The date when samples were collected in the field
- Upload_Date: The date when specimen photographs were submitted to the AI for identification
- BugBox_Version: The model version that performed the identifications of the specimens in this sample
- Review_Date: The date when the human expert reviewed the sample, evaluated AI identifications, and performed his own specimen identifications
- Photos_Reviewed: The number of specimen photos that were taken
- BugBox_Correct: The number of specimen photos that were correctly identified by the AI, as determined by human expert review
- Prop_Correct: The proportion of specimen photos that were correctly identified
- N_human: The number of adult specimens (Abundance) in the sample, as determined by the human expert
- D0_human: Hill number D0 (Species Richness), as determined by the human expert
- D1_human: Hill number D1 (exponential Shannon index), as determined by the human expert
- D2_human: Hill number D2 (inverse Simpson index), as determined by the human expert
- Dinf_human: Hill number D-infinity (inverse Berger-Parker index), as determined by the human expert
- N_ai: The number of adult specimens (Abundance) in the sample, as determined by the AI
- D0_ai: Hill number D0 (Species Richness), as determined by the AI
- D1_ai: Hill number D1 (exponential Shannon index), as determined by the AI
- D2_ai: Hill number D2 (inverse Simpson index), as determined by the AI
- Dinf_ai: Hill number D-infinity (inverse Berger-Parker index), as determined by the AI
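For reference, the four Hill numbers reported in these columns follow the standard definitions (D0 = richness, D1 = exponential Shannon, D2 = inverse Simpson, D-infinity = inverse Berger-Parker). A minimal sketch, assuming counts are per-morphotaxon adult abundances within one sample (function name is illustrative, not from the study's code):

```python
import math

def hill_numbers(counts: list[int]) -> tuple[int, float, float, float]:
    """Compute Hill numbers (D0, D1, D2, D-infinity) from morphotaxon counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    p = [c / total for c in counts]                       # relative abundances
    d0 = len(p)                                           # species richness
    d1 = math.exp(-sum(pi * math.log(pi) for pi in p))    # exponential Shannon
    d2 = 1.0 / sum(pi * pi for pi in p)                   # inverse Simpson
    dinf = 1.0 / max(p)                                   # inverse Berger-Parker
    return d0, d1, d2, dinf
```

For a perfectly even sample all four numbers equal the richness; increasing dominance by a few taxa lowers D1, D2 and D-infinity while leaving D0 unchanged.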
These data are part of Ecdysis Foundation's 1000 Farms Initiative. Arthropods were collected along transects in agricultural fields, pastures and orchards across multiple states. The specimens were photographed and submitted to the machine-learning software BugBox for classification according to a morphospecies database maintained by Ecdysis Foundation. Submitted data were also reviewed and re-identified by a human expert to document the software's accuracy and to compare biodiversity calculations and hypothesis tests based on AI identifications versus human identifications.
