Machine learning analysis of wing venation patterns accurately identifies Sarcophagidae, Calliphoridae and Muscidae fly species
Data files
Jul 18, 2023 version files 2.67 GB
- preprocessedData.zip (3.48 MB)
- rawData.zip (2.66 GB)
- README.md (1.77 KB)
Abstract
In medical, veterinary, and forensic entomology, the ease and affordability of image data acquisition have resulted in whole-image analysis becoming an invaluable approach for species identification. Krawtchouk moment invariants are a classical mathematical transformation that can extract local features from an image, thus allowing subtle species-specific biological variations to be accentuated for subsequent analyses. We extracted Krawtchouk moment invariant features from binarised wing images of 759 male fly specimens from the Calliphoridae, Sarcophagidae, and Muscidae families (13 species and a species variant). Subsequently, we trained the Generalized, Unbiased, Interaction Detection and Estimation (GUIDE) random forests classifier using linear discriminants derived from these features and inferred the species identity of specimens from the test samples. Five-fold cross validation results show a 98.56 ± 0.38% (standard error) mean identification accuracy at the family level, and a 91.04 ± 1.33% mean identification accuracy at the species level. The mean F1-score of 0.89 ± 0.02 reflects good balance of precision and recall properties of the model. The present study consolidates findings from previous small pilot studies of the usefulness of wing venation patterns for inferring species identities. Thus, the stage is set for the development of a mature data analytic ecosystem for routine computer image-based identification of fly species that are of medical, veterinary, and forensic importance.
Authors: Ling, M.H., Ivorra, T., Heo, C.C., Wardhana, A.H., Hall, M.J.R., Tan, S.H., Mohamed, Z., Khang, T.F.
Email: tfkhang[at]um[dot]edu[dot]my
Date: 19 May 2023
This dataset contains the fly wing images used in the above-mentioned work for species identity inference
using the GUIDE random forests classifier. Researchers may find it a useful example of
how machine learning approaches can illuminate taxonomy.
Description of the Data and file structure
- Image data
  This is the dataset for the wing venation patterns of 13 fly species and a species variant from 3 families (Sarcophagidae, Calliphoridae, Muscidae).
  The images were captured from specimens from three collections using a digital camera and subsequently preprocessed using ImageJ and PixlrE.
  The rawData.zip file contains the raw image files for the 759 samples.
  The preprocessedData.zip file contains binarised versions of the raw image files.
- fly_wing_samples_metadata.pdf
  This file documents the locations where the specimens were collected and the collections from which they come.
- mainscript.R
  This file is the R script used to process the raw data and run the analyses described in the manuscript.
- GUIDE_random_forest.zip
  This zip file contains three types of text files associated with running GUIDE:
  (i) ".IN" input files; (ii) ".OUT" output files; (iii) ".PRO" class probabilities files.
  The filenames "gf_1" to "gf_5" refer to the runs for each fold of the five-fold cross-validation.
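The per-fold results from the five runs can be summarised as a mean with its standard error, which is the form in which the abstract reports accuracy. The following is a minimal sketch in Python, for illustration only (the authors' own analysis is in mainscript.R). The function name and the accuracy values are hypothetical placeholders, not results from this dataset.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error) for a list of per-fold scores.

    The standard error is the sample standard deviation (n - 1 denominator)
    divided by the square root of the number of folds.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two folds to estimate a standard error")
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se


# Hypothetical accuracies (%) for folds gf_1 .. gf_5; placeholders only,
# NOT values taken from the GUIDE output files.
fold_accuracy = [80.0, 82.0, 78.0, 84.0, 81.0]
mean, se = mean_and_standard_error(fold_accuracy)
print(f"{mean:.2f} +/- {se:.2f}% (standard error)")
```

The same function could be applied to per-fold F1-scores once they have been extracted from the output files.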
Sharing/access Information
The data published here are original and have not been published elsewhere.
The specimens used in this study came from three separate collections.
Collection 1 consists of specimens collected in Malaysia. It includes three Calliphoridae species (Ch. megacephala, Ch. nigripes, and Ch. rufifacies) and all five Sarcophagidae species. The specimens were collected from various geographical localities and habitats in Malaysia (e.g., primary forests, farms, mangrove swamps, beaches, and national parks). Flies were collected by sweeping with a handheld insect net, with decomposed beef as bait.
Collection 2 consists of specimens collected in the province of Alicante, Spain. It includes three Calliphoridae species (C. vicina, Ch. albiceps (normal and wing mutant variant), and L. sericata) and one Muscidae species (Sy. nudiseta). C. vicina and L. sericata specimens were captured using pork liver baits. Ch. albiceps and Sy. nudiseta specimens were reared from larvae obtained during a human autopsy at the Institute of Legal Medicine of Alicante (IMLA, Spain).
Collection 3 consists of specimens collected mostly from islands of Indonesia. It includes three Calliphoridae species: Ch. bezziana (collected from Java, Sulawesi, Sumatra, and Sumba islands, and Malaysia; 4 specimens from Africa, 1 from India), Ch. megacephala (collected from Java, Kalimantan, Lombok, Sumatra, Sulawesi, West Papua, and West Timor islands), and Ch. rufifacies (collected from Sumatra and Sumba islands). Ch. bezziana samples were reared in the laboratory from larvae found in a myiasis-infected wound. The Ch. megacephala and Ch. rufifacies samples were captured using a Lucitrap Modification (LTM) or a sticky trap with Bezzilure as bait.
We binarised all raw image data to focus on the wing venation patterns and to remove unnecessary features such as background noise and wing membrane details. ImageJ version 1.53k was used to binarise the images, and PixlrE (https://pixlr.com/e/) was used for manual denoising of the binarised images. Different configurations were applied to different sets of images to accentuate the wing venation patterns. Raw images that could not be properly binarised or that contained broken venation patterns were removed. The images were centered and then oriented with the wing costa parallel to the horizontal axis. Subsequently, they were cropped to 724 x 254 pixels and saved in PNG file format. We further resized the images to 256 x 90 pixels to prevent the machine learning model from learning features that are unnecessary for identification and to improve model training speed. The time to pre-process each image ranged from 3 to 8 minutes, with noisier images requiring more time to process.
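The binarise-then-resize steps described above can be illustrated with a small standard-library Python sketch. It is not the authors' procedure, which used ImageJ and PixlrE with settings that varied between image sets. The fixed threshold of 128, the nearest-neighbour sampling, the function names, and the synthetic test image are all assumptions made for illustration.

```python
def binarise(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 0/1 pixels.

    Pixels darker than the threshold become 1 (vein), all others 0.
    The fixed threshold is an assumption; the original work applied
    different configurations to different sets of images.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def resize_nearest(img, new_width, new_height):
    """Resize a 2D list of pixels with nearest-neighbour sampling."""
    old_height, old_width = len(img), len(img[0])
    return [
        [
            img[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


# Synthetic 724 x 254 grayscale image: white background with one dark
# horizontal band standing in for a vein. Not real data.
width, height = 724, 254
gray = [[255] * width for _ in range(height)]
for row in range(125, 130):
    gray[row] = [0] * width

small = resize_nearest(binarise(gray), 256, 90)
print(len(small[0]), "x", len(small))  # 256 x 90
```

Nearest-neighbour sampling keeps every pixel strictly 0 or 1, but it can drop lines thinner than the downscaling factor, so a real pipeline may prefer an interpolating resize followed by re-thresholding.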
The binarised image files can be read using R for subsequent analyses. The raw image files are in TIF or PNG format and the binarised image files are in PNG format. Both formats can be opened using standard image software.
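Before analysis, it can be useful to confirm that each PNG file has the expected pixel dimensions. The sketch below, in standard-library Python, reads the width and height from the IHDR chunk at the start of a PNG byte stream. The function name is hypothetical and the demonstration uses a hand-built header, not a file from this dataset.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # Bytes 8-11: chunk length, 12-15: chunk type, 16-23: width and height.
    if data[12:16] != b"IHDR":
        raise ValueError("IHDR chunk missing")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Self-contained demonstration on a hand-built header (not a real image).
header = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)          # IHDR data length
    + b"IHDR"
    + struct.pack(">II", 256, 90)    # width, height (big-endian)
)
print(png_dimensions(header))  # (256, 90)
```

For a file on disk, reading the first 24 bytes in binary mode and passing them to `png_dimensions` is sufficient.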
