Tsetse fly wing landmark data for morphometrics (Vol 20, 21)
Data files
Dec 02, 2022 version files 830.62 MB
- morphometric_data.csv (12.78 MB)
- README.md (2.63 KB)
- tsetsedata_2019_left_commas.zip (285.92 MB)
- tsetsedata_2019_right_commas.zip (531.92 MB)
Apr 06, 2023 version files 3.72 GB
- missing_landmarkwings_L.zip (2.65 GB)
- model_resnet50_regressor_finetuned.pth (94.54 MB)
- model_unet___segment_finetune_.pth (146.65 MB)
- morphometric_data.csv (12.78 MB)
- README.md (3.18 KB)
- tsetsedata_2019_left_commas.zip (285.92 MB)
- tsetsedata_2019_right_commas.zip (531.92 MB)
Abstract
Single-wing images were captured from 14,354 pairs of field-collected tsetse wings of species Glossina pallidipes and G. m. morsitans and analysed together with relevant biological data. To answer research questions regarding these flies, we need to locate 11 anatomical landmark coordinates on each wing. The manual location of landmarks is time-consuming, prone to error, and simply infeasible given the number of images. Automatic landmark detection has been proposed to locate these landmark coordinates. We developed a two-tier method using deep learning architectures to classify images and make accurate landmark predictions. The first tier used a classification convolutional neural network to remove most wings that were missing landmarks. The second tier provided landmark coordinates for the remaining wings. For the second tier, we compared direct coordinate regression using a convolutional neural network with segmentation using a fully convolutional network. For the resulting landmark predictions, we evaluated shape bias using Procrustes analysis. We employed a data-centric approach, paying particular attention to consistent labelling and data augmentations in the training data to improve model performance. The classification model used for the first tier achieved perfect classification on the test set. For an image size of 1024×1280, data augmentation reduced the mean pixel distance error from 8.3 (95% CI [4.4,10.3]) to 5.34 (95% CI [3,7]) for the regression model. For the segmentation model, data augmentation did not alter the mean pixel distance error of 3.43 (95% CI [1.9,4.4]). Segmentation had a higher computational complexity and some large outliers. Both models showed minimal shape bias. We chose to deploy the regression model on the complete unannotated data since it had a lower computational cost and more stable predictions than the segmentation model. The resulting landmark dataset is provided for future morphometric analysis.
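As a rough illustration of the "mean pixel distance error" quoted above, the sketch below averages the Euclidean distance between predicted and annotated landmark positions. The exact averaging used in the linked article (per landmark or per image) is not stated here, so this is an assumption; the function name and the toy coordinates are made up for illustration.

```python
import math


def mean_pixel_distance(predicted, annotated):
    """Mean Euclidean distance, in pixels, between matching landmarks.

    Both arguments are lists of wings; each wing is a list of (x, y)
    tuples given in the same landmark order.
    """
    distances = []
    for pred_wing, true_wing in zip(predicted, annotated):
        if len(pred_wing) != len(true_wing):
            raise ValueError("each wing needs the same number of landmarks")
        for (px, py), (tx, ty) in zip(pred_wing, true_wing):
            distances.append(math.hypot(px - tx, py - ty))
    return sum(distances) / len(distances)


# Made-up toy values: one wing with two landmarks instead of 11.
predicted = [[(103.0, 204.0), (500.0, 610.0)]]
annotated = [[(100.0, 200.0), (500.0, 600.0)]]
error = mean_pixel_distance(predicted, annotated)
print(error)
```

Here the two landmark errors are 5 and 10 pixels, so the mean is 7.5 pixels.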
Methods
These data were collected via field traps designed to catch tsetse flies. The wings were removed from the flies and laminated on an A4 sheet of paper along with various biological recordings from a lab dissection of each fly. These records were subsequently digitised by entering the data for each fly in Excel spreadsheets. A microscope camera was used to capture a digital image of each fly wing at a resolution of 1024×1280. A subset of images was annotated and used to train machine learning models. The wing images were then given as inputs to the machine learning models, which located and recorded various landmarks in each wing image. These landmarks were appended to the dataset of biological recordings taken during the lab dissection. The data were then processed to remove outliers and other erroneous instances.
The different files in the dataset are described below.
morphometric_data.csv
Column names and description
- vpn: Filename, identified by the volume (v), page (p), and number (n) of the fly. The numbers go up to 20 (20 pairs of wings per page)
- cd: Day of month captured
- cm: Month of year captured
- cy: Calendar year captured
- md: Capture method
- g: Genus
- s: Sex; 1 = male; 2 = female
- c: Ovarian age category i.e. the number of times a female has ovulated, varying from 0 to 7.
- wlm: Wing length in millimeters (measured from landmark 1 to 6)
- f: Wing fray, varying from 1 to 6
- lmkr: Number of missing landmarks for the right wing (Not accurate)
- lmkl: Number of missing landmarks for the left wing (Not accurate)
- hc: Hatchet cell measurement in millimeters (measured from landmark 11 to 7). This was sometimes measured instead of the wing length if landmark 1 or 6 was missing
- left_good: Classifier prediction for the left wing; 0 = incomplete wing; 1 = complete wing
- right_good: Classifier prediction for the right wing; 0 = incomplete wing; 1 = complete wing
- l<x/y><#>: The remaining columns give the pixel location of landmark #, as x and y coordinates
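The column layout above can be read with Python's standard `csv` module. The sketch below keeps only rows that the classifier flagged as complete and collects their landmark coordinates. It assumes the coordinate columns are literally named `lx1`, `ly1`, ..., `lx11`, `ly11`, which is one plausible reading of the `l<x/y><#>` pattern and should be checked against the real header. The README also does not say which flag column (`left_good` or `right_good`) the coordinates correspond to, so the flag column is left as a parameter. The function names and sample rows are made up for illustration.

```python
import csv
import io

N_LANDMARKS = 11  # the abstract describes 11 landmarks per wing


def is_flag_set(value):
    """True if a classifier flag such as left_good reads as 1."""
    try:
        return float(value) == 1.0
    except (TypeError, ValueError):
        return False


def landmark_points(row, n_landmarks=N_LANDMARKS):
    """Return [(x, y), ...] for one row, or None if any coordinate is blank.

    Assumes columns named lx1, ly1, ... (hypothetical reading of l<x/y><#>).
    """
    points = []
    for i in range(1, n_landmarks + 1):
        x, y = row.get(f"lx{i}"), row.get(f"ly{i}")
        if x in ("", None) or y in ("", None):
            return None
        points.append((float(x), float(y)))
    return points


def complete_wings(csv_file, flag_column="left_good", n_landmarks=N_LANDMARKS):
    """Yield (vpn, points) for rows whose flag column marks a complete wing."""
    for row in csv.DictReader(csv_file):
        if not is_flag_set(row.get(flag_column)):
            continue
        points = landmark_points(row, n_landmarks)
        if points is not None:
            yield row["vpn"], points


# Made-up sample with placeholder vpn values and two landmarks per wing.
sample = io.StringIO(
    "vpn,left_good,lx1,ly1,lx2,ly2\n"
    "a,1,10,20,13,24\n"
    "b,0,11,21,14,25\n"
    "c,1,12,22,,\n"
)
wings = list(complete_wings(sample, flag_column="left_good", n_landmarks=2))
print(wings)
```

For the real file, the same function could be called on a handle opened with `open("morphometric_data.csv", newline="")`, assuming that is the CSV described here. Row `b` is dropped because its flag is 0, and row `c` is dropped because a coordinate is blank.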
The zipped files tsetsedata_2019_right_commas.zip and tsetsedata_2019_left_commas.zip contain all the images and landmark coordinate labels that were used to train the landmark detection models referred to in the linked article.
The files model_unet___segment_finetune_.pth and model_resnet50_regressor_finetuned.pth are the trained models used for landmark detection.
The file missing_landmarkwings_L.zip contains all the broken wings used for training the classifier model.
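Because the archives are large, it can help to inspect their contents before extracting anything. The internal folder layout and file types are not documented here, so the sketch below only counts archive members by file extension using the standard `zipfile` module. The helper name and the stand-in archive with its member names are made up for illustration.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_by_extension(archive):
    """Count archive members by lower-cased extension, skipping directories."""
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            counts[PurePosixPath(info.filename).suffix.lower()] += 1
    return counts


# In-memory stand-in archive with made-up member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("wings/example_1.jpg", b"")
    zf.writestr("wings/example_2.JPG", b"")
    zf.writestr("labels.txt", b"")
buffer.seek(0)

counts = count_by_extension(buffer)
print(dict(counts))
```

The same function accepts a path on disk, for example `count_by_extension("tsetsedata_2019_left_commas.zip")`, and reads only the archive index rather than decompressing the images.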