Random forests for predicting species identity of forensically important blow flies (Diptera: Calliphoridae) and flesh flies (Sarcophagidae) using geometric morphometric data: proof of concept

Cite this dataset

Khang, Tsung Fei; Mohd Puaad, Nur Ayuni Dayana; Teh, Ser Huy; Mohamed, Zulqarnain (2024). Random forests for predicting species identity of forensically important blow flies (Diptera: Calliphoridae) and flesh flies (Sarcophagidae) using geometric morphometric data: proof of concept [Dataset]. Dryad. https://doi.org/10.5061/dryad.95x69p8hf

Abstract

Wing shape variation has been shown to be useful for delineating forensically important fly species in two Diptera families: Calliphoridae and Sarcophagidae. Compared to DNA-based identification, the cost of acquiring and analysing geometric morphometric data is much lower because the tools required are basic and stable software is available. However, to date, an explicit demonstration of using wing geometric morphometric data for species identity prediction in these two families has been lacking. Here, geometric morphometric data from 19 homologous landmarks on the left wing of males from seven species of Calliphoridae (n=55) and eight species of Sarcophagidae (n=40) were obtained and processed using Generalized Procrustes Analysis. The allometric effect was removed by regressing the Procrustes coordinates against centroid size (in log10). Subsequently, principal component analysis of the allometry-adjusted Procrustes variables was done, with the first 15 principal components used to train a random forests model for species prediction. Using a real test sample consisting of 33 male fly specimens collected around a human corpse at a crime scene, the estimated percentage of concordance between species identities predicted using the random forests model and those inferred using DNA-based identification was about 80.6% (approximate 95% confidence interval = [68.9%, 92.2%]). In contrast, baseline concordance using naive majority class prediction was 36.4%. The results provide proof of concept that geometric morphometric data have good potential to complement morphological and DNA-based identification of blow flies and flesh flies in forensic work.

README: Random forests for predicting species identity of forensically important blow flies (Diptera: Calliphoridae) and flesh flies (Sarcophagidae) using geometric morphometric data: proof of concept

https://doi.org/10.5061/dryad.95x69p8hf
Manuscript title: Random Forests for Predicting Species Identity of Forensically Important Blow Flies (Diptera: Calliphoridae)
and Flesh Flies (Diptera: Sarcophagidae) using Geometric Morphometric Data: Proof of Concept
Authors: T.F. Khang, N.A.D. Mohd Puaad, S.H. Teh, Z. Mohamed.
Email: tfkhang[at]um[dot]edu[dot]my or zulq[at]um[dot]edu[dot]my

Journal: Journal of Forensic Sciences
Date: 17 October 2020

Description of the data and file structure

  1. Image_data
    This folder contains three subfolders of JPEG wing images. Landmark coordinate data were taken from these images using the tpsDig2 software.
    The folder Calliphoridae contains JPEG images of the left wing taken from seven species.
    The folder Sarcophagidae contains JPEG images of the left wing taken from eight species.
    The folder Test samples contains JPEG images of the left wing taken from fly samples collected at a murder crime scene.

  2. tpsdata
    This folder contains 16 subfolders. One folder ("test_samples") contains the TPS data for the test samples from the folder "Test samples" in
    Image_data. The remaining subfolders are organised according to species, and they contain text files with the .TPS extension generated from
    the JPEG images in the Image_data folder.

  3. DNA_data
    This folder contains four files:
    (i) IQtree_result.NWK - The maximum likelihood phylogenetic tree (Fig. 2) inferred using IQ-TREE, in Newick file format.
    (ii) translatorX_result.FAS - A FASTA file containing the multiple sequence alignment result obtained using the translatorX server.
    (iii) trimmed_MSA_data.FAS - A FASTA file containing the multiple sequence alignment of the sequences used in Fig. 2, manually
    trimmed at the right end.
    (iv) sequence_data_partition.PART - A text file containing instructions for IQ-TREE to apply different DNA-substitution models
    to different codon positions of the multiple sequence alignment file "trimmed_MSA_data.FAS".

    In principle, one should be able to reproduce the tree in IQtree_result.NWK from files (ii)-(iv) using
    the Stable Release Version 1.6.12 (August, 2019) of IQ-TREE. Slight variations are possible if the job is submitted to the webserver
    at http://iqtree.cibiv.univie.ac.at/ since it may run a later version.

  4. GUIDE_files
    This folder contains files associated with the GUIDE program.
    (i) fly_gpa_pca_allo.csv - A csv file containing the training and test samples with their 15 principal component scores, based on
    the residuals from regressing Procrustes coordinates against log10 of centroid size.
    (ii) fly_gpa_pca_allo.DSC - The GUIDE description file, which specifies how the variables in fly_gpa_pca_allo.csv should
    be used by GUIDE.
    (iii) flywings_rf.IN - A text file containing the input parameters of the GUIDE program used for the analysis.

  5. mainscript.R
    This file is the R script used to process the raw data and run the analyses described in the manuscript.

Methods

The fly training set samples were taken from the archived fly collection at the Molecular Genetics Laboratory MP2, University of Malaya.

The fly test samples were collected in 2016 from a murder crime scene in the state of Selangor, Malaysia, where the victim was estimated to have been dead for more than 24 hours. The collected flies were anesthetised with ethyl acetate in a covered bottle.

The left wing of each fly specimen was detached after overnight relaxation, mounted onto a glass slide using euparal as the mounting medium, and covered with a coverslip. The slides were left overnight at 56°C to clear out bubbles. Wing images were captured using a digital camera (20X magnification) attached to a binocular microscope (Motic Microscope 2.0, China). For each specimen, coordinate data from 19 homologous landmarks on the left wing image were recorded by a single person (NADMP), using the tpsDig 2.0 (Version 2.17) software and saved in tps file format.
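For readers who want to inspect the landmark files without tpsDig, a TPS file is plain text: each specimen starts with an LM=<count> line, followed by that many rows of x and y coordinates, and then optional KEY=value lines such as IMAGE, ID and SCALE. The Python sketch below parses that layout under those assumptions. The function name parse_tps and the example values are hypothetical and are not taken from the dataset, and curve or outline sections are not handled.

```python
def parse_tps(text):
    """Parse TPS-format text into a list of specimen records."""
    specimens = []
    current = None
    remaining = 0  # landmark rows still expected for the current specimen
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.upper().startswith("LM="):
            # "LM=<n>" opens a new specimen with n landmark rows to follow.
            remaining = int(line.split("=", 1)[1])
            current = {"landmarks": [], "meta": {}}
            specimens.append(current)
        elif remaining > 0:
            # A landmark row: x and y separated by whitespace.
            x, y = line.split()[:2]
            current["landmarks"].append((float(x), float(y)))
            remaining -= 1
        elif current is not None and "=" in line:
            # Trailing metadata such as IMAGE=, ID= or SCALE=.
            key, value = line.split("=", 1)
            current["meta"][key.strip().upper()] = value.strip()
    return specimens


# Hypothetical three-landmark record, for illustration only.
example = """LM=3
10.0 20.0
30.5 22.0
18.0 40.25
IMAGE=wing_001.jpg
ID=0
SCALE=0.005
"""
records = parse_tps(example)
print(len(records), records[0]["landmarks"], records[0]["meta"])
```

For the files in this dataset, each record would be expected to carry 19 landmark rows, one per homologous landmark.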

Sequence data from the COII gene were obtained from eight Sarcophagidae test samples, and five out of 26 of the Calliphoridae test samples. DNA was extracted from two legs of a specimen using QIAamp® DNA and Blood Mini Kit (Qiagen, USA). For detailed molecular protocols, see the associated publication.

For processing and analysing geometric morphometric data, we used R Version 3.2.1. Generalized Procrustes Analysis (GPA) (geomorph R package, Version 3.0.3) was applied to two separate data sets: one containing only the training samples (for inspection of patterns of shape variation within and among species), and another containing both the training and the test samples (for prediction using random forests).
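The idea behind GPA can be illustrated with a minimal Python sketch for 2D landmarks. It is not the geomorph implementation used in the analysis: it only centres each configuration, scales it to unit centroid size, and iteratively rotates all configurations onto their mean shape, without refinements that dedicated packages may apply (for example tangent-space projection). Each landmark is stored as a complex number x + yj so that a rotation is a single multiplication. All function names and the triangle example are hypothetical.

```python
import math


def centroid_size(z):
    """Square root of the summed squared distances of landmarks from their centroid."""
    c = sum(z) / len(z)
    return math.sqrt(sum(abs(p - c) ** 2 for p in z))


def center_and_scale(z):
    """Translate the centroid to the origin and scale to unit centroid size."""
    c = sum(z) / len(z)
    cs = centroid_size(z)
    return [(p - c) / cs for p in z]


def rotate_to(z, ref):
    """Rotate a centred configuration z onto ref in the least-squares sense."""
    s = sum(p * q.conjugate() for p, q in zip(z, ref))
    if abs(s) == 0:
        return list(z)
    rot = s.conjugate() / abs(s)  # unit complex number = optimal rotation
    return [p * rot for p in z]


def gpa(configs, tol=1e-12, max_iter=100):
    """Return (aligned shapes, mean shape) for a list of 2D configurations."""
    shapes = [center_and_scale(z) for z in configs]
    ref = shapes[0]
    for _ in range(max_iter):
        shapes = [rotate_to(z, ref) for z in shapes]
        mean = [sum(col) / len(shapes) for col in zip(*shapes)]
        new_ref = center_and_scale(mean)
        change = sum(abs(a - b) ** 2 for a, b in zip(new_ref, ref))
        ref = new_ref
        if change < tol:
            break
    shapes = [rotate_to(z, ref) for z in shapes]
    return shapes, ref


# Hypothetical check: a triangle and a rotated, enlarged, shifted copy of it.
base = [0 + 0j, 4 + 0j, 1 + 3j]
turn = complex(math.cos(0.7), math.sin(0.7))
moved = [p * turn * 2.5 + (3 - 2j) for p in base]
aligned, mean_shape = gpa([base, moved])
gap = max(abs(a - b) for a, b in zip(aligned[0], aligned[1]))
print(gap < 1e-9)
```

Because the two triangles differ only in position, size and orientation, their aligned shapes coincide; real wings would retain residual shape differences, which are what the downstream analysis uses. Centroid size is computed before scaling, so it remains available for the allometry adjustment.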

The resulting Procrustes coordinates were regressed on the logarithm (base 10) of centroid size to remove potential effects of wing shape allometry. The regression residuals were subsequently transformed to uncorrelated principal component scores using principal component analysis (PCA).
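The allometry adjustment amounts to fitting, for each Procrustes coordinate separately, a simple linear regression on log10 of centroid size and keeping the residuals. The Python sketch below shows that step only, under the assumption of an ordinary least-squares fit with an intercept; the function name and the toy numbers are hypothetical, and the PCA that follows in the actual analysis is not shown.

```python
import math


def allometry_residuals(shape_rows, centroid_sizes):
    """Residuals from regressing each shape variable on log10(centroid size).

    shape_rows: one list of Procrustes coordinates per specimen.
    centroid_sizes: one positive centroid size per specimen.
    """
    x = [math.log10(cs) for cs in centroid_sizes]
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    n_vars = len(shape_rows[0])
    residuals = [[0.0] * n_vars for _ in range(n)]
    for j in range(n_vars):
        y = [row[j] for row in shape_rows]
        y_mean = sum(y) / n
        # Least-squares slope for a single predictor.
        slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
        for i in range(n):
            fitted = y_mean + slope * (x[i] - x_mean)
            residuals[i][j] = y[i] - fitted
    return residuals


# Toy data: the first variable depends exactly on log10(size), the second does not.
sizes = [10.0, 100.0, 1000.0]
rows = [[0.12, 0.5], [0.14, 0.7], [0.16, 0.3]]
resid = allometry_residuals(rows, sizes)
print(resid)
```

In the toy data, the first variable is fully explained by size, so its residuals vanish up to rounding, while the second keeps the variation that size does not explain. The residual columns are what would then be passed to PCA.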

Usage notes

The data set can be used as it is. See the README.txt file for descriptions of file contents.

Funding

University of Malaya, Award: PG074-2015A