Dryad

Data for: Morphological species delimitation in the Western Pond Turtle (Actinemys): Can machine learning methods aid in cryptic species identification?

Cite this dataset

Angielczyk, Kenneth et al. (2024). Data for: Morphological species delimitation in the Western Pond Turtle (Actinemys): Can machine learning methods aid in cryptic species identification? [Dataset]. Dryad. https://doi.org/10.5061/dryad.wm37pvmv1

Abstract

As the discovery of cryptic species has increased in frequency, there has been interest in whether geometric morphometric data can detect fine-scale patterns of variation that can be used to morphologically diagnose such species. We used a combination of geometric morphometric data and an ensemble of five supervised machine learning methods to investigate whether plastron shape can differentiate two putative cryptic turtle species, Actinemys marmorata and Actinemys pallida. Actinemys has been the focus of considerable research due to its biogeographic distribution and conservation status. Despite this work, reliable morphological diagnoses for its two species are still lacking. We validated our approach on two datasets, one consisting of eight morphologically disparate emydid species, and the other consisting of two subspecies of Trachemys (T. scripta scripta, T. scripta elegans). The validation tests returned near-perfect classification rates, demonstrating that plastron shape is an effective means for distinguishing taxonomic groups of emydids via machine learning methods. By contrast, the same methods did not return high classification rates for a set of alternative phylogeographic and morphological binning schemes in Actinemys. All classification hypotheses performed poorly relative to the validation datasets and no single hypothesis was unequivocally supported for Actinemys. Two hypotheses had machine learning performance that was marginally better than our remaining hypotheses. In both cases, those hypotheses favored a two-species split between A. marmorata and A. pallida specimens, lending tentative morphological support to the hypothesis of two Actinemys species. However, the machine learning results also underscore that Actinemys as a whole have lower levels of plastral variation than other turtles within Emydidae, but the reason for this morphological conservatism is unclear.

README: Data for: Morphological species delimitation in the Western Pond Turtle (Actinemys): Can machine learning methods aid in cryptic species identification?

https://doi.org/10.5061/dryad.wm37pvmv1

This dataset includes:

  • 2D landmark data for the turtle plastra used in the paper
  • classification data for the different Actinemys marmorata subgroups tested
  • R scripts and formatted data used in the analyses described in the paper
  • photographic voucher images of specimens observed in the field and in a private collection that are included in the dataset but that are not accessioned in a natural history museum

Description of the data and file structure

The top directory of the dataset includes three sub-directories (morphometric_data; R; turtle_images) and three standalone files (GIS_workflow.pdf; institutional_abbreviations.txt; marmorata_metadata_rounded_coordinates.csv).

  • The morphometric_data directory includes the 2-dimensional landmark coordinate data used in the geometric morphometric analyses. Within this directory, the marmorata_data directory includes a .txt file with the landmark data for Actinemys marmorata. Each line in the file corresponds to a specimen, and the landmark coordinates are organized as X1, Y1, X2, Y2, ..., Xn, Yn, centroid size. Identifications for the specimens are given in the .csv file in the directory. Institutional abbreviations are provided in the institutional_abbreviations file in the top directory. Note that the landmark coordinate data provided in the file are the result of the reflection/averaging process for symmetric landmarks described in the main text, so they describe the shape of 'half' plastra. The other_species_data directory includes the 2-dimensional landmark coordinate data for the eight other emydine species examined in the paper, the outgroup species Chrysemys picta, and the two subspecies of Trachemys scripta. The landmark data are organized into separate .txt files for each species; each species file has a corresponding .csv file that provides specimen identifications. Landmark coordinates in these files are organized in the same fashion as for the A. marmorata data, and data for symmetric landmarks have been reflected and averaged to produce 'half' plastra. The digitization_error_test_data directory includes the 2-dimensional landmark coordinate data for the replicate digitizations of a single specimen used to quantify digitization error. The landmark data are presented in the .txt file in the directory, and a .jpg image of the specimen used for the test is also included. The coordinate data in this file are organized in the same fashion as for the A. marmorata dataset. Note, however, that this file includes data for all 19 landmarks (i.e., symmetric landmarks were not reflected and averaged for this file).
  • The turtle_images directory includes .jpg photographs of 81 specimens that are included in the dataset, but that are not accessioned into a public trust natural history museum. Therefore, these images serve as vouchers for these specimens and were the source of the morphometric data collected for these specimens. The majority of the specimens pertain to A. marmorata, but a few specimens of other species are included. The file names of the images are used to identify the specimens in the .csv files in the morphometric_data directory.
  • The R directory contains formatted input data (.csv files) corresponding to the different binning schemes used in the machine learning analyses: two validation schemes ('emydines'; 'trachemys') and six alternate binning schemes for A. marmorata ('Morpho'; 'SP10_1'; 'SP10_2'; 'SP10_3'; 'SP14_1'; 'SP14_2'). Commented R scripts (.R files) used to analyze these data are available as software on Zenodo.
  • The institutional_abbreviations.txt file provides an explanation of institutional abbreviations used to identify specimens in the dataset.
  • The marmorata_metadata_rounded_coordinates.csv file provides metadata used in the analyses for A. marmorata specimens examined here. The column 'individual' corresponds to the individual number in the various morphometric data files. The column 'specimen' provides the institutional abbreviations and specimen numbers for the specimens. The columns 'lat' and 'long' provide latitude and longitude data for the specimens. Because A. marmorata has an IUCN Red List classification of Threatened, we followed best practices and provided coordinate data rounded to the nearest 0.1 degree. Unrounded data are available from the authors on request. The columns 'sp10.1', 'sp10.2', 'sp10.3', 'sp14.1', 'sp14.2', and 'morph' correspond to the binning schemes described in the paper. The bin names used in these columns correspond to the bins shown in Figure 1 of the paper. Instances for which a specimen could not be assigned to a bin are labeled NA. The column 'sex' provides our sex determination for the specimen (M = male; F = female). Specimens for which sex could not be determined are labeled NA. The column 'inguinal scute' provides our determination of whether inguinal scutes were present or absent (0 = absent; 1 = present). Specimens for which the presence/absence of inguinal scutes could not be determined are labeled NA. The columns 'average flow rate', 'slope', and 'altitude' provide the values for a given specimen that were calculated from the GIS data following the procedures described in the main text and the GIS_Workflow.pdf document. Many specimens in the metadata file have blank entries in all columns except 'individual' and 'specimen'. The majority of these specimens are sub-adults that were not included in the analyses described in the paper and therefore do not have the associated metadata. A small number of additional specimens had no available locality data, or had locality data that were too imprecise to allow confident geo-referencing.
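
As an illustration of the landmark file layout described above (one specimen per line, organized as X1, Y1, ..., Xn, Yn, centroid size), the following Python sketch splits a line into (x, y) landmark pairs and a centroid size. It is not part of the archived R scripts. The function names, the Specimen type, and the example values are hypothetical, and the delimiter (commas and/or whitespace) is an assumption that should be checked against the actual .txt files.

```python
import re
from typing import List, NamedTuple, Tuple


class Specimen(NamedTuple):
    landmarks: List[Tuple[float, float]]  # (x, y) pairs in file order
    centroid_size: float


def parse_landmark_line(line: str) -> Specimen:
    """Parse one specimen row laid out as X1, Y1, ..., Xn, Yn, centroid size."""
    # Assumption: values are separated by commas and/or whitespace.
    values = [float(tok) for tok in re.split(r"[,\s]+", line.strip()) if tok]
    # 2n coordinates plus one centroid size gives an odd count of at least 3.
    if len(values) < 3 or len(values) % 2 == 0:
        raise ValueError("expected 2n coordinates followed by centroid size")
    coords, centroid_size = values[:-1], values[-1]
    landmarks = list(zip(coords[0::2], coords[1::2]))
    return Specimen(landmarks, centroid_size)


def parse_landmark_text(text: str) -> List[Specimen]:
    """Parse a file's contents, one specimen per non-empty line."""
    return [parse_landmark_line(ln) for ln in text.splitlines() if ln.strip()]


# Made-up values for illustration only: three landmarks, then centroid size.
example = "0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 12.5\n"
specimens = parse_landmark_text(example)
```

Because specimen identifications are stored in the companion .csv files rather than in the landmark files, the order of parsed specimens would need to be matched against those .csv files by row position or individual number.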

GIS Workflow

The GIS_Workflow.pdf file available on Zenodo describes the process by which elevation, slope, and flow estimates used in some analyses were extracted from publicly available datasets. The resulting values are presented in the marmorata_metadata_rounded_coordinates.csv file in the Dryad archive.
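
A minimal Python sketch of how the GIS-derived columns might be read from the metadata file is given here. It assumes that the header names match those quoted in this README ('average flow rate', 'slope', 'altitude') and that missing values appear as empty cells or as NA. The function names and the sample rows are hypothetical and do not reproduce real specimens.

```python
import csv
import io
from typing import Dict, List, Optional

# Assumption: these header names match the actual CSV exactly.
GIS_COLUMNS = ("average flow rate", "slope", "altitude")


def is_missing(value: Optional[str]) -> bool:
    """Treat absent cells, empty cells, and the literal string NA as missing."""
    return value is None or value.strip() in ("", "NA")


def rows_with_gis_values(csv_file) -> List[Dict[str, object]]:
    """Return only rows with all three GIS values, converted to floats."""
    kept = []
    for row in csv.DictReader(csv_file):
        if any(is_missing(row.get(col)) for col in GIS_COLUMNS):
            continue  # e.g. sub-adults or specimens without usable localities
        record: Dict[str, object] = dict(row)
        for col in GIS_COLUMNS:
            record[col] = float(row[col])
        kept.append(record)
    return kept


# Made-up rows for illustration only.
sample = io.StringIO(
    "individual,specimen,lat,long,average flow rate,slope,altitude\n"
    "1,XXX 0001,38.5,-122.1,3.2,1.5,210\n"
    "2,XXX 0002,,,,,\n"
    "3,XXX 0003,36.0,-121.0,NA,2.0,95\n"
)
complete = rows_with_gis_values(sample)
```

To apply the same function to the archived file, an open file handle for marmorata_metadata_rounded_coordinates.csv (opened with newline="") can be passed in place of the in-memory sample.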

Sharing/Access information

Aside from the specimens whose photos are included in the turtle_images directory, data used in this analysis were collected from photos taken of turtle specimens from the collections of natural history museums that hold their material in the public trust. Institution names and specimen numbers are provided here, and qualified researchers can access the specimens by contacting their respective host institutions.

Funding

National Science Foundation, Award: DBI-0306158

National Science Foundation, Award: DBI-0353797

National Institute of General Medical Sciences, Award: K12GM102778