Data from: A novel approach to quantifying mammal locomotor repertoires using scoring and cluster analysis
Data files (Dec 11, 2025 version; 91.87 KB in total)
- Anderson_Additional_Data_and_Results_ESM2.xlsx (82.27 KB)
- README.md (9.60 KB)
Abstract
Describing behaviour using qualitative categories is a staple of studies on tetrapod functional morphology, ecology, and evolution. However, such categorisation has several issues, primarily subjectivity and the loss of important behavioural repertoire information. Here, we propose a novel method for quantifying behaviour, using mammal locomotion as a case study to demonstrate its utility and efficacy. Species are scored from 0-4 on their proficiency in five locomotor modes (swimming, climbing, digging, running and aerial movement), then Ward’s hierarchical clustering is used on the resulting data matrix to group species into biologically informative categories (the number of which can be optimised using clustering validation methods), thus producing a mathematically defined categorical variable. The method is demonstrated on a dataset of 250 mammal species, representing every extant mammal family. We show that this approach successfully quantifies mammal locomotion, producing both a data matrix that can be used as a set of covariates in multivariate analyses and a categorical variable. This method introduces a replicable technique for quantifying animal behaviour and subsequently deriving a categorical variable, which is highly versatile and can be tailored to analyse a variety of other behaviours and taxonomic groups for widespread use across evolutionary and ecological research.
Data for:
Sophia C. Anderson, Philip G. Cox, Laura C. Fitton, Karl T. Bates, Eloy Gálvez-López (2025) A novel approach to quantifying mammal locomotor repertoires using scoring and cluster analysis. Royal Society Proceedings B. DOI 10.1098/rspb.2025.2515
This README.md file was generated on 10/12/2025
GENERAL INFORMATION
1. Title of dataset: ‘Data for: A novel approach to quantifying mammal locomotor repertoires using scoring and cluster analysis’
2. Date of data collection: 2023-2025
3. Keywords: cluster analysis, Ward's hierarchical clustering, mammal locomotion, behaviour quantification
4. Language: English (UK)
DATA AND FILE OVERVIEW
1. Description of dataset
This dataset contains raw data in which 250 extant mammal species have been scored from 0-4 on their ability to engage in five types of locomotion (swimming, climbing, digging, running, and aerial movement), where 0 = physically unable and 4 = extremely able.
Scoring follows strictly laid out criteria available in the associated manuscript submitted to Proceedings of the Royal Society B and is based on literature review (full reference list for scoring is provided in File 2).
The scoring matrix was then used as the input for Ward's hierarchical cluster analysis and subsequent cluster number optimisation. This process gave an optimal cluster number of 9, and each species was then assigned to one of the 9 clusters.
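As a rough illustration of the clustering step, here is a minimal Python sketch of Ward's minimum-variance agglomeration on a toy score matrix. It is not the authors' R code (that is File 1). The function name `ward_clusters`, the toy rows, and the choice of the squared-Euclidean form of Ward's criterion are all assumptions made for illustration.

```python
from itertools import combinations


def ward_clusters(rows, k):
    """Agglomerate rows into k clusters using Ward's minimum-variance criterion."""
    # Each cluster is (member indices, centroid); start with one cluster per row.
    clusters = [([i], [float(v) for v in row]) for i, row in enumerate(rows)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if clusters a and b merge.
        (members_a, centre_a), (members_b, centre_b) = a, b
        dist2 = sum((x - y) ** 2 for x, y in zip(centre_a, centre_b))
        na, nb = len(members_a), len(members_b)
        return na * nb / (na + nb) * dist2

    while len(clusters) > k:
        # Find the pair of clusters that is cheapest to merge (i < j always).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        (members_a, centre_a), (members_b, centre_b) = clusters[i], clusters[j]
        na, nb = len(members_a), len(members_b)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(centre_a, centre_b)]
        del clusters[j]  # j > i, so index i stays valid
        clusters[i] = (members_a + members_b, centroid)

    # Number the surviving clusters 1..k and report one label per input row.
    labels = [0] * len(rows)
    for label, (members, _) in enumerate(clusters, start=1):
        for m in members:
            labels[m] = label
    return labels


# Hypothetical scores (Swimming, Climbing, Digging, Running, Aerial); not from the dataset.
toy = [
    [4, 0, 0, 0, 0],
    [4, 0, 0, 1, 0],
    [1, 4, 0, 2, 0],
    [1, 4, 1, 2, 0],
    [0, 1, 0, 1, 4],
    [0, 2, 0, 1, 4],
]
print(ward_clusters(toy, k=3))  # [1, 1, 2, 2, 3, 3]
```

The cluster numbers themselves are arbitrary; only the grouping matters. R's `hclust` offers two Ward variants (`ward.D` and `ward.D2`), and this README does not state which was used, so results from the sketch should not be expected to reproduce the published clusters.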
The annotated R code used to carry out analyses for the associated manuscript is provided here in File 1.
Additionally, interobserver repeatability of the scoring process was investigated on a subset of 30 species from the original 250 which were independently scored by 8 individuals. Results are provided in Files 2 & 4.
2. File list
File 1: Anderson_Annotated_R-Script_ESM1.txt (Zenodo, Software)
Text file containing annotated R code used to perform all the analyses in the associated manuscript, using the datasets provided here.
File 2: Anderson_Additional_Data_and_Results_ESM2.xlsx
This workbook contains 8 sheets: 'Locomotor scores', 'Reference list', 'Polychoric PC scores', 'Interobserver scoring matrices', 'Interobserver Medians ± QDs', 'Participant information', 'P1-P8 results', and 'Interobserver LMs'.
'Locomotor scores' sheet
Table containing 250 rows (no. of species in dataset) and 12 columns (5 of which contain scoring data, and 7 of which contain additional information). The rows are ordered alphabetically by species binomial. Variables:
- Binomial – binomial species name following VertLife (Upham et al. 2019)
- Order - taxonomic order
- Family - taxonomic family
- Subfamily - taxonomic subfamily where such designation exists
- Common Name - common name for species following Animal Diversity Web (University of Michigan, 2020)
- Cluster - simplified descriptor for cluster assigned in cluster analysis, where cluster number = 9
- Swimming - score for swimming ability (0-4) for each species
- Climbing - score for climbing ability (0-4) for each species
- Digging - score for digging ability (0-4) for each species
- Running - score for running ability (0-4) for each species
- Aerial - score for aerial ability (0-4) for each species
- References - numerical references for scoring, referring to full reference list in 'Reference list' sheet
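Anyone re-using the 'Locomotor scores' sheet may want to check that the five score columns fall within the documented 0-4 range before analysis. The sketch below assumes each row has already been read into a Python dictionary keyed by the column names listed above, and that scores are whole numbers. `check_scores` and the example rows are hypothetical and are not part of the dataset.

```python
MODES = ("Swimming", "Climbing", "Digging", "Running", "Aerial")


def check_scores(row):
    """Return a list of problems found in one row (an empty list means it is valid)."""
    problems = []
    for mode in MODES:
        value = row.get(mode)
        # Assumption: scores are integers; booleans are rejected explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= 4:
            problems.append(f"{mode}: expected an integer from 0 to 4, got {value!r}")
    return problems


# Hypothetical rows for illustration only; the binomial and values are made up.
good_row = {"Binomial": "Examplus illustratus", "Swimming": 4, "Climbing": 0,
            "Digging": 0, "Running": 0, "Aerial": 0}
bad_row = dict(good_row, Running=5)

print(check_scores(good_row))  # []
print(check_scores(bad_row))
```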
'Reference list' sheet
Table containing 250 rows (no. species in dataset) and 3 columns. Variables:
- Citation number - number referring to the references in the 'Locomotor scores' sheet
- Short citation - in the form of an in-text citation
- Full citation - complete reference
'Polychoric PC scores' sheet
Table containing 250 rows (no. species in dataset) and 6 columns. Variables:
- Binomial – binomial species name following VertLife (Upham et al. 2019)
- PC1 to PC5 - principal component scores arising from transforming the scoring matrix via polychoric correlation and PCA
'Interobserver scoring matrices' sheet
A table containing 240 rows (30 species scored by 8 participants) and 9 columns (5 of which are scoring data). Variables:
- Binomial – binomial species name following VertLife (Upham et al. 2019)
- Participant - numerical ID for each participant
- Swimming, Climbing, Digging, Running and Aerial - ability scores (0-4) for each species for each participant
- k9 - numerical ID of cluster assignment when k = 9
- Designation - categorical descriptor for cluster assignment
'Interobserver Medians ± QDs' sheet
A table containing 30 rows (species subset for repeatability analyses) and 6 columns. Variables:
- Binomial – binomial species name following VertLife (Upham et al. 2019)
- Swimming, Climbing, Digging, Running and Aerial - median score ± quartile deviation (QD) across the 8 participants. Cells are coloured according to QD value, with a key provided.
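For readers unfamiliar with the quartile deviation, it is half the interquartile range, QD = (Q3 - Q1) / 2. The Python sketch below computes a median and QD for one species and one locomotor mode across eight participants. The quartile convention used here (the `"inclusive"` interpolation method in Python's `statistics` module) is an assumption, as is the function name `median_qd`; the authors' R code in File 1 may use a different convention.

```python
from statistics import median, quantiles


def median_qd(scores):
    """Return (median, quartile deviation) for a list of scores."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), (q3 - q1) / 2


# Hypothetical scores from 8 participants for one species and one mode.
example = [2, 3, 3, 3, 3, 4, 4, 4]
print(median_qd(example))  # (3.0, 0.5)
```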
'Participant information' sheet
A table containing 8 rows (no. participants) and 13 columns. Variables:
- Participant - numerical ID for each participant
- Background - indicates if the participant's background includes the study of locomotion behaviour
- Career stage - indicates if the participant is an early career researcher (ECR) or principal investigator (PI)
- Cladistic XP - indicates if the participant has previous experience of cladistic scoring. 1 = yes, 0 = no.
- Mamm Spp - indicates if participants used articles from the journal Mammalian Species as reference for scoring. 1 = yes, 0 = no.
- Books - indicates if participants used books as reference for scoring. 1 = yes, 0 = no.
- Papers - indicates if participants used published papers as reference for scoring. 1 = yes, 0 = no.
- Google - indicates if participants used Google search as reference for scoring. 1 = yes, 0 = no.
- YouTube - indicates if participants used YouTube videos as reference for scoring. 1 = yes, 0 = no.
- Personal XP - indicates if participants used personal experience and knowledge as reference for scoring. 1 = yes, 0 = no.
- Other sources - indicates if participants used other, unspecified sources as reference for scoring. 1 = yes, 0 = no.
- Source diversity - indicates the number of source types utilised by participants
- Time (minutes) - time taken by participant to complete scoring of the 30 species in minutes
'P1-P8 results' sheet
Table containing 72 rows (9 clusters for each of 8 participants) and 14 columns. Variables:
- Participant - numerical identity of participant, 1 to 8
- Cluster - numerical ID of cluster where k = 9
- N - number of species assigned to each cluster
- Swim.median - median swimming score within each cluster
- Swim.QD - quartile deviation (QD) in swimming score in each cluster
- Climb.median - median climbing score within each cluster
- Climb.QD - quartile deviation (QD) in climbing score in each cluster
- Dig.median - median digging score within each cluster
- Dig.QD - quartile deviation (QD) in digging score in each cluster
- Run.median - median running score within each cluster
- Run.QD - quartile deviation (QD) in running score in each cluster
- Aer.median - median aerial score within each cluster
- Aer.QD - quartile deviation (QD) in aerial score in each cluster
- Cluster descriptor - short descriptor of cluster
'Interobserver LMs' sheet
12 results tables for linear models. These models assess the influence of 12 variables provided in 'Participant information' on:
- mean MSE - the mean of the pairwise mean squared errors (MSEs). For each pair of participants, we subtracted one participant's scores from the other's, squared the differences, and averaged them to give a pairwise interobserver MSE. Each participant's pairwise MSEs were then averaged to give a general mean for that participant, the mean MSE. This is provided for each level of the independent variable, e.g. within the career stage model there is a mean MSE for ECRs and a mean MSE for PIs.
- cluster diffs. - mean number of species which differ in cluster assignment between participants. This is provided for each level of the independent variable.
Each table provides:
- R squared - value of R squared (coefficient of determination) for mean MSE and cluster diffs. in each model
- F - value of F statistic (comparing variances) for mean MSE and cluster diffs. in each model
- p - p-value for mean MSE and cluster diffs. in each model
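The two response variables described above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' R implementation. It assumes each participant's scores are flattened into one list in the same species-by-mode order, and that cluster assignments are compared by their categorical designation. The function names and toy values are hypothetical.

```python
from statistics import mean


def pairwise_mse(a, b):
    """Mean squared difference between two participants' scores."""
    return mean((x - y) ** 2 for x, y in zip(a, b))


def mean_mse(scores):
    """For each participant, average their pairwise MSEs against all others."""
    return {
        p: mean(pairwise_mse(own, other) for q, other in scores.items() if q != p)
        for p, own in scores.items()
    }


def cluster_diffs(designations_a, designations_b):
    """Number of species whose cluster designation differs between two participants."""
    return sum(a != b for a, b in zip(designations_a, designations_b))


# Hypothetical flattened scores for three participants (not from the dataset).
toy_scores = {1: [0, 1, 2, 3], 2: [0, 1, 2, 4], 3: [1, 1, 2, 3]}
print(mean_mse(toy_scores))  # {1: 0.25, 2: 0.375, 3: 0.375}

# Hypothetical designations for three species from two participants.
print(cluster_diffs(["A", "B", "C"], ["A", "C", "C"]))  # 1
```

In the workbook, cluster diffs. is reported as a mean across participants; averaging `cluster_diffs` over each participant's pairings, in the same way as `mean_mse`, would be one way to obtain such a figure.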
File 3: Anderson_Labelled_Dendrogram_ESM3.png (Zenodo, Supplemental Information)
A .png image of the dendrogram produced via Ward's hierarchical clustering, with tips labelled with species binomials.
File 4: Anderson_Interobserver_repeatability_analyses_ESM4.docx (Zenodo, Supplemental Information)
A Word document containing a full description of interobserver repeatability analyses carried out as part of this study, referencing data and results available in File 2: Anderson_Additional_Data_and_Results_ESM2.xlsx
SHARING/ACCESS INFORMATION
These data are used in the associated manuscript.
CODE/SOFTWARE
All analyses using these data were carried out in the R statistical environment v. 4.2.2 (R Core Team, 2024). The code is provided here in File 1 and is annotated throughout.
Human subjects data
Explicit consent from all human participants was obtained to publish de-identified data. Information regarding career stage and relevant educational background are included in these data, but all participants are referred to across all materials by a number for anonymity.
1. Sample
A total of 250 extant mammal species were included in this study. The species were selected to represent one species per subfamily, or family where such divisions have not been defined. Decisions regarding which particular species within their subfamilies would be scored were largely based on the availability of locomotor information within the literature to avoid introducing missing data into the scoring matrix.
In the case of Chiroptera, not all subfamilies are included, despite the presence of many subfamilies within the order, because the Chiroptera broadly represent a single behavioural repertoire which would be overrepresented if all subfamilies were included. Similarly, not all rodent subfamilies were included, as this would risk overrepresenting this very large order. Note, however, that the reduced coverage still represents the whole range of locomotor variation within Rodentia.
2. Scoring
Each species was given a score from 0-4 in each of five locomotor modes: swimming, climbing, digging, running, and aerial movement. A score of 0 represents an animal being physically unable to engage in a behaviour (e.g. a whale cannot run), while a score of 4 represents an animal being able to perform this behaviour highly proficiently (e.g. whales are extremely good swimmers).
The criteria for scoring are given in Table 1 of the associated manuscript; these were determined prior to scoring and based on function rather than frequency (i.e., an animal’s ability to perform a behaviour informs its score in the column, not the frequency of use of that behaviour). Strict adherence to these criteria is essential, and thus the wording of the criteria is intended to give as little room for ambiguity as possible in order to reduce subjectivity, and example taxa are provided to further clarify and differentiate the scoring levels.
Scoring was based on information from the literature, with the full reference list provided in File 2.
- Anderson, Sophia; Cox, Philip; Fitton, Laura et al. (2025). Data from: A novel approach to quantifying mammal locomotor repertoires using scoring and cluster analysis. Zenodo. https://doi.org/10.5281/zenodo.11106346
- Anderson, Sophia; Cox, Philip; Fitton, Laura et al. (2025). Data from: A novel approach to quantifying mammal locomotor repertoires using scoring and cluster analysis. Zenodo. https://doi.org/10.5281/zenodo.11106345
