Does ecology predict taxonomy? How ecological differentiation can be used to spatially infer intra-specific diversity
Data files
Nov 12, 2025 version, 11.22 MB in total
- ClusterStabilityFunction.R (15.38 KB)
- OccDataAnalysisWithVars.RData (11.20 MB)
- README.md (801 B)
Abstract
Assessing the true dimension of biodiversity is a major challenge. Many species harbor hidden diversity that is now being uncovered using molecular data. However, population genetic studies tend to be resource-intensive and difficult to apply to a broad range of taxa, limiting scalability. Moreover, the growing shortage of trained taxonomists makes it difficult to rely on comparative morphological studies to assess divergence and speciation processes for the vast majority of species-rich taxonomic groups. Here, we explore the usefulness of the “ecological speciation” concept and examine how these hidden lineages tend to occupy distinct environmental niches that can be used to identify natural groups in geographical space. Using 298 species complexes, we assess the accuracy of five clustering methods and Random Forest models in correctly classifying the occurrences of the different subspecies based only on environmental data. For the best-performing clustering method (Gaussian Mixture Models), we found that species can be classified better than random from their environmental profile alone, with a median Adjusted Rand Index of 0.37. Random Forest, on the other hand, showed high accuracy for most of the species (>0.75). We believe accuracy could be further improved by using species-specific climatic variables, although this study focuses on a widely applicable method. Our goal is to demonstrate that clustering methods can be used on a large scale to reveal the true diversity hidden within taxonomic complexes, thereby reducing the time and budget required for exploratory analysis. We also aim to demonstrate the extent to which different taxa are determined and delimited by the environment.
Dataset DOI: 10.5061/dryad.6q573n6c2
Description of the data and file structure
The script ClusterStabilityFunction.R clusters the environmental data extracted for each occurrence point (see OccDataAnalysisWithVars.RData) and checks how well the resulting clusters correspond to recognized taxonomic units. The code is designed to perform 500 replicates and is intended to run in parallel.
taxonKey: Identifier for each unique group of taxa, used to differentiate between subspecies.
speciesKey: Identifier at the species level, which can include multiple subspecies (taxonKey).
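To illustrate the workflow described above, here is a minimal sketch in R. It is not the code in ClusterStabilityFunction.R. It assumes a simulated occurrence table with the documented speciesKey and taxonKey columns plus two hypothetical environmental variables (temp, precip). It uses base R k-means as a stand-in clustering method (a Gaussian mixture model, for instance via the mclust package, could be substituted), and it assumes that each replicate is a bootstrap resample of the occurrences. The helper names adjusted_rand_index, score_complex and one_replicate are illustrative only.

```r
library(parallel)  # distributed with base R

# Adjusted Rand Index between two label vectors (1 = identical partitions,
# values near 0 = agreement expected by chance). Returns NaN when both
# partitions consist of a single group.
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(sum(tab), 2)
  maximum <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Simulated stand-in for the occurrence table (assumption): one species
# complex with two subspecies that differ in two environmental variables.
set.seed(1)
occ <- data.frame(
  speciesKey = 1L,
  taxonKey = rep(c(11L, 12L), each = 100),
  temp = c(rnorm(100, 10, 1), rnorm(100, 20, 1)),
  precip = c(rnorm(100, 500, 50), rnorm(100, 1200, 50))
)

# Cluster one species complex on scaled environmental variables, using as
# many clusters as there are subspecies, and score the result against taxonKey.
score_complex <- function(d, env_cols) {
  k <- length(unique(d$taxonKey))
  km <- kmeans(scale(d[, env_cols]), centers = k, nstart = 10)
  adjusted_rand_index(km$cluster, d$taxonKey)
}

# One replicate (assumed scheme): resample occurrences with replacement, then score.
one_replicate <- function(i, d, env_cols) {
  score_complex(d[sample(nrow(d), replace = TRUE), ], env_cols)
}

env_cols <- c("temp", "precip")
n_rep <- 20  # the description mentions 500 replicates; kept small here
# mclapply relies on forking, which is unavailable on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

ari_reps <- unlist(mclapply(seq_len(n_rep), one_replicate,
                            d = occ, env_cols = env_cols, mc.cores = n_cores))

# Score every species complex once on the full data.
ari_by_complex <- sapply(split(occ, occ$speciesKey), score_complex,
                         env_cols = env_cols)

median(ari_reps)
```

Splitting by speciesKey keeps each species complex separate, so taxonKey labels are only ever compared within their own complex. With real data, the spread of ari_reps across replicates gives a rough indication of how stable the correspondence between clusters and subspecies is.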
Code/software
The script is written for the R programming language.
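The names of the objects stored in OccDataAnalysisWithVars.RData are not listed here, so a cautious first step is to inspect the file before running anything. The sketch below loads the file into a separate environment and reports each object's class. The helper name list_rdata_objects is hypothetical, and the file is assumed to sit in the working directory.

```r
# List the objects stored in an .RData file, with their classes, without
# adding anything to the global workspace.
list_rdata_objects <- function(path) {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the restored object names
  vapply(loaded, function(nm) class(get(nm, envir = env))[1], character(1))
}

rdata_path <- "OccDataAnalysisWithVars.RData"  # assumed location
if (file.exists(rdata_path)) {
  print(list_rdata_objects(rdata_path))
}
```

Loading into a dedicated environment avoids silently overwriting objects that already exist in the session.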
