Predicting amphibian intraspecific diversity with machine learning: Challenges and prospects for integrating traits, geography, and genetic data

Citation

Barrow, Lisa (2020), Predicting amphibian intraspecific diversity with machine learning: Challenges and prospects for integrating traits, geography, and genetic data, Dryad, Dataset, https://doi.org/10.5061/dryad.0cfxpnvzh

Abstract

The growing availability of genetic datasets, in combination with machine learning frameworks, offers great potential to answer long-standing questions in ecology and evolution. One such question has intrigued population geneticists, biogeographers, and conservation biologists: What factors determine intraspecific genetic diversity? This question is challenging to answer because many factors may influence genetic variation, including life history traits, historical influences, and geography, and the relative importance of these factors varies across taxonomic and geographic scales. Furthermore, interpreting the influence of numerous, potentially correlated variables is difficult with traditional statistical approaches. To address these challenges, we analyzed repurposed data using machine learning and investigated predictors of genetic diversity, focusing on Nearctic amphibians as a case study. We aggregated species traits, range characteristics, and >42,000 genetic sequences for 299 species using open-access scripts and various databases. After identifying important predictors of nucleotide diversity with random forest regression, we conducted follow-up analyses to examine the roles of phylogenetic history, geography, and demographic processes in shaping intraspecific diversity. Although life history traits were not important predictors for this dataset, we found significant phylogenetic signal in genetic diversity within amphibians. We also found that salamander species at northern latitudes contain lower genetic diversity. Data repurposing and machine learning provide valuable tools for detecting patterns with relevance for conservation, but concerted efforts are needed to compile meaningful datasets with greater utility for understanding global biodiversity.
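
For readers unfamiliar with the response variable, nucleotide diversity is commonly defined as the average proportion of sites that differ between pairs of aligned sequences. The following is a minimal Python sketch of that textbook definition, not the code distributed with this dataset; the function name `nucleotide_diversity` and the choice to skip sites containing gaps or ambiguous bases are assumptions made for illustration.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise proportion of differing sites in an alignment.

    Hypothetical helper for illustration only. Assumes `seqs` is a list of
    equal-length aligned DNA strings. Sites where either sequence has a
    character other than A, C, G, or T are skipped for that pair.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    total = 0.0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        compared = 0
        diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in "ACGT" and y in "ACGT":
                compared += 1
                if x != y:
                    diffs += 1
        # Pairs with no comparable sites contribute nothing to the average.
        if compared:
            total += diffs / compared
            n_pairs += 1

    if n_pairs == 0:
        raise ValueError("no comparable sites between any pair of sequences")
    return total / n_pairs


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "ACGT"]
    print(nucleotide_diversity(alignment))
```

In this toy alignment, two of the three sequence pairs differ at one of four sites, so the value is (0.25 + 0 + 0.25) / 3. The scripts in the data package may handle missing data or sample-size corrections differently.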

Methods

Data were compiled from open-access databases and processed using a series of Python and R scripts (included with the Dryad package).

Usage Notes

A README file is provided to orient users to the included files. The final datasets analyzed are provided in a single directory, along with commented R scripts that can be opened and run in RStudio.

Funding

National Science Foundation, Award: DBI 1910623

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Award: process #88881.170016/2018