Predicting amphibian intraspecific diversity with machine learning: Challenges and prospects for integrating traits, geography, and genetic data
Data files
Nov 12, 2020 version (65.44 MB total)
- extract_and_assemble_data.zip (28.97 KB)
- final_datasets_analyses.zip (4.77 MB)
- NorthAmericaSpecies_Revised.zip (3.66 MB)
- pi_subsampling_cytb.zip (18.47 MB)
- README.docx (16.86 KB)
- revised_cytb_alignments.zip (337.26 KB)
- SDM_files.zip (38.16 MB)
Abstract
The growing availability of genetic datasets, in combination with machine learning frameworks, offers great potential to answer long-standing questions in ecology and evolution. One such question has intrigued population geneticists, biogeographers, and conservation biologists: What factors determine intraspecific genetic diversity? This question is challenging to answer because many factors may influence genetic variation, including life history traits, historical influences, and geography, and the relative importance of these factors varies across taxonomic and geographic scales. Furthermore, interpreting the influence of numerous, potentially correlated variables is difficult with traditional statistical approaches. To address these challenges, we analyzed repurposed data using machine learning and investigated predictors of genetic diversity, focusing on Nearctic amphibians as a case study. We aggregated species traits, range characteristics, and >42,000 genetic sequences for 299 species using open-access scripts and various databases. After identifying important predictors of nucleotide diversity with random forest regression, we conducted follow-up analyses to examine the roles of phylogenetic history, geography, and demographic processes in shaping intraspecific diversity. Although life history traits were not important predictors for this dataset, we found significant phylogenetic signal in genetic diversity within amphibians. We also found that salamander species at northern latitudes contain lower genetic diversity. Data repurposing and machine learning provide valuable tools for detecting patterns with relevance for conservation, but concerted efforts are needed to compile meaningful datasets with greater utility for understanding global biodiversity.
Methods
Data were compiled from open-access databases and processed using a series of Python and R scripts (included with the Dryad package).
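For orientation, the response variable named in the abstract, nucleotide diversity, is conventionally the average proportion of differing sites across all pairs of aligned sequences. The following is a minimal sketch of that standard definition, not the scripts shipped in this package: the function name `nucleotide_diversity` is hypothetical, and skipping gaps and ambiguous bases pair by pair (pairwise deletion) is an assumption that may differ from how the included scripts treat the cytb alignments.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise proportion of differing sites among aligned sequences.

    Hypothetical helper for illustration. Sites where either sequence has a
    gap or an ambiguous base are skipped for that pair (an assumption).
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")

    valid = set("ACGT")
    total = 0.0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        compared = 0
        diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            # Count only sites with an unambiguous base in both sequences.
            if x in valid and y in valid:
                compared += 1
                if x != y:
                    diffs += 1
        if compared:
            total += diffs / compared
            n_pairs += 1
    if n_pairs == 0:
        raise ValueError("no comparable sites between any pair")
    return total / n_pairs


if __name__ == "__main__":
    # Toy alignment, not data from this package.
    toy_alignment = ["ACGT", "ACGA", "ACGT"]
    print(nucleotide_diversity(toy_alignment))
```

In the toy alignment, two of the three pairs differ at one of four sites, so the result is one sixth. Real alignments would first be read from FASTA files, a step omitted here.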
Usage notes
A README file is provided to orient users to the included files. The final datasets analyzed are provided in a single directory with commented R scripts that can be opened and run in RStudio.