Supplementary material: Machine learning and phylogenetic models identify predictors of genetic variation in Neotropical amphibians
Data files
Jan 08, 2024 version files 26.32 MB
Abstract
Aim: Intraspecific genetic variation is key for adaptation and survival in changing environments and is known to be influenced by many factors, including population size, dispersal, and life history traits. We investigated genetic variation within Neotropical amphibian species to provide insights into how natural history traits, phylogenetic relatedness, climatic, and geographic characteristics can explain intraspecific genetic diversity.
Location: Neotropics.
Taxon: Amphibians.
Methods: We assembled datasets using open-access databases for natural history traits, genetic sequences, phylogenetic trees, climatic, and geographic data. For each species, we calculated overall nucleotide diversity (π) and tested for isolation by distance (IBD) and isolation by environment (IBE). We then identified predictors of π, IBD, and IBE using Random Forest (RF) regression or RF classification. We also fitted phylogenetic generalized linear mixed models (PGLMMs) to predict π, IBD, and IBE.
Results: We compiled 4,052 mitochondrial DNA sequences from 256 amphibian species (230 frogs and 26 salamanders), georeferencing 2,477 sequences from 176 species that were not linked to occurrence data. RF regressions and PGLMMs were congruent in identifying range size and precipitation (σ) as the most important predictors of π, influencing it positively. RF classification and PGLMMs identified minimum elevation as an important predictor of IBD; most species without IBD tended to occur at higher elevations. Maximum latitude and precipitation (σ) were the best predictors of IBE, and most species without IBE occur at lower latitudes and in areas with more variable precipitation.
Main conclusions: This study identified predictors of genetic variation in Neotropical amphibians using both machine learning and phylogenetic methods. This approach was valuable to determine which predictors were congruent between methods. We found that species with small ranges or living in zones with less variable precipitation tended to have low genetic diversity. We also showed that Western Mesoamerica, Andes, and Atlantic Forest biogeographic units harbor high diversity across many species that should be prioritized for protection. These results could play a key role in the development of conservation strategies for Neotropical amphibians.
README: Machine learning and phylogenetic models identify predictors of genetic variation in Neotropical amphibians - Supplementary Material
This dataset contains all data files, including genetic sequences, input text files, supplementary tables, and R scripts, used in the analyses of the determinants of intraspecific amphibian genetic diversity in the Neotropics. We repurposed these data for 256 Neotropical amphibian species (frogs and salamanders) directly from open-access databases, for example, DNA sequences and occurrences from phylogatR and natural history traits from AmphiBIO. We also generated several tables and figures from the analyses that did not fit in the main manuscript; they are included here as supplementary material.
Description of the data
MS_Fastas_Neotropic_species.tar.xz
This folder contains the CytB sequence alignments of 256 species in FASTA format.
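For readers who want to inspect an alignment outside of R, the following is a minimal Python sketch of reading FASTA-formatted text into a dictionary of sequences. It is an illustration only, not part of the authors' scripts; the function name `parse_fasta` and the example record names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {record name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines may wrap; collect them until the next header.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Hypothetical two-record alignment, with the first sequence wrapped.
example = """>sample_1
ACGTAC
GTAC
>sample_2
ACGTACGTAA
"""
alignment = parse_fasta(example)
print(alignment)
```

Because sequences in an alignment share the same length, a quick sanity check after parsing is to confirm that all values in the dictionary have equal length.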
MS_Occurrences_Neotropic_species.zip
This folder contains the geographic coordinate information associated with each sequence in the alignments of 256 species.
Table_S1.csv
This table is the main dataset used in the analyses. It contains information on all variables (dependent and independent) for each of the 256 species.
neotropical_tree.tre
Phylogenetic tree in Newick format used in the phylogenetic analyses.
MMRR_results
Results of the Multiple Matrix Regression with Randomization (MMRR) for each of the 256 species. MMRR provides a method for quantifying and disentangling the relative effects of isolation by distance (IBD) and isolation by environment (IBE).
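To illustrate the idea behind MMRR, the following is a simplified Python sketch, not the authors' MMRR.R script. It unfolds each distance matrix into a vector, standardizes it, regresses genetic distance on the predictor distances, and assesses significance by permuting the rows and columns of the response matrix together. As a simplifying assumption, the permutation test here compares absolute standardized coefficients rather than t-statistics, and all function names and toy matrices are hypothetical.

```python
import random


def lower_triangle(matrix):
    """Unfold the strict lower triangle of a square matrix into a flat list."""
    return [matrix[i][j] for i in range(1, len(matrix)) for j in range(i)]


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - mean) / sd for v in values]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - rest) / aug[r][r]
    return x


def ols(y, xs):
    """Least-squares slopes; no intercept is needed for standardized data."""
    k = len(xs)
    xtx = [[sum(p * q for p, q in zip(xs[i], xs[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * v for p, v in zip(xs[i], y)) for i in range(k)]
    return solve(xtx, xty)


def mmrr(response, predictors, n_perm=999, seed=0):
    """Return standardized coefficients and permutation p-values."""
    rng = random.Random(seed)
    xs = [standardize(lower_triangle(m)) for m in predictors]
    observed = ols(standardize(lower_triangle(response)), xs)
    exceed = [1] * len(xs)  # the observed value counts as one permutation
    n = len(response)
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of the response matrix simultaneously.
        shuffled = [[response[i][j] for j in order] for i in order]
        permuted = ols(standardize(lower_triangle(shuffled)), xs)
        for k, (obs, perm) in enumerate(zip(observed, permuted)):
            if abs(perm) >= abs(obs):
                exceed[k] += 1
    return observed, [e / (n_perm + 1) for e in exceed]


# Toy data: genetic distance depends only on geographic distance.
positions = [0, 1, 3, 6, 10]
climate = [5, 1, 4, 2, 8]
geo = [[abs(a - b) for b in positions] for a in positions]
env = [[abs(a - b) for b in climate] for a in climate]
gen = [[3 * d for d in row] for row in geo]
coefs, p_values = mmrr(gen, [geo, env], n_perm=99)
print(coefs, p_values)
```

In this toy example the coefficient for geographic distance is close to 1 and the coefficient for environmental distance is close to 0, which corresponds to a pattern of IBD without IBE.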
Appendix_S1-S4.pdf
Supporting information indicated in the main manuscript as appendices.
Appendix S1.
Literature used to retrieve occurrences associated with DNA sequences and geographical data.
Appendix S2.
Species not included in the phylogenetic analyses.
Appendix S3.
Supplementary tables.
Table S1. The data set used in this study. Nucleotide diversity (π), Isolation by Distance (IBD), and Isolation by Environment (IBE) were used as dependent variables in the analyses. Lar=Larvae, Dir=Direct, DD=Data Deficient, LC=Least Concern, NT=Near Threatened, VU=Vulnerable, EN=Endangered, CR=Critically Endangered, EX=Extinct, NE=Not Evaluated. * This table can be found in the Dryad online repository.
Table S2. Different combinations of predictors used in MCMCglmm models for nucleotide diversity (π), Isolation by Distance (IBD), and Isolation by Environment (IBE).
Table S3. Species complexes or cryptic species analyzed in this study.
Appendix S4.
Supplementary figures.
Figure S1. Number of observations/sequences per grid cell (a,c), and amphibian genetic diversity patterns in the Neotropics (b,d). The map uses equal-area grid cells of 150 (a,b) and 250 km (c,d).
Figure S2. Basic histogram showing the number of species associated with the number of CytB sequences.
Figure S3. Map of occurrences associated with mtDNA sequences from Neotropical frogs and salamanders.
Figure S4. Variance inflation factor (VIF) for each continuous predictor variable used in this study.
Figure S5. Nucleotide diversity (π) map of Neotropical amphibians using the mean π value per species. The map uses equal-area grid cells of 350 km.
Figure S6. Isolation-by-Distance (IBD) patterns of amphibians by Neotropical region.
Figure S7. Isolation-by-Environment (IBE) patterns of amphibians by Neotropical region.
Figure S8. Random forest results for all amphibians using reduced datasets showing the top predictors of nucleotide diversity: a) at least 10 sequences per species, b) at least 400 bp in each species alignment. Variables marked with an asterisk (*) are not biologically motivated and reflect bias in data acquisition.
Figure S9. Random forest results for all amphibians using reduced datasets showing the top predictors of Isolation by Distance (IBD): a) at least 10 sequences per species, b) at least 400 bp in each species alignment. Variables marked with an asterisk (*) are not biologically motivated and reflect bias in data acquisition.
Figure S10. Random forest results for all amphibians using reduced datasets showing the top predictors of Isolation by Environment (IBE): a) at least 10 sequences per species, b) at least 400 bp in each species alignment. Variables marked with an asterisk (*) are not biologically motivated and reflect bias in data acquisition.
Figure S11. Phylogenetic signal for nucleotide diversity (π), Isolation by Distance (IBD), and Isolation by Environment (IBE). A) Likelihood surface of the phylogenetic signal based on Pagel’s lambda (λ) for π. A λ value = 0.768 indicates a correlation between species close to a Brownian motion model of evolution. B) Blomberg’s K statistic compared to a null distribution of K obtained via randomization. A K value = 0.151 indicates that the variance of π is high within clades. C-D) Phylogenetic signal D for a binary variable. A value close to 1 (D IBD = 1.01; D IBE = 1.05) indicates that IBD and IBE are randomly shuffled relative to the tips of the phylogeny (phylogenetic randomness).
Figure S12. Genetic diversity (π) of Neotropical amphibians across IUCN conservation categories for A) Anura (salmon) and Caudata (turquoise) combined. Each box-whisker plot indicates the median (bold lines), the interquartile range (boxes), white dots represent an observation of π, and black-white dots represent outliers. A single bold line with a white dot represents a single observation. DD=Data Deficient, LC=Least Concern, NT=Near Threatened, VU=Vulnerable, EN=Endangered, CR=Critically Endangered, EX=Extinct, NE=Not Evaluated.
Figure S13. Nucleotide diversity is high in latitudes near the equator for Neotropical amphibians following the latitudinal gradient of species richness.
Figure S14. Phylogeny of Neotropical frogs and salamanders, with nucleotide diversity (π) traced across branches. The phylogeny is based on CytB DNA sequence data for 241 species. Nucleotide diversity box plots per taxonomic family are indicated to the right.
R_scripts.zip
This folder contains R scripts used in the manuscript.
- distances.R: Script to calculate genetic, topographic, and environmental distances needed for MMRR analyses.
- mapping.R: Script to map number of observations and nucleotide diversity.
- MMRR.R: Script to perform Multiple Matrix Regression with Randomization analysis.
- nucleotide_diversity.R: Script to calculate nucleotide diversity for all 256 species.
- phylosignal.R: Script to calculate different phylogenetic signal measurements.
- vif_multicollinearity_test.R: Script to calculate the Variance Inflation Factor (VIF).
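As an illustration of what nucleotide diversity (π) measures, the following is a minimal Python sketch, not a translation of nucleotide_diversity.R. It computes π as the mean proportion of differing sites over all pairs of aligned sequences. The function names are hypothetical, and skipping sites with gaps or ambiguous bases in either sequence (pairwise deletion) is an assumption of this sketch rather than a documented choice of the authors.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    Returns None when no site can be compared.
    """
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        return None
    return differences / compared


def nucleotide_diversity(sequences):
    """Mean pairwise p-distance across all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    distances = []
    for seq_a, seq_b in combinations(sequences, 2):
        distance = p_distance(seq_a, seq_b)
        if distance is not None:
            distances.append(distance)
    if not distances:
        raise ValueError("no comparable sites between any pair of sequences")
    return sum(distances) / len(distances)


# Hypothetical alignment of three sequences, 8 sites each.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGTTCGA"]
pi = nucleotide_diversity(toy_alignment)
print(round(pi, 4))
```

In the toy alignment the three pairs differ at 1, 2, and 1 of 8 sites, so π is the mean of 0.125, 0.25, and 0.125.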
Tutorial_RetrievingCoords_AssociatedDNAseqs.pdf
Tutorial describing how we retrieved geographic coordinates associated with DNA sequences when the coordinates are not available in GenBank.
Sharing/Access information
Some data (e.g., sequences, geographic coordinates, etc.) were derived from the following sources:
- https://phylogatr.org/
- AmphiBIO, a global database for amphibian ecological traits. https://www.nature.com/articles/sdata2017123
- https://amphibiansoftheworld.amnh.org/
- https://amphibiaweb.org/index.html