Skip to main content
Dryad

Data from: Integrating genomic data and simulations to evaluate alternative species distribution models and improve predictions of glacial refugia and future responses to climate change

Cite this dataset

Robinson, John et al. (2024). Data from: Integrating genomic data and simulations to evaluate alternative species distribution models and improve predictions of glacial refugia and future responses to climate change [Dataset]. Dryad. https://doi.org/10.5061/dryad.s7h44j1fx

Abstract

Climate change poses a threat to biodiversity, and it is unclear whether species can adapt to or tolerate new conditions, or migrate to areas with suitable habitats. Reconstructions of range shifts that occurred in response to environmental changes since the last glacial maximum from species distribution models (SDMs) can provide useful data to inform conservation efforts. However, different SDM algorithms and climate reconstructions often produce contrasting patterns, and validation methods typically focus on accuracy in recreating current distributions, limiting their relevance for assessing predictions to the past or future. We modeled historically suitable habitat for the threatened North American tree green ash (Fraxinus pennsylvanica) using 24 SDMs built using two climate models, three calibration regions, and four modeling algorithms. We evaluated the SDMs using contemporary data with spatial block cross-validation and compared the relative support for alternative models using a novel integrative method based on coupled demographic-genetic simulations. We simulated genomic datasets using habitat suitability of each of the 24 SDMs in a spatially-explicit model. Approximate Bayesian Computation (ABC) was then used to evaluate the support for alternative SDMs through comparisons to an empirical population genomic dataset. Models had very similar performance when assessed with contemporary occurrences using spatial cross-validation, but ABC model selection analyses consistently supported SDMs based on the CCSM climate model, an intermediate calibration extent, and the generalized linear modeling algorithm. Finally, we projected the future range of green ash under four climate change scenarios. Future projections using the SDMs selected via ABC suggest only minor shifts in suitable habitat for this species, while some of those that were rejected predicted dramatic changes. Our results highlight the different inferences that may result from the application of alternative distribution modeling algorithms and provide a novel approach for selecting among a set of competing SDMs with independent data.

README: Integrating genomic data and simulations to evaluate alternative species distribution models and improve predictions of glacial refugia and future responses to climate change

https://doi.org/10.5061/dryad.s7h44j1fx

This repository contains analysis scripts and input data files associated with the integrated distributional, demographic, and coalescent (iDDC; He et al. 2013) modeling analysis conducted to compare alternative species distribution models (SDMs) for green ash (Fraxinus pennslvanica) in Naugthin et al. (2024) Ecography.

Description of the data and file structure

Simulations for the project were conducted using the R script hSC_Ash_with_enms.R.  This script uses the R package holoSimCell (available from https://github.com/stranda/holoSimCell to simulate post-glacial expansion of green ash in eastern North America.  We also include the singularity (now apptainer, https://apptainer.org) container used for these simulations (holosim.simg) in the repository.  The most recent version of the holosim container can also be pulled from https://hub.docker.com/r/astrand/holosim.

Outputs of the simulations under 24 different SDMs are saved in the reference table: ENMcomparison_RT_50k_subset.csv, which is provided in zipped format in ENMcomparison_RT_50k_subset.zip.  This file contains simulation parameters, measures of biotic velocity (see Castilla et al. 2024, https://doi.org/10.1111/jbi.14754) and a total of 473 summary statistics retained for each simulation replicate.  Each of the 24 SDMs were simulated 50,000 times for ABC analyses.  We also include a .csv file with observed values for these same summary statistics: Ash_obs_subset_2Oct20.csv.  The observed summary statistic file is shared with Castilla et al. 2024, but the reference table has been expanded considerably for this project.  Descriptions of columns in ENMcomparison_RT_50k_subset.csv and Ash_obs_subset_2Oct20.csv can be found below.

Scripts to perform ABC model selection and cross validation analyses presented in our manuscript are included in ABC.zip (see Code/Software below for descriptions).  ABC analyses can be repeated using these scripts and the two .csv files described above.

This repository also contains R scripts used to define the study region for our analysis (in study_region.zip), fit correlative species distribution (or ecological niche) models, and predict habitat suitability in 2080 under each of the models considered in this project (both in the code/ subdirectory in enms.zip).  Hindcasted and forecasted SDMs are also provided in raster format in the predictions/and predictions_future/ subdirectories in enms.zip (see Code/Software below for more information).

Columns of ENMcomparison_RT_50k_subset.csv

The first 220 columns of the reference table provide parameter values used in each simulation and measures of biotic velocity recorded during the associated forward demographic simulation.

  1. date - the date and time the simulation was completed, unused in ABC analyses
  2. node - an index used in filenames produced during simulations.  Prevents overwriting of output files produced by separate jobs on a computing cluster, unused in ABC analyses.
  3. rep - an index for the replicate number of each simulation on each node, unused in ABC analyses.
  4. xdim - the number of cells in a row of the simulation landscape, the x dimension of the simulation, fixed across simulation replicates and unused in ABC analyses.
  5. ydim - the number of cells in a column of the simulation landscape, the y dimension of the simulation, fixed across simulation replicates and unused in ABC analyses.
  6. maxtime - the maximum number of generations that the forward simulation is allowed to run, fixed across simulation replicates and unused in ABC analyses.
  7. K - the maximum per-cell carrying capacity (in individuals) used in the forward simulation, identical to Ne described below.
  8. xsz - the size of a cell in the simulated landscape in the x-dimension (longitude), measured in meters.
  9. ysz - the size of a cell in the simulated landscape in the y-dimension (latitude), measured in meters.
  10. samptime - variable that allows the same habitat suitability layer to be applied to multiple generations during the forward simulation.  Habitat suitability layers are changed every samptime generations, fixed across simulation replicates and unused in ABC analyses.
  11. pois.var - a binary (0/1) indicator specifying whether population size in each cell and time step was drawn from a Poisson distribution to incorporate demographic stochasticity, fixed across simulations and unused in ABC analyses.
  12. shortscale - the scale parameter for the distribution of short-distance dispersal events (k parameter of Weibull distribution), fixed across simulations and unused in ABC analyses.
  13. shortshape - the shape parameter for the distribution of short-distance dispersal events (lambda parameter of Weibull distribution), fixed across simulations and unused in ABC analyses.
  14. sz - DEPRECATED, the size of a square cell in the simulated landscape, replaced with xsz and ysz to allow rectangular cells.
  15. nloci - the number of genetic marker loci included in the observed dataset, fixed across simulations and unused in ABC analyses.
  16. seq_length - the sequence length of the marker in base pairs, used along with mu below in the fastsimcoal DNA model, fixed across simulations and unused in ABC analyses.
  17. mu - the mutation rate (per base pair, per generation) of the simulated locus, used with seq_length above in the fastsimcoal DNA model, fixed across simulations and unused in ABC analyses.
  18. G - the generation time of the species in years, fixed across simulations and unused in ABC analyses.
  19. longmean - the average distance of long-distance dispersal events in units of cell widths, fixed across simulations and unused in ABC analyses.
  20. lambda - the rate of population growth within cells (discrete rate of increase), fixed across simulations and unused in ABC analyses.
  21. mix - the mixture parameter for the distribution of dispersal distances, the proportion of dispersal events that are long-distance.  Randomly drawn from a prior distribution for each simulation replicate.
  22. Ne - the maximum effective population size of a cell in the landscape.  Randomly drawn from a prior distribution for each simulation replicate.
  23. preLGM_t - the time at which refugial populations diverged from one another (e.g., the last interglacial).  Randomly drawn from a prior distribution for each simulation replicate.
  24. preLGM_Ne - the effective population size of the species prior to refuge divergence.  Randomly drawn from a prior distribution for each simulation replicate.
  25. found_Ne - the effective population size of newly colonized populations (used in coalescent simulations only).  Randomly drawn from a prior distribution for each simulation replicate.
  26. ref_Ne - the effective population size of cells included in refugia (scaled by cell-specific habitat suitability in the first time step).  Randomly drawn from a prior distribution for each simulation replicate.
  27. refs - the species distribution model (ENM_1 through ENM_24) used to define habitat suitability across the landscape in each simulation.  Each model is represented 50,000 times in the reference table.
  28. BVprev_<DATE>ybp - the proportion of habitable (non-NA) cells with abundance >0 in the DATE time step relative to all habitable cells. Unitless.
  29. BVmean_<DATE>ybp - mean abundance in DATE time step. In the same units as the values of the cells (individuals).
  30. BVtotal_<DATE>ybp - total abundance across all cells in the simulated landscape at the DATE time step.  In the same units as the values of the cells (individuals).
  31. BVshared1k_<DATE1>to<DATE2>ybp - abundance-weighted centroid biotic velocity between the DATE1 and DATE2 time steps in the simulation, calculated using only cells that are not NA in either of the two time points (to control for changes in available land due to sea level rise).  Velocities are given in meters per year and are always positive (direction does not affect velocity).
  32. BVNQshared1k_<DATE1>to<DATE2>ybp - velocity of the 0.95th quantile weight in the north-south direction between the DATE1 and DATE2 time steps in the simulation, calculated using only cells that are not NA in either of the two time points (to control for changes in available land due to sea level rise).  Velocities are given in meters per year and are positive for northward movement and negative for southward movement.
  33. BVSQshared1k_<DATE1>to<DATE2>ybp - velocity of the 0.05th quantile weight in the north-south direction between the DATE1 and DATE2 time steps in the simulation, calculated using only cells that are not NA in either of the two time points (to control for changes in available land due to sea level rise).  Velocities are given in meters per year and are positive for northward movement and negative for southward movement.
  34. BVall1k_<DATE1>to<DATE2>ybp - abundance-weighted centroid biotic velocity between the DATE1 and DATE2 time steps in the simulation, calculated using all cells.  Velocities are given in meters per year and are always positive (direction does not affect velocity).
  35. BVNQall1k_<DATE1>to<DATE2>ybp - velocity of the 0.95th quantile weight in the north-south direction between the DATE1 and DATE2 time steps in the simulation, calculated using all cells.  Velocities are given in meters per year and are positive for northward movement and negative for southward movement.
  36. BVSQall1k_<DATE1>to<DATE2>ybp - velocity of the 0.05th quantile weight in the north-south direction between the DATE1 and DATE2 time steps in the simulation, calculated using all cells.  Velocities are given in meters per year and are positive for northward movement and negative for southward movement.
  37. BV_21kyr -  abundance-weighted centroid biotic velocity over the entire 21kyr simulation history, calculated using only cells that are not NA in either of the two time points (to control for changes in available land due to sea level rise).  Velocities are given in meters per year and are always positive (direction does not affect velocity).
  38. BVNQ_21kyr - velocity of the 0.95th quantile weight in the north-south direction over the entire 21kyr simulation history. Quantiles are cumulated starting from the south (0.05th quantile).  Velocities are given in meters per year and are positive for northward movement and negative for southward movement.
  39. BVSQ_21kyr - velocity of the 0.05th quantile weight in the north-south direction over the entire 21kyr simulation history. Quantiles are cumulated starting from the south (0.05th quantile).  Velocities are given in meters per year and are positive for northward movement and negative for southward movement.
  40. tot_SNPs - the total number of single nucleotide polymorphisms in the population genetic dataset output from fastsimcoal.  Not used as a summary statistic for ABC analyses, but included to verify that all simulations produce the expected (nloci) number of polymorphic markers.

The remaining 473 columns in the reference table file contain summary statistics calculated from population genetic datasets produced by the coalescent simulation.  The 473 columns in Ash_obs_subset_2Oct20.csv are shared with the remainder of the reference table, but were calculated from empirical data for green ash (Fraxinus pennsylvanica).

  1. Fst_<POP_ID1>.<POP_ID2> - Pairwise Fst (=1-Hs/Ht) between populations (Wright 1949, 1950).  Unitless, ranging from 0 to 1.  210 total statistics.
  2. helat.* - Summaries of a polynomial model (intercept, first, and second coefficients) relating expected heterozygosity to latitude.  3 total statistics.
  3. helong.* - Summaries of a polynomial model (intercept, first, and second coefficients) relating expected heterozygosity to longitude.  3 total statistics
  4. HRi_<POP_ID> - Harpending's raggedness index calculated from the Geographic Spectrum of Shared Alleles (Alvarado-Serrano & Hickerson 2018) for each population.  21 total statistics.
  5. Spca.Dmean_<POP_ID> - The mean inter-individual distance in PCA space among individuals within a population (Alvarado-Serrano & Hickerson 2016), calculated from a spatial PCA analysis.  21 total statistics.
  6. Moran.Beta - Estimate of Moran's I (Moran 1950) measuring spatial autocorrelation in genetic data.  1 total statistic.
  7. Var.* - Summaries of the variogram (Goovaerts 1998) measuring spatial autocorrelation in the genetic data - beta, sill, nugget, and range of the variogram.  4 total statistics.
  8. Psi_<POP_ID1>.<POP_ID2> - Peter & Slatkin's (2013) directionality index between a pair of populations.  210 total statistics.

Sharing/Access information

Code to define the study region and construct correlative species distribution models is shared with that used in Castilla et al. (2024) and also available from https://github.com/TIMBERhub

Observed data, simulation scripts, and the holoSimCell R package are available from https://github.com/stranda/holoSimCell.  

A docker image containing the holoSimCell package, all dependencies, and fastsimcoal v. 2.6 is available from https://hub.docker.com/r/astrand/holosim

Code/Software

We include the R script for delineating the study region, plus the raster masks defining the region.  These files are included in the study_region/ directory in study_region.zip.

  1. study_region/defining_study_region.r - R script for defining the study region based on watershed basins, distribution of Fraxinus pennsylvanica records, pollen cores, and genetic samples.
  2. study_region/study_region_daltonIceMask_lakesMasked_linearIceSheetInterpolation.tif - Multi-layer raster in GeoTIFF format with a mask of the area of interest (generally, eastern portion of North America) from 1 Kybp to 0 bp (1950 CE). The “last” or “lowest” layer represents available land (uncovered by sea and ice) 21 Kybp, and the “top” or “first layer” the present, with one layer per 30 years across this period. Values are 1 (available) and NA (unavailable).
  3. study_region/study_region_resampled_to_genetic_demographic_simulation_resolution.tif - Raster in GeoTIFF format with all cells equal to 1 and in the equal-area spatial resolution used in the genetic/demographic simulations. This is a single layer raster.

We also include R scripts used for calibrating the species distribution models and creating projections of past and future habitat suitability and the outputs of these projections in raster format.  These files are  included in the enms/ directory in enms.zip.

  1. enms/code/enms_for_fraxinus_pennsylvanica.r - Collates specimen data and environmental rasters, constructs data partitions, calibrates and evaluates SDMs, projects models to past, interpolation of rasters to finer timescales, and calculation of biotic velocity. 
  2. enms/code/enms_for_fraxinus_pennsylvanica_projected_to_future.r- Takes models from first script and projects them to future climate scenarios.
  3. enms/predictions/ folder -  Contains rasters in GeoTIFF format with predictions to the past. Each file is a “stack” of layers with predictions, from 21 Kybp (“lowest” or “last” layer) to the “present” (1950, “top” or “first” layer). Original values were in the range [0, 1], but they have been rescaled and rounded to {0, 1, 2, 3, … , 100}. File names are as: <paleo_GCM>_<calibration_extent>kmExtent_<SDM_algorithm>.tif. For example: ecbilt_80kmExtent_brt.tif.
  4. enms/predictions_future/ folder -  Contains rasters in GeoTIFF format with predictions to the future. Original values were in the range [0, 1], but they have been rescaled and rounded to {0, 1, 2, 3, … , 100}. File names are as: <sdm_algorithm>_<paleo_GCM_name>_<calibration_extent>kmExtent_rcp<rcp_number>_<average_year_of_future_period>_<future_GCM_name>.tif. For example: brt_ecbilt_320kmExtent_rcp45_2050_GFDL-CM3.tif.

The apptainer container (holosim.simg) in the repository was used for all simulations in this study and includes the necessary R packages (including holoSimCell) and the coalescent simulation software fastsimcoal v. 2.6.  Coupled demographic-genetic simulations can be run using the hSC_Ash_with_enms.R script.  This containerized version of R and the installed packages can also be used to recreate the SDMs tested.  

We include several R scripts used for ABC analysis.  The following scripts are saved in the ABC/ directory in ABC.zip.  These scripts use the reference table (ENMcomparison_RT_50k_subset.csv) and observed summary statistics (Ash_obs_subset_2Oct20.csv) files described above.

  1. ABC/CheckECDF.R - calculates the location of observed summary statistics within the distribution of statistic values from 50,000 simulations under each SDM.  
  2. ABC/AshENM_modsela_subset_mnlog.R - performs ABC model selection with multinomial logistic regression
  3. ABC/AshENM_modsela_subset_nnet.R - performs ABC model selection with neural networks
  4. ABC/ENMcomparison_RF_pred_subset.R - performs ABC model selection with random forests
  5. ABC/ENMcomparison_CV4MS.R - performs cross validations for model selection using either multinomial logistic regression or neural networks
  6. ABC/ENMcomparison_numreps_modsel_nnet.R - performs model selection with subsets of the reference table to evaluate changes in model posterior probabilities as the total number of replicates declines from 50,000 to 5000

References

Alvarado-Serrano, D. F. and Hickerson, M. J. 2016. Spatially explicit summary statistics for historical population genetic inference. – Methods Ecol. Evol. 7: 418–427.

Alvarado-Serrano, D. F. and Hickerson, M. J. 2018. Detecting spatial dynamics of range expansions with geo-referenced genomewide SNP data and the geographic spectrum of shared alleles. – bioRxiv 457556.

Goovaerts, P. 1998. Geostatistical tools for characterizing the spatial variability of microbiological and physico-chemical soil properties. – Biol. Fertil. Soils 27: 315–334.

He, Q., Edwards, D. L. and Knowles, L. L. 2013. Integrative testing of how environments from the past to the present shape genetic structure across landscapes. – Evolution 67: 3386–3402.

Moran, P. 1950. Notes on continuous stochastic phenomena. – Biometrika 37: 17–23.

Peter, B.M. and Slatkin, M. 2013. Detecting range expansions from genetic data. – Evolution 67: 3274−3289.

Wright, S. 1949. The genetical structure of populations. – Ann. Eugen. 15: 323−354.

Wright, S. 1950. The genetical structure of populations. – Nature 166: 247-249.

Funding

National Science Foundation