Data from: Agent-based versus correlative models of species distributions: Evaluation of predictive performance with real and simulated data
Abstract
Species distribution models (SDMs) have been widely used in ecology to understand how species relate to environmental variation. Most SDMs are correlative: they lack explicit reference to the underlying processes, so the reliability of their predictions may be questionable. Mechanistic models that incorporate components related to underlying processes, such as trophic interactions or dispersal, have been utilized less because of their case-specificity and the difficulties of parametrizing them, which typically requires significantly more data than parametrizing correlative models.
We compare correlative and mechanistic species distribution models in prediction tasks under different scenarios. We define a mechanistic agent-based model of resource-consumer dynamics to generate data with known processes and parameter values. We fit correlative and mechanistic models to these data to study under which conditions mechanistic models give more accurate predictions and how robust they are to model misspecification. In the simulated setting, the mechanistic models provided better extrapolation predictions than the correlative model when the model used for fitting matched the data-generating model.
The mechanistic model predictions were sensitive to the correctness of the model, and their quality dropped significantly even under slight model misspecification. In the real data analyses, the correlative models consistently outperformed the mechanistic models, which were not tailored to the specific situations.
Mechanistic species distribution models may provide a significant advantage over the more commonly used correlative models when predicting to new environmental conditions. However, this requires that the model be carefully tailored to the specific system, because predictions from mechanistic models are sensitive to model misspecification.
Overview of the contents
The package contains the simulation code, analysis code, and data necessary to run all the analyses in the paper.
Simulation code
Simulation code is located in the simulation_code directory. This directory contains the software and data to simulate generalists, specialists, and partial generalists ("middle") for the five data sets (birds, butterfly, plant, trees, vegetation).
Analysis code
Analysis code is located in the analyses folder. The code subfolder contains the scripts, most of which are described in the Analysis workflow section below. In addition, there are two scripts:
utils.R
(some functions used by the other scripts)
plotBirdDivision.R
(plotting of the sites and their division into training and test sets for all data sets).
There are also empty figures and results subfolders that are used when running the analyses.
Data
The data are located in the analyses/data/ folder ready to be used by the analysis code.
Real data sets
Located in the subfolder unified data. The data sets are modified versions of the data from: Anna Norberg (2019). aminorberg/SDM-comparison: Norberg et al. (2019) (publication). Zenodo. https://doi.org/10.5281/zenodo.2637812
Y_birds.csv
Y_butterfly.csv
Y_plant.csv
Y_trees.csv
Y_vegetation.csv
Species observations for all 1200 sites in each of the five data sets. Each row contains presence/absence observations of the study species for a single site. Each column corresponds to a single species.
X_birds.csv
X_butterfly.csv
X_plant.csv
X_trees.csv
X_vegetation.csv
Environmental covariates for all 1200 sites in each of the five data sets. Each row contains 3 to 5 habitat quality values for a single site. The habitat quality values are PCA-transformed from environmental covariates, and their number varies between the data sets.
S_birds.csv
S_butterfly.csv
S_plant.csv
S_trees.csv
S_vegetation.csv
Geographical coordinates for all 1200 sites in each of the five data sets. Each row contains X and Y coordinates for a single site.
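As a quick sanity check, the shapes described above can be inspected from the command line. The snippet below uses a tiny synthetic stand-in file rather than the real data; the comma delimiter and absence of a header row are assumptions, so adjust to the actual files:

```shell
# Synthetic stand-in mirroring the described Y layout: presence/absence,
# one row per site, one comma-separated column per species (assumed format).
printf '1,0,1\n0,0,1\n' > /tmp/Y_example.csv

# Number of sites = number of rows
wc -l < /tmp/Y_example.csv
# Number of species = number of columns in the first row
awk -F',' 'NR==1 {print NF}' /tmp/Y_example.csv
```

The same two commands apply to the X and S files, where the column counts are the number of habitat-quality covariates and the two coordinates, respectively.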
Simulated data sets
Simulated data sets created by the simulation code and used in the analyses of the paper. The simulated data sets are located in three subfolders of simudataN: true_generalist, true_middle, and true_specialist. Each subfolder contains the simulated data corresponding to one data-generating model (generalist, middle, or specialist).
Each true_X folder contains ten instances of data (one per observation range, see below) from each of 1000 simulations of model X.
- The countsY files in the m1counts subfolder contain details of simulation Y (simulation time, number of particles, etc.).
- The m1out.ZZ subfolders contain data simulated using an observation range parameter between 0.ZZ and 0.ZZ+0.01. For example, in the m1out.01 folder the observation range parameter was between 0.01 and 0.02.
- Each m1out.ZZ subfolder contains 1000 simulated data sets outY.ve, each created from simulation Y using the 0.ZZ observation range. Each outY.ve contains presence/absence observations for all 1200 sites for a single species.
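The mapping from an observation-range value to its m1out.ZZ subfolder can be sketched as a small shell helper. The helper name is hypothetical; the truncation rule follows the example above:

```shell
# Hypothetical helper: map an observation-range parameter to its m1out.ZZ
# subfolder name by truncating to two decimals (e.g. 0.015 falls in m1out.01).
range_folder() {
  printf 'm1out.%02d\n' "$(awk -v r="$1" 'BEGIN { printf "%d", r * 100 }')"
}
range_folder 0.015   # prints m1out.01
```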
Simulation workflow
Step 1. Compile realppsim4sign_timer.c and real_virtual_ecologist.c
gcc realppsim4sign_timer.c -lm -O2 -o realppsim4sign_timer
gcc real_virtual_ecologist.c -lm -O2 -o real_virtual_ecologist
Step 2. Create a file with model parameters.
The example file 'model100Kprior10' contains parameters for 100K models. Instructions for creating such a file are in the file 'readme.samplemodel'.
Step 3. Create output directories.
See the example in the file 'readme.mkdirs'. ID can be 'birds', 'butterfly', 'plant', 'trees', or 'vegetation'.
Step 4. Run simulations.
There are three example files for simulating generalist, specialist, and middle behavior:
example_simulate_generalist
example_simulate_specialist
example_simulate_middle
Set the data variable to 'birds', 'butterfly', 'plant', 'trees', or 'vegetation'.
The variable REP in the for loop selects the models whose parameters are used in the simulations; its values are line numbers of MODELSFILE, ignoring the header line.
When running several scripts in parallel, give each run a different value for the variable TMP so that parallel runs do not overwrite each other's results.
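A parallel-safe launch might be sketched as follows. The variable names DATA, MODELSFILE, REP, and TMP come from the instructions above, but the echoed command merely stands in for the actual invocation; consult the example_simulate_* files for the real call:

```shell
# Sketch of launching simulations for several models. Each parallel run gets
# its own TMP value so results are not overwritten; REP iterates over line
# numbers of MODELSFILE (header ignored). The echo is a placeholder for the
# actual simulation command used in the example_simulate_* scripts.
DATA=birds                    # or butterfly, plant, trees, vegetation
MODELSFILE=model100Kprior10
TMP=run1                      # unique per parallel run
for REP in 1 2 3; do
  echo "simulate model line $REP of $MODELSFILE for $DATA (TMP=$TMP)"
done
```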
Analysis workflow
The workflow for replicating the analyses in the manuscript is as follows:
1. Load data sets
a) Simulated data
loadData_simuN.R
(The data are included zip-compressed as simudataN.zip in the data/ folder and must be unzipped before use.)
b) Real data sets
loadData.R
2. Load simulations and calculate summaries.
a) Summaries for simulated data:
loadSimulations_simudataN.R
b) Summaries for real data sets:
loadSimulations_alldatab.R
c) Summaries for all simulations used in ABC:
loadSimulations_N.R
(Used for loading all simulation sets except the big simulation set).
loadSimulations_big.R
(Used for loading simulations of the big simulation set in two parts).
mergeBigs.R
(Merges the files containing the big simulation set summaries into single files)
3. Run ABC
a) Simulated data:
runABC_simuN.R
(ABC predictions for all simulated test data sets with all models and simulation sets; to run with arguments 1-400)
mergeABCresults_simuN.R
(calculate measures of prediction quality).
b) Real data:
runABC_Nb.R
(to run with 2 arguments: job index (1-15) and simulation set (wide, narrow, big))
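The index-argument scripts in this workflow can be driven by a simple loop; on a cluster, each index would typically be one array-job task. The Rscript invocation and the code/ path are assumptions here, so check each script's header for its expected arguments:

```shell
# Hypothetical serial driver for runABC_simuN.R (task indices 1-400).
# echo shows the commands instead of running them; drop the echo to execute,
# or submit one cluster array task per index.
for i in $(seq 1 400); do
  echo Rscript code/runABC_simuN.R "$i"
done | wc -l   # 400 commands
```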
4. Run HMSC
a) Simulated data:
runHMSC_simuN.R
(to run with arguments 1-1200)
mergeHMSCresults_simuN.R
(calculate measures of prediction quality)
b) Real data sets:
runHMSC_cluster_singleb.R
(to run with arguments 1-342)
mergeHMSCresultsb.R
(Calculate measures of prediction quality)
5. Run ensemble models
a) Simulated data:
runBIOMOD_simuN.R
(to run with arguments 1-4)
b) Real data sets:
runBIOMOD_cluster.R
(to run with arguments 1-5)
6. Plot results
a) Simulated data:
plotAUC_simuN.R
b) Real data:
plotAUC_wN.R
