Code and data for Bayesian joint species distribution model selection for community-level prediction
Data files
Nov 24, 2023 version files 1.13 MB
Abstract
Code and data for reproducing the analysis in the manuscript “Bayesian joint species distribution model selection for community-level prediction.” Provided data include percent cover observations for 39 modeled vascular plant species within boreal forest understory communities and environmental model covariates. R code is provided to generate model inputs, apply alternative models, generate out-of-sample predictions, and calculate associated community and species log scores and alternative model evaluation metrics. Further, R source code is provided to implement the multinomial joint species distribution model defined in the manuscript. Details on the data, its processing, and the alternative model definitions and structure can be found in the main text of the manuscript. Provided data are currently being used in ongoing analyses and coordination with authors may be warranted to avoid duplicate publication. Potential users are encouraged to consider collaboration with authors when useful and appropriate. Misinterpretation of data may occur if used outside the context of the original analysis. All data are made available in their current state. While significant efforts have been made to ensure data accuracy, complete accuracy cannot be guaranteed. Data may be updated periodically. It is the responsibility of the data user to check for updated versions of the data.
README
Boreal forest understory community data and associated environmental variables for reproducing the analysis in the manuscript "Bayesian joint species distribution model selection for community-level prediction." Data include percent cover observations of 39 vascular plant species across 1,700 unique vegetation survey sites sampled in 1985-1986, 1995, and 2006 in conjunction with the 8th Finnish National Forest Inventory. Environmental variables characterizing the conditions for each site-year combination are provided. These are the same variables used to predict species relative abundances at each site within the associated manuscript.
Description of the data and file structure
Understory community data are provided in a single comma-separated text file named "understory_comm_dat.csv". Each row of the data file corresponds to a unique site-by-inventory year combination. The columns provide percent cover observations for vascular plant species along with several identifier and environmental variables. Metadata for each variable are provided below.
Identifier variables
- Site: Unique numeric identifier for each vegetation survey site
- Year: The year in which the survey site was measured (1985, 1995, 2006)
- Site_year: Unique combinations of site-by-year
- BZ: Bioclimatic zone in which the site is located (SB = south boreal; MB = mid boreal; NB = north boreal) based on Ahti et al. (1968)
Environmental variables
- aveGDD74_84: Growing degree days over the 1974-1984 decadal period, estimated as the 10-year moving average of the total number of days per site with a daily mean temperature exceeding +5 deg. C, based on 10 sq. km interpolated daily temperature values modeled by the Finnish Meteorological Institute (Venäläinen et al., 2005)
- aveGDD84_94: Growing degree days over the 1984-1994 decadal period (defined as for the aveGDD74_84 variable)
- aveGDD95_05: Growing degree days over the 1995-2005 decadal period (defined as for the aveGDD74_84 variable)
- fertility: Soil fertility based on site-level indicator vegetation observed during the inventory year, divided into six ordinal classes (1 indicates highest fertility; 6 indicates lowest fertility) based on Cajander (1949) and described in Tomppo et al. (2011)
- shrub_cover: Projected percent cover of shrubs and 0.5-1.5 m tall trees located within a 9.8 m radius circular plot centered on the vegetation survey site
- ba: The basal area reported in m^2 per ha of live overstory trees derived from measurements of stem diameter at 1.3 m collected during the inventory year (Tomppo et al., 2011)
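Because growing degree days are stored in three decade-specific columns, an analysis that needs one value per site-year row has to pick the column that goes with each inventory year. The sketch below assumes the mapping 1985 to aveGDD74_84, 1995 to aveGDD84_94, and 2006 to aveGDD95_05 (the decade preceding each inventory). That mapping, the helper name `match_gdd`, and the toy values are assumptions made for illustration; they are not taken from the provided R script.

```r
# Hypothetical helper: return the decadal GDD value matching each row's Year.
# ASSUMPTION: each inventory year is paired with the preceding decade.
match_gdd <- function(dat) {
  gdd_col <- c("1985" = "aveGDD74_84",
               "1995" = "aveGDD84_94",
               "2006" = "aveGDD95_05")
  cols <- gdd_col[as.character(dat$Year)]
  if (anyNA(cols)) stop("Unexpected value in Year")
  # Numeric matrix holding only the three GDD columns
  gdd <- as.matrix(dat[, unname(gdd_col), drop = FALSE])
  # Matrix indexing: one (row, column) pair per site-year
  idx <- cbind(seq_len(nrow(dat)), match(cols, colnames(gdd)))
  gdd[idx]
}

# Toy rows with invented values, using the column names from this README
toy <- data.frame(
  Site = c(1, 1, 2),
  Year = c(1985, 1995, 2006),
  aveGDD74_84 = c(1100, 1100, 900),
  aveGDD84_94 = c(1150, 1150, 950),
  aveGDD95_05 = c(1200, 1200, 1000)
)
toy$gdd <- match_gdd(toy)
```

With the real file, `toy` would be replaced by the data frame returned by `read.csv("understory_comm_dat.csv")`.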
Percent cover of vascular plants
Remaining columns in the data file report the mean percent cover of vascular plants across four 2 m^2 quadrats located 5 m apart within each vegetation survey site. Column names correspond to species abbreviations defined below.
- AGROCAPI: Agrostis capillaris
- BETUPUB3: Betula pubescens
- CALAARUN: Calamagrostis arundinacea
- CALLVULG: Calluna vulgaris
- CAREDIGI: Carex digitata
- CAREGLOB: Carex globularis
- CONVMAJA: Convallaria majalis
- DESCCESP: Deschampsia cespitosa
- DESCFLEX: Deschampsia flexuosa
- DRYOCART: Dryopteris carthusiana
- EMPENIGR: Empetrum nigrum
- EPILANGU: Epilobium angustifolium
- EQUISYLV: Equisetum sylvaticum
- FRAGVESC: Fragaria vesca
- GYMNDRYO: Gymnocarpium dryopteris
- JUNICOM3: Juniperus communis
- LEDUPALU: Ledum palustre
- LINNBORE: Linnaea borealis
- LUZUPILO: Luzula pilosa
- LYCOANNO: Lycopodium annotinum
- MAIABIFO: Maianthemum bifolium
- MELAPRAT: Melampyrum pratense
- MELASYLV: Melampyrum sylvaticum
- MELINUTA: Melica nutans
- ORTHSECU: Orthilia secunda
- OXALACET: Oxalis acetosella
- PICEABI3: Picea abies
- PINUSYL3: Pinus sylvestris
- POPUTRE3: Populus tremula
- PTERAQUI: Pteridium aquilinum
- RUBUIDA4: Rubus idaeus
- RUBUSAXA: Rubus saxatilis
- SOLIVIRG: Solidago virgaurea
- SORBAUC3: Sorbus aucuparia
- TRIEEURO: Lysimachia europaea
- VACCMYRT: Vaccinium myrtillus
- VACCULIG: Vaccinium uliginosum
- VACCVITI: Vaccinium vitis-idaea
- VIOLRIVI: Viola riviniana
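The column layout described above can be used to separate identifiers, covariates, and species cover once the file is loaded. The following is a minimal sketch that assumes the column names are exactly those listed in this README. The helper name `split_understory` and the toy data frame (including the made-up `Site_year` format) are hypothetical, and the row-wise normalization of percent cover is only one plausible way to obtain relative abundances; it is not necessarily the processing used in "applied_model_selection.R".

```r
id_vars  <- c("Site", "Year", "Site_year", "BZ")
env_vars <- c("aveGDD74_84", "aveGDD84_94", "aveGDD95_05",
              "fertility", "shrub_cover", "ba")

# Hypothetical helper: split the table into identifier, environmental,
# and species blocks, and normalize cover to relative abundance per row.
split_understory <- function(dat) {
  # Every column that is not an identifier or covariate is a species
  spp_vars <- setdiff(names(dat), c(id_vars, env_vars))
  cover <- as.matrix(dat[, spp_vars, drop = FALSE])
  tot <- rowSums(cover)
  # Rows with zero total cover yield NA instead of dividing by zero
  rel <- cover / ifelse(tot > 0, tot, NA_real_)
  list(id = dat[, id_vars, drop = FALSE],
       env = dat[, env_vars, drop = FALSE],
       cover = cover,
       rel_abund = rel)
}

# With the real file: dat <- read.csv("understory_comm_dat.csv")
# Toy rows with invented values and only three of the 39 species columns
toy <- data.frame(
  Site = c(1, 2), Year = c(1985, 1985),
  Site_year = c("1_1985", "2_1985"), BZ = c("SB", "NB"),
  aveGDD74_84 = c(1100, 700), aveGDD84_94 = c(1150, 750),
  aveGDD95_05 = c(1200, 800),
  fertility = c(2, 5), shrub_cover = c(10, 0), ba = c(20, 8),
  VACCMYRT = c(30, 0), VACCVITI = c(10, 0), LINNBORE = c(0, 0)
)
parts <- split_understory(toy)
```

Identifying species columns by exclusion avoids typing all 39 abbreviations, at the cost of silently treating any unexpected extra column as a species.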
Code for processing the data and reproducing the analysis in the corresponding manuscript is described under Code/Software below.
Sharing/Access information
Data and code were derived from the following sources.
- Ahti, T., Hämet-Ahti, L., and Jalas, J. (1968). Vegetation zones and their sections in northwestern Europe. Annales Botanici Fennici, 5(3):169–211.
- Venäläinen, A., Tuomenvirta, H., Pirinen, P., and Drebs, A. (2005). A basic Finnish climate data set 1961–2000–description and illustrations. Finnish Meteorological Institute, Reports, 5:1–27.
- Cajander, A. K. (1949). Forest types and their significance. Acta For. Fenn., 56:1–71.
- Tomppo, E., Heikkinen, J., Henttonen, H. M., Ihalainen, A., Katila, M., Mäkelä, H., et al. (2011). Designing and conducting a forest inventory-case: 9th National Forest Inventory of Finland, volume 22. Springer Science & Business Media.
- Tikhonov, G., Ovaskainen, O., Oksanen, J., de Jonge, M., Opedal, O., and Dallas, T. (2021). Hmsc: Hierarchical Model of Species Communities. R package version 3.0-11.
Code/Software
One R script file and several R functions are provided to reproduce the analysis in the corresponding manuscript. All provided R files are described below.
"applied_model_selection.R": R script file that loads the data file described above, processes the data to generate inputs associated with the models described in the corresponding manuscript, demonstrates the implementation of each model including model post-processing and out-of-sample prediction, and generates log scores and alternative scoring metrics for each applied model. All source files needed to run the R script are provided and described below.
"Hmsc-Understory": Repository including R source code for a modified version of the Hierarchical Model of Species Communities R package (Tikhinov et al., 2021) used to implement the Poission approximation to the multinomial described in the corresponding manuscript. Calls to the modified source files are demonstrated in the "applied_model_selection.R" file.
"ls_approx_pois_multinom.R": R source code approximating the joint community log score for a multinomial data model fit using the Poisson approximation to the multinomial.
"ls_approx_poisson.R": R source code approximating the joint community log score for a log normal Poisson data model (equivalent to the independent community log score).
"ls_approx_spp_poisson.R": R source code approximating individual species log scores for models fit applying either a multinomial or log normal Poisson data model.
"spp_diversity_rmse.R": R source code estimating the total squared error for the Shannon true diversity index for models fit applying either a multinomial or log normal Poisson data model.
"jaccard_idx.R": R source code estimating the sum of the posterior mean Jaccard community dissimilarity index across all sample sites for models fit applying either a multinomial or log normal Poisson data model.
"spp_rmse.R": R source code estimating the total squared error for species-level predictions for models fit applying either a multinomial or log normal Poisson data model.
"spp_pred_var.R": R source code estimating the total posterior predictive variance for species-level predictions for models fit applying either a multinomial or log normal Poisson data model.
Methods
Data include percent cover observations of 39 vascular plant species across 1,700 unique vegetation survey sites sampled in 1985-1986, 1995, and 2006 in conjunction with the 8th Finnish National Forest Inventory. Values report the mean percent cover of vascular plants across four 2 m^2 quadrats located 5 m apart within each vegetation survey site. Environmental variables characterizing the conditions for each site-year combination are also provided. These are the same variables used to predict species relative abundances at each site within the associated manuscript.
Usage notes
No special program or software is required to open the data. The R script and R source code files require the R statistical computing environment.