Data for: Revealing hidden sources of uncertainty in biodiversity trend assessments
Abstract
Idiosyncratic decisions during the biodiversity trend assessment process may limit reproducibility, whilst “hidden” uncertainty due to collection bias, taxonomic incompleteness, and variable taxonomic resolution may limit the reliability of reported trends. We model alternative decisions made during assessment of taxon-level abundance and distribution trends using an 18-year time series covering freshwater fish, invertebrates, and primary producers in England. Through three case studies, we test for collection bias and quantify uncertainty stemming from data preparation and model specification decisions, assess the risk of conflating trends for individual species when aggregating data to higher taxonomic ranks, and evaluate the potential uncertainty stemming from taxonomic incompleteness. Choice of optimizer algorithm and data filtering to obtain more complete time series explained 52.5% of the variation in trend estimates, obscuring the signal from taxon-specific trends. The use of Penalized Iteratively Reweighted Least Squares, a simplified approach to model optimization, was the most important source of uncertainty. Application of increasingly harsh data filters exacerbated collection bias in the modelled dataset. Aggregation to higher taxonomic ranks was a significant source of uncertainty, leading to conflation of trends among protected and invasive species. We also found potential for substantial positive bias in trend estimation across six fish populations which were not consistently recorded in all operational areas. We complement analyses of observational data with in silico experiments in which monitoring and trend assessment processes were simulated to enable comparison of trend estimates with known underlying trends, confirming that collection bias, data filtering and taxonomic incompleteness have significant negative impacts on the accuracy of trend estimates. 
Identifying and managing uncertainty in biodiversity trend assessment is crucial for informing effective conservation policy and practice. We highlight several serious sources of uncertainty affecting biodiversity trend analyses and present tools to improve the transparency of decisions made during the trend assessment process.
Data for: Revealing hidden sources of uncertainty in biodiversity trend assessments
https://doi.org/10.5061/dryad.np5hqc034
Description of the data and file structure
Raw data needed to run the analyses presented in the paper. The dataset comprises CSV files containing count or detection-nondetection data, survey level metadata, and taxonomic information for four taxonomic groups (diat, fish, invs, macp); a CSV file of model arguments passed to generalized linear models; and shapefiles and rasters of spatial environmental datasets.
Files and variables
File: home.zip
Description:
home/mod_args.csv is a tabular CSV file containing the model arguments for each task submitted as a grid job. It contains the following variables:
- task.id: The unique task identifier used for grid job submission
- group: The taxonomic group (abbreviated as Diatoms=diat, Fish=fish, Invertebrates=invs, Macrophytes=macp)
- full: A binary flag (TRUE if the full trend model was fitted)
- filt: A binary flag (TRUE if trend models were fitted to data filtered by sampling frequency)
- nested: A binary flag (TRUE if trend models were fitted to lower rank taxa nested within higher ranks)
- season: A binary flag (TRUE if trend models were fitted without a season effect for testing purposes)
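As an illustration of how one row of mod_args.csv might be looked up by task identifier, here is a minimal base R sketch. The small table below is a hypothetical stand-in that only mirrors the column names listed above, and the helper get_task_args is not part of the deposited code.

```r
# Hypothetical stand-in for home/mod_args.csv; the real values will differ.
mod_args <- data.frame(
  task.id = 1:4,
  group   = c("diat", "fish", "invs", "macp"),
  full    = c(TRUE, TRUE, FALSE, FALSE),
  filt    = c(FALSE, TRUE, FALSE, TRUE),
  nested  = c(FALSE, TRUE, FALSE, FALSE),
  season  = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
# With the archive unpacked, the real table could be read instead:
# mod_args <- read.csv("home/mod_args.csv", stringsAsFactors = FALSE)

# Illustrative helper: return the arguments for exactly one task as a list.
get_task_args <- function(args_table, task_id) {
  row <- args_table[args_table$task.id == task_id, , drop = FALSE]
  if (nrow(row) != 1L) {
    stop("Expected exactly one row for task.id ", task_id)
  }
  as.list(row)
}

task <- get_task_args(mod_args, 2)
task$group  # taxonomic group for this task
task$filt   # whether the filtered-data models apply
```

Failing loudly when a task identifier is missing or duplicated avoids silently fitting the wrong model configuration.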
home/data_preparation contains tabular CSV files holding count or detection-nondetection data (suffix _y.csv), survey level information (suffix _metadata.csv), and taxonomic information (suffix _taxa_info.csv) for four taxonomic groups (prefixes; abbreviated as Diatoms=diat, Fish=fish, Invertebrates=invs, Macrophytes=macp).
In the count or detection-nondetection data (diat_y.csv, fish_y.csv, invs_y.csv, macp_y.csv), row names in the first column correspond to a unique sample identifier in the survey level information. For fish, this is the variable SURVEY_ID. For other taxonomic groups, it is the variable ANALYSIS_ID. Further columns correspond to each taxon included in the study.
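The link between the sample identifiers and the survey level information can be sketched in base R as follows. The two small tables are hypothetical miniatures (taxon and sample names are invented), and link_samples is an illustrative helper rather than a function from the deposited scripts.

```r
# Hypothetical miniature of a *_y.csv file: first column = sample identifier.
y_raw <- data.frame(
  id      = c("A1", "A2", "A3"),
  taxon_a = c(3, 0, 5),
  taxon_b = c(0, 2, 1)
)
# Hypothetical miniature of a *_metadata.csv file, in a different row order.
meta <- data.frame(
  ANALYSIS_ID = c("A3", "A1", "A2"),
  year        = c(2004, 2002, 2003)
)

# Illustrative helper: align metadata rows to the rows of the response table.
# id_col would be "SURVEY_ID" for fish and "ANALYSIS_ID" for other groups.
link_samples <- function(y, metadata, id_col) {
  ids <- as.character(y[[1]])
  idx <- match(ids, as.character(metadata[[id_col]]))
  if (anyNA(idx)) stop("Some samples have no matching metadata row")
  y_mat <- as.matrix(y[, -1, drop = FALSE])
  rownames(y_mat) <- ids
  list(y = y_mat, metadata = metadata[idx, , drop = FALSE])
}

linked <- link_samples(y_raw, meta, "ANALYSIS_ID")
linked$metadata$year  # years reordered to match the rows of linked$y
```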
In the tabular survey level information (diat_metadata.csv, fish_metadata.csv, invs_metadata.csv, macp_metadata.csv), variables in upper case are those supplied with the source data. Additional variables in lower case relate to the following:
- easting: The x coordinates of the survey location, in British National Grid (EPSG:27700)
- northing: The y coordinates of the survey location, in British National Grid (EPSG:27700)
- year: The year of the survey
- time: Time elapsed in decimal years from 1 January 2002
- julian: Julian day (day of the year) on which the survey took place
- n.years: The number of years for which survey data are available for the surveyed site
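The derived lower-case variables could be reproduced from a survey date roughly as shown below. This is a sketch under stated assumptions: the dates and site labels are invented, and dividing by 365.25 days per year is an assumed convention that may differ from the one used to build the deposited files.

```r
# Hypothetical survey dates and site labels, for illustration only.
survey_dates <- as.Date(c("2002-01-01", "2010-07-01", "2019-12-31"))
site         <- c("s1", "s1", "s2")

origin <- as.Date("2002-01-01")

# Calendar year and day of the year (1 = 1 January).
year   <- as.integer(format(survey_dates, "%Y"))
julian <- as.integer(format(survey_dates, "%j"))

# Decimal years since the origin; 365.25 days per year is an assumption.
time <- as.numeric(difftime(survey_dates, origin, units = "days")) / 365.25

# Number of distinct years with survey data at each site.
n.years <- ave(year, site, FUN = function(x) length(unique(x)))

data.frame(site, year, time, julian, n.years)
```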
In the tabular taxonomic information (diat_taxa_info.csv, fish_taxa_info.csv, invs_taxa_info.csv, macp_taxa_info.csv), variables in upper case are those supplied with the source data. Additional variables in lower case relate to the following:
- phylum: The taxon name at the rank of phylum
- class: The taxon name at the rank of class
- order: The taxon name at the rank of order
- family: The taxon name at the rank of family
- genus: The taxon name at the rank of genus
- species: The taxon name at the rank of species
- given.name: The taxon name used in the study
- given.rank: The rank at which the taxon was named in the study
- final.name: For fish only, the final taxon name used after aggregating to a rank higher than species
- final.rank: For fish only, the final rank at which the taxon was named after aggregating to a rank higher than species
- designation: Conservation designation of the taxon (non-native, no designation, protected)
- prevalence: The number of occurrences recorded for the taxon in the full data set
- total_abundance: The total abundance of the taxon in the full data set (not available for macrophytes)
- minor: For fish only, a binary flag (TRUE if the species is considered a minor species for monitoring purposes)
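To show how given.name and final.name relate when fish taxa are aggregated to a higher rank, the sketch below sums the columns that share a final.name. The taxon names and counts are invented, and summing counts is an assumed aggregation rule; the deposited scripts may handle this differently.

```r
# Hypothetical sample-by-taxon count matrix (rows = samples, columns = taxa).
y <- matrix(
  c(1, 0, 2,   # sp_a
    0, 3, 1,   # sp_b
    4, 0, 0),  # sp_c
  nrow = 3,
  dimnames = list(c("S1", "S2", "S3"), c("sp_a", "sp_b", "sp_c"))
)
# Hypothetical taxonomic lookup: sp_a and sp_b share a higher-rank name.
taxa <- data.frame(
  given.name = c("sp_a", "sp_b", "sp_c"),
  final.name = c("genus_x", "genus_x", "sp_c")
)

# Illustrative helper: sum counts over columns sharing the same final.name.
aggregate_taxa <- function(y, taxa) {
  groups <- taxa$final.name[match(colnames(y), taxa$given.name)]
  if (anyNA(groups)) stop("Some taxa are missing from the lookup table")
  t(rowsum(t(y), group = groups))
}

agg <- aggregate_taxa(y, taxa)
agg                      # counts per sample for genus_x and sp_c
detected <- (agg > 0) * 1  # detection-nondetection version of the same table
```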
home/UK_GIS_data/England_basins contains a shapefile of river basin areas within the study area. Also available from https://www.data.gov.uk/dataset/368ae5fb-65a1-4f19-98ff-a06a1b86b3fe/wfd-river-basin-districts-cycle-2
home/UK_GIS_data/LCMs contains raster images of land cover in the study area for the years 2000, 2007, 2015, 2017, 2018, 2019 and 2020. These are also available from https://www.ceh.ac.uk/data/ukceh-land-cover-maps
home/UK_GIS_data/RICT_Database_Revised_June_2018/rict_rasters_gb contains raster images of 11 environmental variables in the study area
home/UK_GIS_data/WorldPop contains a raster image of 1km aggregated human population density in 2020 in the study area. Also available from https://hub.worldpop.org/geodata/summary?id=39457
Code/software
R version 4.2.2
lme4_1.1-31
car_3.1-1
optimx_2022-4.30
ggplot2_3.5.1
gridExtra_2.3
grid_4.2.2
stringr_1.5.1
Metrics_0.1.4
pROC_1.18.0
reshape2_1.4.4
DHARMa_0.4.6
MuMIn_1.47.1
cowplot_1.1.1
performance_0.12.2
dplyr_1.1.4
ggpp_0.5.0
OCNet_1.2.2
mcomsimr_0.1.0
sf_1.0-14
abind_1.4-5
magrittr_2.0.3
tidyr_1.3.1
parallel_0.2.6
merTools_0.5.2
ape_5.7-1
ggtree_3.6.2
ggforce_0.4.1
ggpubr_0.5.0
raster_3.6-26
seegSDM_0.1-9
File: Code.zip
Description:
fit_glmer.R is the R script used to fit Generalized Linear Mixed Models (GLMMs) to count and detection-nondetection data
glmer_functions.R is a script containing the functions required to fit and analyse the GLMMs
process_mods.R is a script for summarising model outputs and producing model diagnostics
summary.R is a script for analysing the final model outputs
model_processing_functions.R is a script containing the functions required for summarising model outputs, producing model diagnostics and analysing the final model outputs
snapping_extraction.R is a script for extracting and wrangling environmental data on the surveyed sites and wider river network
representativeness.R is a script for analysing the collection bias in the analysed data sets
metacom.R is a script for running experiments on synthetic river networks and species
Usage notes
Microsoft Excel can be used to view mod_args.csv and the tabular CSV files contained within home/data_preparation. R or QGIS can be used to open shapefiles in home/UK_GIS_data/England_basins and raster files in home/UK_GIS_data/LCMs, home/UK_GIS_data/RICT_Database_Revised_June_2018/rict_rasters_gb, and home/UK_GIS_data/WorldPop. R is required to open fit_glmer.R, glmer_functions.R, process_mods.R, summary.R, model_processing_functions.R, snapping_extraction.R, representativeness.R, and metacom.R.
Handling shapefiles
All the shapefiles (.shp) used in this study contain the geometry and attributes of geospatial (polygon) features. The file bundle contains the main file .shp and companion files as follows:
- .shp: The main geospatial data file that contains feature geometry
- .cpg: The file specifying the code page used to identify the character set
- .dbf: The dBASE table that contains the attributes of features
- .prj: The file that contains the coordinate system and map projection information
- .sbn: The file containing the spatial index of features
- .sbx: The file containing the spatial index of features
- .shx: The file containing the index of feature geometry
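Because a shapefile only opens correctly when its companion files sit alongside the .shp file, a quick check such as the following can be run before loading. This is a base R sketch: the helper name and the demo file name are hypothetical, and treating .prj as required is a choice made here so that the coordinate system is not lost.

```r
# Illustrative helper: report which companion files are absent for a .shp path.
missing_shapefile_parts <- function(shp_path,
                                    required = c("shp", "shx", "dbf", "prj")) {
  base <- sub("\\.shp$", "", shp_path, ignore.case = TRUE)
  required[!file.exists(paste0(base, ".", required))]
}

# Demonstration with empty placeholder files in a temporary directory.
demo_base <- file.path(tempdir(), "demo_basins")
file.create(paste0(demo_base, c(".shp", ".shx", ".dbf")))

missing <- missing_shapefile_parts(paste0(demo_base, ".shp"))
missing  # the .prj file was not created above, so it is reported

# If nothing is missing and the sf package is installed, the layer could then
# be read with sf::st_read() pointed at the .shp file.
```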
Handling raster files
All raster files (.tif, .img) used in this study store raster data and its associated geospatial information. The file bundle contains the main file (.tif or .img) and companion files as follows:
- .tif or .img: Stores image information and raster graphics
- .aux: Stores additional image or raster information that cannot be stored in the TIFF file
- .cpg: The file specifying the code page used to identify the character set
- .dbf: The dBASE table that contains the attributes of an image or a raster file
- .rrd: Stores a pyramid file for display and visualization
- .tfw: Contains georeferencing information for a raster.
- .xml: Metadata of an image or a raster file.
