Data from: Black-grass monitoring using hyperspectral image data is limited by between-site variability
Abstract
Many important ecological processes play out over large geographic ranges, and accurate large-scale monitoring of populations is a requirement for their effective management. Of particular interest are agricultural weeds which cause widespread economic and ecological damage. However, the scale of weed population data collection is limited by an inevitable trade-off between quantity and quality. Remote sensing offers a promising route to large-scale collection of population state data. However, a key challenge is to collect data of sufficiently high resolution and to account for between-site variability in environmental (i.e. radiometric) conditions (Peleg, Andersen, & Yang, 2005; Nansen, Mantri, & Lee, 2023) that may make prediction of population states in new data challenging. Here we use a multi-site hyperspectral image data set in conjunction with ensemble learning techniques in an attempt to predict densities of an arable weed (Alopecurus myosuroides) across an agricultural landscape. We demonstrate reasonable predictive performance when classifiers are used to predict new data from the same site. However, even using flexible ensemble techniques to account for environmental or biological variability in spectral data, we show that out-of-field predictive performance is poor. This study highlights the difficulties in identifying weeds in situ even using high resolution and band-width remote sensing.
https://doi.org/10.5061/dryad.qz612jmqp
Description of the data and file structure
Data:
Contains data files to replicate analyses:
multi_state_fields.rds - index of fields containing more than one density state.
opt_hypars.rds - contains values of optimal random forest hyperparameters - please see the supplementary material from the paper for an explanation.
sub_samp_HS_data.rds - contains training data for raw hyperspectral bands.
sub_samp_VGI_data.rds - contains training data for derived vegetation indices.
Within the last two files the data have the following columns:
DS - the density state of the quadrat.
image_strip - an index of the combined images used to derive the hyperspectral or vegetation index data.
grid - a quadrat ID index.
x - an X coordinate from the given field.
y - a Y coordinate from the given field.
field - a field index for the given field.
The remaining columns refer to the values of the vegetation indices (indexed by name), or the hyperspectral bands (e.g. XXXXX.Nanometers).
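As a minimal sketch of how the column layout above might be used, the base R snippet below separates the six metadata columns from the predictor columns. The helper name split_predictors and the toy data frame (including its band column names and values) are illustrative assumptions, not part of the deposited code; in practice the data frame would come from calling readRDS() on one of the two training data files.

```r
# Metadata columns described above; everything else is treated as a predictor.
meta_cols <- c("DS", "image_strip", "grid", "x", "y", "field")

# Hypothetical helper: split a data frame into metadata and predictor columns.
split_predictors <- function(dat, meta = meta_cols) {
  missing_cols <- setdiff(meta, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing metadata columns: ", paste(missing_cols, collapse = ", "))
  }
  list(
    meta = dat[, meta, drop = FALSE],
    predictors = dat[, setdiff(names(dat), meta), drop = FALSE]
  )
}

# Toy stand-in for the real data (band names and values are made up).
toy <- data.frame(
  DS = factor(c("low", "high", "medium")),
  image_strip = c(1, 1, 2),
  grid = c(10, 11, 12),
  x = c(0, 20, 40),
  y = c(0, 0, 20),
  field = c("A", "A", "B"),
  band_1 = c(0.11, 0.15, 0.12),
  band_2 = c(0.31, 0.42, 0.35)
)

parts <- split_predictors(toy)
names(parts$predictors)
```

Selecting predictors by exclusion rather than by name means the same helper should work for both the hyperspectral and the vegetation index files, provided the metadata column names match those listed above.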
R:
Contains all R code to reproduce analyses.
helper_functions.R - contains R code used in model fitting and data manipulation.
performance_metrics.R - contains R code used to evaluate predictive performance of classifiers.
ensemble_fit - contains all R scripts to fit field-level classifiers to HS and VGI data.
ensemble_predict - contains all R scripts to predict (within field) from individual field-level classifiers to HS and VGI data.
weight_ensemble - contains all R scripts to weight individual field-level classifiers and predict (out of field) from the ensemble.
plot_ensemble_results_updated - contains a script to plot the results from the weighted ensemble, and some other figures as they appear in the paper.
Results:
Contains .csv files containing the predictive performance of ensemble model fits as well as individual classifier fits from random forests fit to individual fields.
all_ensemble_HS_pred_perf.csv - Predictive performance from models fit to HS data.
all_ensemble_VGI_pred_perf.csv - Predictive performance from models fit to VGI data.
Within these files the column headers are as follows:
X - a row ID index
Field - The field for which predictions were made.
hyp_mar_NN - the distance to the nearest neighbour calculated from the hypothesis margin.
Columns labelled acc_X_XX represent the geometric mean scores for each prediction, columns labelled with l/m/h (e.g. acc_l_xx) represent the geometric mean score for each of the respective (low, medium, high) density states. Columns labelled w0/w2/w10 represent the geometric mean score for unweighted (w0), and weighted (w2,w10), ensembled predictions respectively.
Cells containing NA values in these performance data indicate that the field contained no quadrats of the corresponding density state, so no accuracy score could be calculated.
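Because of these NA cells, any summary across fields needs to skip missing scores explicitly. The sketch below shows one way this might be done in base R. The helper summarise_scores and the toy column names (acc_l_demo, acc_h_demo) are hypothetical and only mimic the acc_ prefix described above; the real files would be loaded with read.csv().

```r
# Hypothetical helper: mean of every column whose name starts with "acc_",
# skipping NA cells (fields that lack the corresponding density state).
summarise_scores <- function(perf) {
  score_cols <- grep("^acc_", names(perf), value = TRUE)
  vapply(perf[score_cols], function(v) mean(v, na.rm = TRUE), numeric(1))
}

# Toy performance table; column names and values are made up for illustration.
toy_perf <- data.frame(
  X = 1:3,
  Field = c("A", "B", "C"),
  acc_l_demo = c(0.6, NA, 0.8),
  acc_h_demo = c(0.5, 0.7, NA)
)

score_means <- summarise_scores(toy_perf)
score_means
```

Note that with na.rm = TRUE each column mean is taken over a different set of fields, so means for different density states are not directly comparable when many cells are NA.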
Files labelled:
XXXX_XX_grid_X_fit_HS - contain .rds files of model fits to HS data from an individual field.
XXXX_XX_grid_X_fit_VGI - contain .rds files of model fits to VGI data from an individual field.
The .Rproject file links an RStudio session to the folder, meaning all files can be run from that RStudio session without having to set a working directory.
The data consist of categorical assessments of weed (black-grass) density from 31 fields across the UK. Each field is divided into 20 x 20 m quadrats, and each quadrat is assigned one of 3 density states (low, medium, high) as in Queenborough et al. (2011). For each quadrat we provide 1000 pixel-level samples of 120 spectral bands (the hyperspectral image data used in the manuscript), as well as 14 vegetation indices derived from these bands. Code is included to fit an ensemble of random forests to these data to attempt to predict the quadrat-level density states from hyperspectral (HS) and vegetation index (VGI) data. Cross-validation code is included to assess whether out-of-sample (i.e. new fields) predictive performance can be increased via ensemble models. Ensembles are weighted towards fields whose spectra are more similar to the out-of-sample data.
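The idea of weighting an ensemble towards spectrally similar fields can be sketched in base R as below. This is an illustration under stated assumptions, not the scheme implemented in weight_ensemble: the function weighted_ensemble, the similarity scores, and the use of an exponent (power) to control weighting strength are all hypothetical, and the probability matrices stand in for the per-class outputs of field-level classifiers.

```r
# Combine class-probability matrices from several field-level classifiers.
# prob_list:  list of matrices (rows = samples, columns = density states)
# similarity: one non-negative similarity score per classifier (assumed input)
# power:      exponent applied to similarity; power = 0 gives equal weights
weighted_ensemble <- function(prob_list, similarity, power = 0) {
  stopifnot(length(prob_list) == length(similarity), all(similarity >= 0))
  w <- similarity^power
  w <- w / sum(w)  # normalise weights so they sum to one
  combined <- Reduce(`+`, Map(function(p, wi) p * wi, prob_list, w))
  # Predicted state is the column with the highest weighted probability.
  colnames(combined)[max.col(combined, ties.method = "first")]
}

states <- c("low", "medium", "high")
# Made-up probabilities from two hypothetical field-level classifiers.
p1 <- matrix(c(0.7, 0.2, 0.1,
               0.6, 0.3, 0.1),
             nrow = 2, byrow = TRUE, dimnames = list(NULL, states))
p2 <- matrix(c(0.1, 0.1, 0.8,
               0.2, 0.2, 0.6),
             nrow = 2, byrow = TRUE, dimnames = list(NULL, states))

pred_unweighted <- weighted_ensemble(list(p1, p2), c(0.9, 0.3), power = 0)
pred_weighted <- weighted_ensemble(list(p1, p2), c(0.9, 0.3), power = 2)
pred_unweighted
pred_weighted
```

In this toy case the unweighted ensemble predicts "high" for the first sample, while weighting towards the more similar classifier changes that prediction to "low", which shows how the choice of weights can alter out-of-field predictions.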
