Occurrences and R code for: Dynamic distribution modeling of the Swamp Tigertail dragonfly Synthemis eustalacta (Odonata: Anisoptera: Synthemistidae) over a 20-year bushfire regime
Goodman, Aaron (2022), Occurrences and R code for: Dynamic distribution modeling of the Swamp Tigertail dragonfly Synthemis eustalacta (Odonata: Anisoptera: Synthemistidae) over a 20-year bushfire regime, Dryad, Dataset, https://doi.org/10.5061/dryad.tx95x6b23
Intensity and severity of bushfires in Australia have increased over the past few decades due to climate change, threatening habitat loss for numerous species. Although the impact of bushfires on vertebrates is well-documented, the corresponding effects on insect taxa are rarely examined, although they are responsible for key ecosystem functions and services. Understanding the effects of bushfire seasons on insect distributions could elucidate long-term impacts and patterns of ecosystem recovery. Here, we investigated the effects of recent bushfires, land-cover change, and climatic variables on the distribution of a common and endemic dragonfly, the swamp tigertail (Synthemis eustalacta (Burmeister, 1839)), which inhabits forests that have recently undergone severe burning. We used a temporally dynamic species distribution modeling approach that incorporated 20 years of community-science data on dragonfly occurrence and predictors based on fire, land cover, and climate to make yearly predictions of suitability. We also compared this to an approach that combines multiple temporally static models that use annual data. We found that for both approaches, fire-specific variables had negligible importance for the models, while percent of tree and non-vegetative cover were the most important. We also found that the dynamic model outperformed the static ones when evaluated with cross-validation. Model predictions indicated temporal variation in area and spatial arrangement of suitable habitat but no patterns of habitat expansion, contraction, or shifting. These results highlight not only the efficacy of dynamic modeling to capture spatiotemporal variables, such as vegetation cover for an endemic insect species, but also provide a novel approach to mapping species distributions with sparse locality records.
We acquired occurrence records of adult S. eustalacta from the Global Biodiversity Information Facility (GBIF, DOI: https://doi.org/10.15468/dl.c256kv). We first subsetted the raw data by selecting occurrence records identified as museum samples, curated research-grade community science observations, and published sightings from scientific surveys. We then filtered out records without coordinate information and those recorded outside the years 2001–2020. Finally, we restricted our analysis dataset to records from scientific institutions (Australian Museum, Queensland Museum, Naturalis Biodiversity Center, Murray Darling Basin Authority), community science websites (iNaturalist, Atlas of Living Australia), and governmental organizations (New South Wales Department of Planning, Industry, and Environment, South Australia Department for Environment and Water). In total, we acquired 483 occurrences (Table 1). Further occurrence filtering consisted of removing sightings with erroneous localities (specimens located at known institutions).
We generated yearly sets of environmental predictor variables for modeling that included bioclimatic variables, as well as vegetation cover and seasonal burned area for the years 2001-2020. All analyses were conducted using the statistical programming language R v. 4.1.2 (Team, 2021), and all layers have a geographic coordinate system (i.e., degrees) with a WGS84 datum. We acquired monthly minimum and maximum temperature and precipitation rasters for Australia at 2.5 arcminutes resolution (approx. 5 km at the equator), produced by the Australia Bureau of Meteorology (Jones et al., 2009). From these rasters, we created a set of 19 bioclimatic variables representing means, variabilities, and extremes for temperature and precipitation using the dismo package (Hijmans et al., 2017). Due to known spatial artifacts, we omitted four of these variables that include temperature-precipitation interactions (bio08, bio09, bio18, bio19) from the analysis (Moo-Llanes et al., 2021). We also acquired remotely sensed variables from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) using the MODISstp package (Busetto & Ranghetti, 2016). Landscape variables describing fire and land cover have been shown to be important predictors of dragonfly range (Jolly et al., 2022). Increases in regional fire frequency lead to higher rates of river contamination via burned carbon and metal leaching (Kelly et al., 2020; Nasirian & Irvine, 2017; Nunes et al., 2018). Tree cover heavily affects other odonate species in terms of landscape patchiness (Dolný et al., 2014; Rith-Najarian, 1998; Suhonen et al., 2010; Suhonen et al., 2013). Loss of plant cover due to fires increases ambient temperature, which can drastically affect odonate survival (Castillo-Pérez et al., 2021). Based on this information and on knowledge of the species, we selected variables that we hypothesized were drivers of S. eustalacta distribution (Collins & Mcintyre, 2015; Theischinger & Hawking, 2006): vegetation continuous fields (VCF; defined as percent of pixel covered by each field) for percent tree cover, non-tree cover, and non-vegetated cover (percent tree cover subtracted from percent non-tree cover) (MOD44B, 250 m yearly resolution), annual evapotranspiration (MOD16A3, 500 m yearly resolution), Normalized Difference Vegetation Index (NDVI; MOD13A2, 1 km monthly resolution), and Burned Area data product (MDC64A1, 500 m monthly resolution). We calculated yearly averages of each MODIS variable to capture annual variability and resampled all variables to the coarsest resolution (2.5 arcminutes). Pixel values for Burned Area Product range from 0 (unburned) to 365 (366 for leap year), corresponding to the days of the year. From these data, we generated annual Burned Area layers by converting these values to binary (pixel values >1 = burned, 0 = unburned). Finally, we acquired categorical rasters representing Australia’s major vegetation groups from The Biodiversity and Climate Change Virtual Laboratory (BCCVL, https://bccvl.org.au/), and global terrestrial ecoregions from the World Wildlife Fund (https://www.worldwildlife.org). All raster analyses were conducted with the R package raster v3.5-29 (Hijmans et al., 2021).
Species Distribution Modeling
Before modeling, we processed our occurrences to account for sampling bias, delineated a study extent to sample background points, and omitted highly correlated environmental variables. We spatially thinned occurrences by 10 km to reduce the effects of sampling bias and artificial clustering (Veloz, 2009) using the spThin package (Aiello-Lammens et al., 2015), which resulted in more even sample sizes across years (n = 133 total; Table 1). Synthemis eustalacta is endemic to southern Australia and disperses roughly 500 m from streams in early adulthood, but upon reaching maturity, returns to its site of emergence (Theischinger & Hawking, 2006). We thus chose a study extent to include potentially unsampled areas yet exclude large areas outside the species’ dispersal limitations (Peterson & Soberón, 2012), defined as a minimum convex polygon around all localities (2001-2020) buffered by 1 degree (approx. 111 km at equator). Within this extent, we randomly sampled 50,000 background points for modeling and extracted their yearly environmental values. We used these values to calculate correlations between variables using the ‘vifcor’ and ‘vifstep’ functions in the usdm package v 1.1-18 (Naimi, 2017) and filtered out variables by year with correlation coefficients higher than 0.9 and a VIF threshold of 10. Finally, we retained variables for analysis which were kept among all yearly environmental backgrounds.
To model the distribution of S. eustalacta, we used the presence-background algorithm Maxent v3.4.4 (Phillips et al., 2017), which remains one of the top-performing models for fitting SDMs with background data (Valavi et al., 2021). To automate model building and evaluation with different complexity settings and reporting of results, we used the R package ENMeval 2.0.0 (Kass et al., 2021). We constructed both dynamic models that incorporated data across years and static models that used year-specific data (Fig. 1). For the dynamic models, we extracted the environmental predictor values for each year from the occurrence and background points for that year and assembled them into a single training dataset. To construct a single background point dataset for the dynamic models, we extracted yearly environmental values for the same set of background points, then averaged these values across years per background point (we used the mode for categorical variables; Fig. 1). We evaluated models using random k-fold cross-validation, in which occurrences are randomly partitioned into a specified number of groups (i.e., “folds”), then models are sequentially trained on all groups but one (training data) and evaluated on the withheld group (validation data) (Hastie et al., 2009)—we used four folds (k = 4) for our evaluations. As random partitioning can result in spatial autocorrelation due to clustering within folds, spatial block cross-validation techniques are often prescribed to address this (Roberts et al., 2017). However, as our occurrences varied not only in space but also in time, and as we lacked enough records per time bin to additionally separate by temporal block, we chose to use simpler random partitioning for evaluation. In contrast to the single dynamic model, we also constructed one static model per year that used only occurrence and background environmental values for that year. For this approach, we did not make models for years with fewer than five associated occurrences (Phillips et al., 2017; Phillips & Dudík, 2008). For static models, we partitioned our data using the ‘leave-one-out’ strategy (referred to as “jackknife” in ENMeval), whereby one occurrence record is withheld from each model during cross-validation (k = n, or the number of occurrences). This cross-validation technique is most appropriate for small sample sizes, as it allows for the largest training datasets possible for model-building during cross-validation (Pearson et al., 2006; Shcheglovitova & Anderson, 2013). All final models were fitted to the full datasets.
As Maxent models can lead to very different results when complexity settings are changed (Radosavljevic & Anderson, 2014; Warren & Seifert, 2011), we compared dynamic and static models with settings representing “simple” and “complex”. Feature classes determine the shape of the model fit, while regularization multipliers control how much complexity is penalized—high regularization multipliers can result in predictor variable coefficients shrinking to zero and thus dropping out of the model (Phillips et al., 2017). Model tuning consists of varying complexity settings, running models with combinations of these settings, and selecting optimal settings based on performance metrics. For tuning, we chose a simple model with a single linear feature class (L) and a complex model with a combination of linear, quadratic, and hinge feature classes (LQH). Hinge features provide flexible linear fits that can change direction and can considerably inflate the number of model parameters when included (Phillips & Dudík, 2008). To both allow for flexibility in the growth of model complexity and simplify comparisons between models, we set the regularization multiplier to one (the Maxent software default) for each model.
We assessed the dynamic and static models using averages of threshold-dependent (omission rate) and threshold-independent (AUC) discrimination performance metrics calculated on withheld validation data, but also checked the results against the Akaike Information Criterion with correction for small samples sizes (AICc; calculated on the full dataset) (Warren & Seifert, 2011). Omission rate measures the proportion of withheld occurrence localities (i.e., those in the validation dataset) that fall outside a designated model prediction threshold. We assessed the 10-percentile omission rate, which sets this threshold as the lowest suitability value for occurrences after removing the lowest 10% suitability values—this avoids choosing very low thresholds that may represent outliers (Kass et al., 2021; Radosavljevic & Anderson, 2014). Validation AUC is a measure of discrimination accuracy that can be used to make relative comparisons between SDMs with different settings fit on the same data (Radosavljevic & Anderson, 2014), but is problematic to use as an absolute performance measure for presence-background models (Lobo et al., 2008).
To investigate model behavior, we examined predictor variable importance values and compared background density distributions of important variables for intense fire years between models. First, we recorded the permutation importance values for predictor variables reported by Maxent. These values are calculated by randomly permuting the values of all variables but one, building a new model, then calculating the difference between each model’s training AUC and that of the empirical model (Phillips, 2021). Next, as we attributed summaries of predictor variables to the background points for the dynamic models, it was possible that years with extreme values (i.e., intense fire years) may have environmental distributions that are very different than these summaries, which would indicate that dynamic models could have overly smooth predictions for such years. To determine if any key differences existed between the dynamic model’s averaged background and the static models’ yearly backgrounds, we plotted background density distributions of variables with high importance for the years 2009 and 2019, then compared the shapes of these distributions with pairwise Kolmogorov-Smirov goodness-of-fit (KS) tests. We chose these two years because as they had the largest and most intense bushfires in the past 20 years in southern Australia (Van Eeden et al., 2020), we expected them to have the most extreme environmental differences between yearly and averaged backgrounds. For these plots, we chose the top three variables with the highest permutation importance shared between the simple and complex settings separately for the dynamic and static models.
We made habitat suitability predictions for S. eustalacta over the study extent for each year using both the simple and complex settings for the dynamic and static models. We generated predictions by transferring all models onto our environmental variables, creating an envelope of suitability across our study extent. Maxent raw predictions were transformed to a scale of 0-1 to approximate probability of occurrence using the ‘cloglog’ transformation (Phillips et al., 2017). We generated predictions for all years with the dynamic models, but only for years with sufficient (>4) occurrence records for our static models. As an additional check for differences between the two approaches, we mapped the difference in predictions of our simple and complex dynamic and static models for the intense fire years of 2009 and 2019. To better visualize raster subtractions, we generated a binary threshold prediction for our fire-prone years, calculated from the 10-percentile omission rate from our model evaluation.
Data is stored as comma-separated values (CSV) for occurrences, Tag Image File Format (.tiff) for environmental variable raster data, and R for programming and data manipulation. Programs required to open data include R and RStudio and QGIS.
National Science Foundation, Award: 2002473
Japan Society for the Promotion of Science