Skip to main content

Data from: Temperature, topography, soil characteristics, and NDVI drive habitat preferences of a shade-tolerant invasive grass

Cite this dataset

Bowen, Anna; Stevens, Martin (2021). Data from: Temperature, topography, soil characteristics, and NDVI drive habitat preferences of a shade-tolerant invasive grass [Dataset]. Dryad.


Aim: Despite the large literature documenting the negative effects of invasive grasses, we lack an understanding of the drivers of their habitat suitability, especially for shade-tolerant species that do not respond positively to canopy disturbance. We aimed to understand the environmental niche and potential spatial distribution of a relatively new invasive species, wavyleaf basketgrass (Oplismenus undulatifolius (Ard.) Roem. & Schult, WLBG) by leveraging data available at two different spatial scales.

Location: Mid-Atlantic region of the United States

Methods: Maximum entropy modeling (Maxent) was used to predict the habitat suitability of WLBG at the regional and the landscape scale. Following variable evaluation, model calibration, and model evaluation, final models were created using 1,000 replicates and projected to each study area.

Results: At the regional scale, our best models show that suitability for WLBG was driven by relatively high annual mean temperatures, low temperature seasonality and monthly range, low slope, and high cumulative normalized difference vegetation index (NDVI). At the landscape scale, suitability was highest near roads and streams, far from trails, at low elevations, in sandy, moist soil, and in areas with high NDVI. Main

Conclusions: We found that invasion potential of this relatively new invader appears high in productive, mesic habitats at low slope and elevations. At the regional scale, our model predicted areas of suitable habitat far outside areas where WLBG has been reported, including large portions of Virginia and West Virginia, suggests serious potential for spread. However, large portions of this area carry a high extrapolation risk and should therefore be interpreted with caution. In contrast, at the landscape level, the suitability of WLBG is largely restricted to areas near current presence points, suggesting that the expansion risk of this species within Shenandoah National Park is somewhat limited


1. Regional scale model

            To create a maximum entropy model (Maxent) for WLBG at the regional scale, we first selected a study area that included Delaware, District of Columbia, Maryland, New Jersey, Pennsylvania, Virginia, and West Virginia (Fig. 2a).  These states were chosen because WLBG either occurs in that state or a directly adjacent state (EDDMapS, 2019).

We compiled 2,364 WLBG presence points from three sources: a 2016 survey for WLBG in Shenandoah national park (Bowen and Stevens, unpub. data), previously discovered presence points from Shenandoah park staff (J. Hughes, pers. communication), and from the early detection and distribution mapping system (EDDMapS, 2019).  To reduce the amount of overfitting around spatially autocorrelated points, we spatially filtered these points down to one point per 250 m2 cell (Beauchamp et al., 2013; Jarnevich & Reynolds, 2011; Mainali et al., 2015), which resulted in 404 points total (Fig. 2a). The minimum, maximum, and mean distance between all points was 250 m, 368,037 m, and 1,800 m, respectively.

            Thirty-eight environmental layers were obtained for predictors in Maxent and were clipped to the study area at 250 m2 resolution (Table S1).  Predictors included climate variables from BIOCLIM v. 2 (R.J. Hijmans, Cameron, Parra, Jones, & Jarvis, 2005), phenological variables (U.S. Geological Survey, 2016), topographical variables (U.S. Geological Survey, 2013), distance to roads (USDA NRCS, 2016), distance to streams (USDA NRCS, 2016), and soil variables (Soil Survey Staff, 2018) (Table S1).  Pairwise Pearson correlations for predictor values across the entire study area were used to determine collinearity between all variables (r ≥ 0.8 or r ≤ -0.8) (Jarnevich & Reynolds, 2011; Lemke et al., 2011).  Half of these 38 variables were dropped due to collinearity, resulting in 19 variables.  We chose to avoid categorical variables such as land cover and soil type in favor of continuous variables due to the tendency for maxent to overemphasize the importance of categorical variables (M. Cobos, pers. comm.).

These 19 variables were further investigated and reduced to 10 variables to avoid potential overfitting and computational limitations (Cobos et al., 2019; Townsend Peterson, Papeş, & Eaton, 2007).  This investigation was done by evaluating the output of a preliminary set of 70 candidate models and selecting the top predictors for use in final model calibration.  This set of 70 candidate maxent models contained all possible combinations of all 19 predictors, 14 regularization parameter values (0.1-1 at intervals of 0.1 and 2-5 at intervals of 1), and five feature classes (l, lq, lqp, lqpt, and lqpth, where l = linear, q = quadratic, p = product, t = threshold, and h = hinge), using the kuenm package in R (Cobos et al., 2019).  The regularization parameter controls the intensity of the chosen feature class and can smooth these functions in order to prevent overfitting (Morales, Fernández, & Baca-González, 2017).  Feature classes, on the other hand, allow mathematical transformations of the data in order for complex (or simple) relationships to be modeled (Morales et al., 2017).  There have been several recent calls for investigating these parameter options, as default values may not always be appropriate and do not penalize model complexity (Cobos et al., 2019; Phillips & Dudík, 2007; Warren & Seifert, 2011).

Receiver operating characteristic (ROC) curves, which measure model performance, can become inflated with over-parameterization (Jane Elith et al., 2010; Vanderwal et al., 2009), so the following steps were taken both in preliminary and final model calibration to avoid ROC inflation.  We restricted the area from which background points were selected using a buffer indicating how far WLBG could have reached if conditions were suitable (Jane Elith et al., 2010) using the distance between the earliest record of WLBG (K. Kyde, pers. communication) and the maximum distance from that record to another presence point.  Background points (10,000) were then created within that buffer and used for model creation.  25% of presence points were set aside for model testing (Liang et al., 2014; Phillips et al., 2006).  Model iterations, therefore, contained 303 training and 101 testing points. 

We evaluated the model performance of these 70 candidate models based on significance (partial ROC with 500 iterations and 50% of data for bootstrapping), omission rates (E = 5%), and model complexity (AICc) (Cobos et al., 2019).  We then used the jackknife output of the best model to select the top ten contributing variables, using the sum of the gain with only that variable and the largest gain lost without that variable. The final set of ten variables were: annual mean temperature, temperature seasonality, temperature annual range, diurnal range, precipitation seasonality, end of season time (EOST), time-integrated NDVI (TIN), elevation, and slope.  Temperature seasonality represents the variation in temperature across the year (standard deviation * 100), while diurnal range is a measure of monthly temperature range (mean of monthly (max temp - min temp)).  EOST is the ending time of the growing season (in day of the year), while TIN is the cumulative value of NDVI from the start to the end of the growing season (unitless-based on accumulated NDVI units).

Using these ten predictor variables, 70,910 candidate models were created, using all possible combinations of at least two of those ten variables (1,013 sets of variables), the 14 regularization parameters and 5 response feature classes as above, as well as the same background point selection and proportion of testing and training points. 

Once the best model was selected using the same measures of performance as with preliminary model calibration (above), a final model was created by transferring this best model to the full study area.  We ran 1,000 replicates of this model by bootstrapping 50% of the testing data. We compared the response curves for each of the following model outputs: free extrapolation, extrapolation with clamping, and no extrapolation (Cobos et al., 2019) and selected the model with the most realistic output curves, which was extrapolation with clamping.  Finally, these extrapolation risks were evaluated with the mobility-oriented parity (MOP) metric, which calculates multivariate distances between points in the projection and calibration regions (Owens et al. 2013).  It is vital to understand extrapolation risks when transferring a model to a new geographical area or to a new time period to avoid making overly-strong conclusions regarding predicted suitable areas (Owens et al., 2013).  Extrapolation may occur in a variety of ways.  Strict extrapolation occurs when values within the study area are outside the range of those within the background training area, while multivariate or combinational extrapolation occurs when values may be within the same range but represent new combinations of predictors (Owens et al., 2013).

Four maps were created with the results of final model creation: a map of mean predicted suitable habitat between the 1,000 replicates, a binary map showing predicted suitable and unsuitable habitat using the maximum training sensitivity plus specificity threshold calculated by maxent (mean threshold value across the 1,000 runs) (Lemke et al., 2011; Liang et al., 2014), a map of standard deviation between predictions from these replicates (Jarnevich & Reynolds, 2011), and a map of extrapolation risk.  In addition, the area of suitable habitat (prediction ≥ 0.224 via the maximum training sensitivity plus specificity threshold) and highly suitable habitat (prediction ≥ 0.5) (Beauchamp et al., 2013) were calculated using the mean model and the latter was compared to the area found in Beauchamp et al.’s (2013) model within an estimate of their study area.

2. Landscape scale model

            Shenandoah National Park (SHEN) was used as a case study for a landscape-scale model of habitat suitability for WLBG.  Not only does SHEN occupy a large elevational range and a variety of forest habitats, the staff at this park have had an active interest in the spread of WLBG within the park since its discovery in 2005 (J. Hughes, pers. comm.).  Over 1,000 presence points have been recorded in SHEN, both from haphazard staff surveys and from a stratified survey throughout the park in 2016 (Bowen and Stevens, unpub. data, J. Hughes, pers. comm.). 

            As with the regional-scale model, presence points (1,579) were compiled and spatially filtered to one point per 30 m2 cell to avoid overfitting (475 points).  However, initial runs of this model indicated that overfitting was likely to have occurred due to (i) the high concentration of higher prediction values immediately surrounding presence points and (ii) the appearance of high spatial autocorrelation for presence points despite the filtering process.  Therefore, points were further filtered to one point per 250 m2 cell as with the regional model, this time using the exact location of each point rather than the centroid of the 250 m2 cell.  This filtering reduced the number of points to 82 and reduced clustering (Fig. 2b).  The minimum, maximum, and mean distance between all points was 38 m, 78,021 m, and 274 m, respectively.

            Twelve environmental layers were obtained and clipped to the study area at a 30 m2 resolution (Table S1), where they were then investigated for collinearity and contribution using the methods described above.  Several of these layers were also used in the regional scale model, but climate and phenology layers were not used due to their low spatial resolution.  Distance to trails (SHEN park staff) and NDVI (Richardson et al., 2017) were added as new layers for this model.  Pairwise Pearson correlations for these twelve predictors were calculated as with the regional scale model, and one variable was removed (soil silt content) as it was correlated with soil sand content (r > 0.70).  Using a preliminary set of 70 candidate models as with the regional model and these eleven environmental variables, the jackknife of the best model was used to identify the top ten contributing variables.  One variable (soil clay content) contributed very little to the model and was removed.  Background point selection was not restricted as it was with the regional scale model as our interest was not with extrapolation to the region but with WLBG suitability in SHEN.  The final set of ten variables were distance to roads, distance to trails, distance to streams, elevation, slope, aspect, soil ph, soil sand content, soil available water storage, and NDVI.

            Candidate models were then created as with the regional scale model by using these ten predictor variables and the same combinations of regularization parameters and feature classes (70,910 candidate models).  25% of points were restricted for testing (20 points), leaving 62 points for model training.  Model evaluation was performed as above, using partial ROC, omission rates, and AICc. 

A final model with 1,000 replicates was created as with the regional scale model, from which mean suitability, suitabile vs. unsuitable (prediction ≥ 0.188 via the maximum training sensitivity plus specificity threshold and 0.50, respectively), and standard deviation maps were created.  The area of suitable and highly suitable habitat were also calculated for SHEN. 

All final maps were created in ArcMap version 10.4.1 (ESRI, 2016).  All analyses including maxent were done in R version 3.5.2 (R Core Team, 2018) using packages factoextra (Kassambara & Mundt, 2017), kuenm (Cobos et al., 2019), rattle (G. J. Williams, 2011), raster (Robert J. Hijmans, 2017), and sp (Bivand, Pebesma, & Gomez-Rubio, 2013; Pebesma & Bivand, 2005).  Candidate model creation and evaluation for the regional scale were run using the high-performance cluster at Miami University, and model creation and evaluation took 12 days and 41 days, respectively.    

Usage notes

We have uploaded a ReadMe text file with information on this dataset (Bowen_3_25_2020_README.txt). 

My contact information and some introductory information about the dataset is listed here in addition to in the README text file:
Miami University, OH
Data and scripts for creating a ecological niche model for wavyleaf basketgrass (WLBG)
Manuscript title: Temperature, topography, soil characteristics, and NDVI drive habitat preferences of a shade-tolerant invasive grass
Journal: Ecology and Evolution

The README file contains information about what files are available in this repository, including scripts, original data files, and created environmental layers. MD5 checksums codes are listed for each file.

All final maps were created in ArcMap version 10.4.1 (ESRI, 2016).  
All analyses including maxent were done in R version 3.5.2 (R Core Team, 2018) using packages factoextra (Kassambara & Mundt, 2017), kuenm (Cobos et al., 2019),
 rattle (G. J. Williams, 2011), raster (Robert J. Hijmans, 2017), and sp (Bivand, Pebesma, & Gomez-Rubio, 2013; Pebesma & Bivand, 2005).  
Candidate model creation and evaluation for the regional scale were run using the high-performance cluster at Miami University, and model creation and evaluation took 12 days and 41 days, respectively.