Modeling data and R code for Chrysodeixis chalcites ecological niche
Data files
Jul 08, 2024 version files 89.33 MB
-
Occurrence_Data.zip
184.59 KB
-
README.md
3.30 KB
-
Scripts.zip
15.83 KB
-
With_HI.zip
44.59 MB
-
Without_HI.zip
44.53 MB
Abstract
The golden twin-spot moth, Chrysodeixis chalcites Esper (Lepidoptera: Noctuidae), is a polyphagous, polyvoltine crop pest occurring natively from northern Europe to Mediterranean Africa and the Canary Islands. Larvae feed on a wide variety of naturally occurring plants as well as soybean and other legume crops, short staple cotton, tomato, potato, peppers, tobacco, and banana. Chrysodeixis chalcites has been recorded in agricultural lands in the Ontario peninsula in eastern Canada and in northern counties of Indiana, USA. Given the strong potential for C. chalcites to invade USA crop lands, it is important to identify environments most likely to sustain growing populations of this pest. Though C. chalcites is native to Europe and North Africa, it has invaded sub-Saharan Africa. Using occurrence data form the native and invaded ranges, and environmental predictors including bioclimatic conditions and human disturbance, we trained three ecological niche models to estimate an ensemble prediction of environmental suitability in the contiguous US. Because human impact is potentially a confounding predictor, models were trained both with and without it. High environmental suitability was projected for the Atlantic coast from New England to Florida, the Gulf coast, the lower Midwest, and the Pacific coast and Central Valley of California.
README: Raw data, supplemental data, and output
https://doi.org/10.5061/dryad.g1jwstqwv
Description of the data and file structure
Raw and output data are compressed as zip files that will open into four folders, three including data and one including scripts (see next heading)
Folder: Occurence_Dats contains input data for analyses.
Chrysodeixis_chalcites_home_clean.csv
: Occurrence data (lon and lat) for Chrysodeixis chalcites from the home range (Europe and the Mediterranean) with associated meta-data. Downloaded from GBIF.org and sanitized by removing duplicates and points outside the invaded range, and by stratified sampling to counter collector bias.Chrysodeixis_chalcites_invaded_clean.csv
: Occurrence data (lon and lat) for Chrysodeixis chalcites from the invaded range (subSaharan Africa) with associated meta-data. Downloaded from GBIF.org and sanitized by removing duplicates and points outside the invaded range, and by stratified sampling to counter collector bias.Chrysodeixis_chalcites_train.csv
: After combining data from home and invaded range, 80% of lon/lat data without associated metadata were randomly selected to train species distribution models.Chrysodeixis_chalcites_train.csv
: After combining data from home and invaded range, 20% of lon/lat data without associated metadata were randomly selected to evaluate model accuracy.
Folders: With HI and Without HI contain output files indicating whether models were trained with or without the Human Impact layer. File names all begin with Chrysodeixis_chalcites and indicate the model that generated them (GAM, ME, BRT, ENS), and the kind of output (.mod/.tif)
Models:
- -GAM: generalized additive model
- -ME: maximum entropy
- -BRT: boosted regression trees
- -ENS: ensemble; the Boyce Index weighted average of predictions from three models.
Output types
.mod
: an R model object (list). Because ensembles are averages of other model predictions, there is no .mod file for ensemble predictions. To import a .mod file into the R environment, use the readRDS() command, e.g. \GAM<-readRDS('/Users/cpr003/R-temps/Chrysodeixis_chalcites_MS/V3/With_HI/Chrysodeixis_chalcites_GAM.mod')
.tif
: geotif image file representing global predictions of environmental suitability derived from each model.
Additionally, each folder contains a csv table including evaluations from model predictions named Chrysodeixis_chalcites_evaluation.csv
. Evaluation metrics include the Boyce Index (Boyce), the Area Under the Receiver Operator characteristic Curve (AUC), Pearson's correlation coefficient (COR), and Cohen's Kappa (Kappa).
Missing data code: NA
Code/Software
Scripts were written by Chris Randle, Nick Galle, and Kayla Hankins, and relied heavily on tutorials provided by Robert Hijmans at rspatial.org. Scripts are stored as markdown files (.Rmd) and are heavily annotated for ease of use. There are two scripts, one for generating predictions using human impact as a predictor with_HI.Rmd
, and one for generating predictions without using human impact as a predictor without_HI.Rmd
. These can be found in the compressed file Scripts.zip
Methods
Collection and Sanitization of Occurrence Data
Occurrence records for Chrysodeixis chalcites were obtained through the Global Biodiversity Information Facility (GBIF 2022). All occurrence records in the dataset without latitude or longitude were removed. The native range of Chrysodeixis chalcites includes Europe, the Middle East, and North Africa (Sullivan and Molet 2007, CABI 2022). Occurrence points in sub-Saharan Africa represent invasive populations. On one hand, it has been suggested that invasive populations may not have fully occupied the entire landscape available to them, violating the assumption of species-environment equilibrium (Guisan and Thuiller 2005), which would warrant their exclusion. On the other hand, these populations may also have evolved to occupy additional environments not available to the native populations (niche extension), and therefore may broaden the predicted areas of environmental suitability (Ginal et al. 2022). Given the importance of early detection, we have incorporated occurrences from both ranges to maximize the predicted niche in the US. Occurrence points from Asia and the Pacific are likely C. eriosoma, sometimes considered to be a separate species from C. chalcites (Murillo et al. 2013), and so these were excluded. After data were downloaded and filtered to include only likely C. chalcites occurences, duplicate points were removed. To mitigate the effects of collector bias (Phillips et al. 2009), we stratified occurrences by removing all points with duplicate latitudes and longitudes when rounded to a tenth of a degree. After this cleaning process, the dataset included 1052 occurrence records (Fig. 1). 20% of these records were randomly chosen and set aside to assess model accuracy. The remaining 80% were used to train the models.
Generation of Background/Pseudoabsence Data
All models used in this study require background/absence points. To mitigate the effects of sampling bias inherent in occurrence data, background data were sampled from within the range of presence data (Phillips et al. 2009). First, k-means spatial clusters were identified (k=1:15), and the smallest number of clusters was chosen that simultaneously minimized the sum of squared average geographical distance. Circles were generated around each occurrence point in the training data set, with the radius being equal to the average distance among occurrence points within each cluster. Circles were merged into a buffer representing the range of occurrence points, and pseudoabsence and background points were drawn randomly from within this buffer.
Predictor Variables
Thirty-five environmental variables were obtained from the CliMond Bioclim dataset (Kriticos et al. 2012) drawn from the period 1961–1990. A layer summarizing human impact (HI), as a composite of nine global data layers covering human population pressure (population density), human land use and infrastructure (built-up areas, nighttime lights, land use/land cover), and human access in the form of coastlines, roads, railroads, navigable rivers (Wildlife Conservation Society 2005). The use of such a human impact layer may be confounding. On one hand, many invasive pests are well-adapted to disturbance. However, such a pattern could also be generated by collection practices that are biased toward locations close to human-made structures such as roads and settlements (Kadmon et al. 2004; Moerman and Estabrook 2006; Williams and Lutterschmidt 2006, Phillips et al. 2009; Daru et al. 2018). Therefore, models were trained both with and without HI predictor data.
To avoid model inaccuracy due to collinearity of predictor variables (De Marco et al. 2018) contribution of each variable to overall collinearity was estimated as the Variance Inflation Factor (VIF), which regresses each predictor against all the others simultaneously (Guisan et al. 2002, 2017). A stepwise algorithm for reducing collinearity was performed by first estimating VIF for all predictor values associated with occurrences, removing the layer contributing most to collinearity, and recalculating VIF, until remaining layers had VIF values less than 10 (Chatterjee and Yilmaz 1992); this was performed using the vifstep function in the ‘usdm’ package (Naimi et al. 2014).
For GAM, the relative effect of each predictor was measured as a z-score, or the number of standard deviations from the mean under the null expectation of no-effect. On the other hand, predictor importance for the machine learning models ME and BRT was measured by relative contribution. These evaluations are clearly meant to establish relative importance of predictors within rather than among trained models. However, for uniformity of presentation, z-scores for the GAM model were converted to relative contribution scores by calculating their contribution to overall variance.
Ecological Niche Models
We employed three models that differed markedly in underlying statistical properties, the Generalized Additive Model (GAM), Maximum Entropy (ME), and Boosted Regression Trees (BRT). The Generalized Additive Model (GAM) was trained using the ‘mgcv’ package (Wood 2022). Maximum Entropy models (ME) were constructed using MaxEnt software (Phillips et al. 2006, 2023), run through the ‘dismo’ package (Hijmans et al. 2023). Booted Regression Tree (BRT) models were constructed using the ‘dismo’ package as well using tree complexity of 5 with a learning rate of 0.001. To train the Generalized Additive Model (GAM), we sampled absence points equal to the number of occurrences used to train the model. For Maximum Entropy (ME) and Boosted Regression Tree (BRT) models, we used 10,000 background/pseudoabsence points. Models were then used to predict probability of presence, called ‘environmental suitability’ here, for the contiguous US.
For GAM, the relative effect of each predictor was measured as a z-score, or the number of standard deviations from the mean under the null expectation of no-effect. On the other hand, predictor importance for the machine learning models ME and BRT was measured by relative contribution. These evaluations are clearly meant to establish relative importance of predictors within rather than among trained models. However, for uniformity of presentation, z-scores for the GAM model were converted to relative contribution scores by calculating their contribution to overall variance.
Model Evaluation and Ensemble Prediction
Accuracy of the models in predicting withheld occurrence data was evaluated using four metrics: the Boyce Index (BI; Hirzel et al. 2006), Area Under the receiver operating characteristic Curve (AUC), Pearson’s correlation coefficient (COR), and Cohen’s kappa (kappa). BI was calculated using the ‘ecospat’ Package (Broenimman et al. 2017), while the other three metrics were estimated using the ‘dismo’ package (Hijmans et al. 2023). Because AUC and kappa have been criticized for assessing accuracy when true absence data are not available (Jarnevich et al. 2015), we have relied most heavily on the Boyce Index in assessing the predictive accuracy of models. Because these three modelling approaches are philosophically and mathematically different, agreement in prediction among them is likely due to correct discrimination between signal and noise. Therefore, we incorporated an ensemble approach (Araujo and New 2007; Stohlgren et al. 2010; Valavi et al. 2022) to summarize models, by calculating the BI-weighted average of GAM, ME, and BRT predictions of environmental suitability. User accuracy metrics were also calculated for ensemble predictions.