Data for: Spatial prediction of plant invasion using a hybrid of machine learning and geostatistical method
Data files
Apr 10, 2026 version files (22.40 MB)
- newinvasion.csv (22.39 MB)
- README.md (8.08 KB)
Abstract
Modelling ecological patterns and processes often involves large-scale, complex, high-dimensional spatial data. Because ecological data are nonlinear and multicollinear, traditional geostatistical methods struggle to achieve high model accuracy. As machine learning has increased our ability to construct models on big data, the main focus of this study is to propose statistical models that hybridize machine learning and spatial interpolation methods to cope with increasingly large-scale and complex ecological data. Here, two machine learning algorithms, boosted regression tree (BRT) and least absolute shrinkage and selection operator (LASSO), were combined with ordinary kriging (OK) to model plant invasions across the eastern United States. The accuracy of the hybrid models and conventional models was evaluated by 10-fold cross-validation. Based on an invasive plant dataset of 15 ecoregions across the eastern United States, the results showed that the hybrid algorithms predicted plant invasion significantly better than commonly used algorithms in terms of RMSE and a paired-samples t-test (p < 0.0001). In addition, the combined algorithms can select influential variables associated with the establishment of invasive cover, which cannot be achieved by conventional geostatistics. Higher accuracy in the prediction of large-scale biological invasions improves our understanding of the ecological conditions that lead to the establishment and spread of plants into novel habitats across spatial scales. The results demonstrate the effectiveness and robustness of the hybrid models BRTOK and LASOK, which can be used to analyze large-scale and high-dimensional spatial datasets and offer an additional set of models for spatial interpolation of ecological properties. They also provide a better basis for management decisions in early-detection modelling of invasive species.
The U.S. Forest Inventory and Analysis (FIA) program has been collecting data on invasive plant occurrence and distribution across all public and private U.S. forests for several decades. It provides large-scale samples and high-dimensional variables that can be used in statistical models to reflect local ecological differences among plots that vary in environment and in invasion by non-native species (Cleland et al., 1997).
Description of the data and file structure
newinvasion.csv: This is the main dataset at the plot level, containing invasive plant cover and 41 ecological variables used as auxiliary predictors to improve spatial prediction. Detailed descriptions of all variables are provided below.
- LAT: Latitude in decimal degrees (Iannone et al., 2016)
- LON: Longitude in decimal degrees (Iannone et al., 2016)
- Mean_Annual_Temp: Mean annual temperature (°C × 100) (Iannone et al., 2016)
- annual_Precip: Annual precipitation (mm) (Iannone et al., 2016)
- Seasonability: Standard deviation of mean annual temperature (Iannone et al., 2016)
- Alt: Altitude in m (Iannone et al., 2016)
- PLT_TPA: Trees/acre (Iannone et al., 2016)
- Tpha: Trees/hectare (Iannone et al., 2016)
- RelDen: Ranges from 0 to 1; reflects the proportion of potential growth on the plot, or successional development (Iannone et al., 2016)
- Prpfor: Proportion of plot that is forested (Iannone et al., 2016)
- plt_drybio_adj: Aboveground dry-weight biomass of native trees, English tons/acre (Iannone et al., 2016)
- plt_drybio_ha: Aboveground dry-weight biomass of native trees, English tons/hectare (Iannone et al., 2016)
- native_spp: Native tree species richness (Iannone et al., 2016)
- PD_all: Phylogenetic diversity of tree species (Iannone et al., 2016)
- PSV_all: Phylogenetic tree species variability (Iannone et al., 2016)
- PSV_all_var: Variance of phylogenetic tree species variability (Iannone et al., 2016)
- PSR_all: Phylogenetic tree species richness (Iannone et al., 2016)
- PSR_all_var: Variance of phylogenetic tree species richness (Iannone et al., 2016)
- PSE_all: Phylogenetic tree species evenness (Iannone et al., 2016)
- PSC_all: Phylogenetic tree species clustering (Iannone et al., 2016)
- InvTotalCover: Sum of cover estimates for all invasive plants (can be greater than 100%) (Iannone et al., 2016)
- anmeantemp: Annual mean temperature (https://www.worldclim.org/)
- anprecip: Annual precipitation (https://www.worldclim.org/)
- Isotherm: Isothermality (Mean diurnal range / temperature annual range) x 100 (https://www.worldclim.org/)
- maxtempwarm: Maximum temperature of warmest month (https://www.worldclim.org/)
- meandiurnrge: Mean diurnal range (mean of monthly (max temp - min temp)) (https://www.worldclim.org/)
- meantempwetq: Mean temperature of wettest quarter (https://www.worldclim.org/)
- meantempdryq: Mean temperature of driest quarter (https://www.worldclim.org/)
- meantempwarm: Mean temperature of warmest quarter (https://www.worldclim.org/)
- meantempcold: Mean temperature of coldest quarter (https://www.worldclim.org/)
- mintempcold: Minimum temperature of coldest month (https://www.worldclim.org/)
- precipwetm: Precipitation of wettest month (https://www.worldclim.org/)
- precipdrym: Precipitation of driest month (https://www.worldclim.org/)
- precipseason: Precipitation seasonality (coefficient of variation) (https://www.worldclim.org/)
- precipwetqu: Precipitation of wettest quarter (https://www.worldclim.org/)
- precipdryqu: Precipitation of driest quarter (https://www.worldclim.org/)
- precipwarmqu: Precipitation of warmest quarter (https://www.worldclim.org/)
- precipcoldqu: Precipitation of coldest quarter (https://www.worldclim.org/)
- soilcarbon: Soil carbon in 0 to 20 cm depth (http://www.isric.org)
- tempanrge: Temperature annual range (warmest – coldest temperature) (https://www.worldclim.org/)
- tempseason: Temperature seasonality (standard deviation x 100) (https://www.worldclim.org/)
- Aridity: Aridity index (mean annual precipitation / mean annual potential evapotranspiration) (www.cgiar-csi.org/data).
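As an illustration of how the units listed above might be handled when reading the file, here is a minimal Python sketch. It is not part of the study's R code. The sample rows are invented for demonstration, the helper names (parse_value, read_plots, temp_celsius, trees_per_hectare) are hypothetical, and the assumption that Tpha equals PLT_TPA converted from acres to hectares follows only from the stated units.

```python
import csv
import io

ACRES_PER_HECTARE = 2.47105  # 1 hectare is about 2.47105 acres

def parse_value(text):
    """Return None for the "NA" missing-value code, a float when possible,
    and the original text otherwise (assumption: most columns are numeric)."""
    text = text.strip()
    if text == "NA" or text == "":
        return None
    try:
        return float(text)
    except ValueError:
        return text

def read_plots(handle):
    """Read plot rows from an open CSV handle into dicts of parsed values."""
    return [{name: parse_value(value) for name, value in row.items()}
            for row in csv.DictReader(handle)]

def temp_celsius(row):
    """Mean_Annual_Temp is documented as °C × 100; convert to °C."""
    value = row["Mean_Annual_Temp"]
    return None if value is None else value / 100.0

def trees_per_hectare(trees_per_acre):
    """Convert trees/acre (PLT_TPA) to trees/hectare (assumed to match Tpha)."""
    return trees_per_acre * ACRES_PER_HECTARE

# Invented sample rows using a few of the documented column names (not real data).
SAMPLE = """LAT,LON,Mean_Annual_Temp,PLT_TPA,Tpha,InvTotalCover
35.1,-83.2,1250,100,247.105,12.5
40.7,-77.9,NA,80,197.684,0
"""

plots = read_plots(io.StringIO(SAMPLE))
first_temp = temp_celsius(plots[0])
first_tpha = trees_per_hectare(plots[0]["PLT_TPA"])
```

For the real file, the same functions could be applied to `open("newinvasion.csv", newline="")` in place of the in-memory sample.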
Sharing/Access information
We obtained stand structural information (tree density and productivity), microhabitat (altitude, percent area forested), and diversity (species richness and phylogenetic diversity) of the tree communities based on FIA data from Iannone et al. (2016). Methods for these measurements can be found in Iannone et al. (2016). We obtained 19 climate variables from WorldClim Global Climate Data Version 1.4 (http://www.worldclim.org; Hijmans et al., 2005). We obtained an aridity index from Global Aridity Index (http://www.cgiar-csi.org/data; Trabucco & Zomer, 2017) and soil carbon from the World Soil Information (http://www.isric.org; Batjes, 2015).
Code/Software
All R scripts used in this study are available on Zenodo. The main scripts include:
- R_Code_for_15_ecoregions.txt: Model construction and prediction for 15 ecoregions
- R_Code_for_Figure3.txt: Code used to generate Figure 3 in the manuscript
- R_Code_for_FigureS1.txt: Code used to generate Supplementary Figure S1
- R_Code_for_Table2.txt: Code used to produce Table 2 results
- R_code_of_100_times_iteration.txt: Code for repeated model evaluation (100 iterations)
These scripts reproduce the main analyses and figures presented in the associated publication.
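The scripts listed above are written in R. For readers who want a language-neutral picture of the evaluation described in the abstract (10-fold cross-validation, RMSE, and a paired-samples t-test), the following Python sketch may help. It is an illustration under stated assumptions, not the study's code: the two models are simple stand-ins (a mean predictor and a straight-line fit) rather than BRT, LASSO, or kriging, the data are synthetic, and pairing per-fold RMSE values is an assumption about what is compared. Only the t statistic is computed, not a p-value.

```python
import math
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and split into k nearly equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def paired_t(a, b):
    """Paired-samples t statistic and degrees of freedom for a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1

def fit_mean(train):
    """Stand-in baseline: always predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Stand-in alternative: ordinary least-squares straight line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)

def cv_rmse(data, fit, folds):
    """Fit on all folds but one, score RMSE on the held-out fold."""
    scores = []
    for fold in folds:
        held = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held]
        model = fit(train)
        test = [data[i] for i in fold]
        scores.append(rmse([y for _, y in test], [model(x) for x, _ in test]))
    return scores

# Synthetic data (not from newinvasion.csv): y = 2x plus Gaussian noise.
rng = random.Random(1)
data = [(i / 10, 2.0 * (i / 10) + rng.gauss(0, 1)) for i in range(200)]
folds = kfold_indices(len(data), k=10)
rmse_mean = cv_rmse(data, fit_mean, folds)
rmse_line = cv_rmse(data, fit_line, folds)
t_stat, df = paired_t(rmse_mean, rmse_line)
```

Using the same folds for both models is what makes the comparison paired: each fold contributes one RMSE per model, and the t statistic is computed on the per-fold differences.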
Missing data description
Missing values in the dataset are represented as "NA". These indicate missing data due to unavailable or unrecorded information during data collection or processing. All missing values are consistently coded as "NA" to ensure clarity and compatibility with data analysis workflows.
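The "NA" coding can be checked before analysis. The Python sketch below counts missing cells per column and keeps only complete rows for a chosen set of columns. It is a minimal illustration, not the authors' workflow: the sample rows are invented, and the helper names count_missing and complete_cases are hypothetical.

```python
import csv
import io
from collections import Counter

MISSING = "NA"  # missing-value code stated in this README

def count_missing(rows):
    """Count cells equal to the missing-value code, per column."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if value.strip() == MISSING:
                counts[name] += 1
    return counts

def complete_cases(rows, columns):
    """Keep only rows with no missing value in the given columns."""
    return [row for row in rows
            if all(row[c].strip() != MISSING for c in columns)]

# Invented sample rows using a few of the documented column names (not real data).
SAMPLE = """LAT,LON,soilcarbon,Aridity,InvTotalCover
35.1,-83.2,NA,0.95,12.5
40.7,-77.9,41,NA,0
33.4,-84.6,38,1.10,3.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = count_missing(rows)
usable = complete_cases(rows, ["soilcarbon", "Aridity", "InvTotalCover"])
```

Whether to drop incomplete rows or handle them another way depends on the analysis; complete-case filtering is shown here only because it is the simplest option.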
References
Batjes, N. H. (2015). World soil property estimates for broad-scale modelling (WISE 30sec, version 1.0) (Report 2015/01). ISRIC – World Soil Information. http://www.isric.org/
Cleland, D. T., Avers, P. E., McNab, W. H., Jensen, M. E., Bailey, R. G., King, T., & Russell, W. E. (1997). National hierarchical framework of ecological units. In M. S. Boyce & A. Haney (Eds.), Ecosystem management applications for sustainable forest and wildlife resources (pp. 181–200). Yale University Press.
Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G., & Jarvis, A. (2005). Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology, 25, 1965–1978. https://doi.org/10.1002/joc.1276
Iannone, B. V., Potter, K. M., Hamil, K. A. D., Huang, W., Zhang, H., Guo, Q., Oswalt, C. M., Woodall, C. W., & Fei, S. (2016). Evidence of biotic resistance to invasions in forests of the eastern USA. Landscape Ecology, 31, 85–99. https://doi.org/10.1007/s10980-015-0243-0
Trabucco, A., & Zomer, R. J. (2017). Global aridity index and global potential evapotranspiration (global-PET) geospatial database. CGIAR Consortium for Spatial Information. https://www.cgiar-csi.org/data/global-aridity-and-pet-database. https://doi.org/10.6084/m9.figshare.7504448.v4
