Data from: Predicting invasion success of cultivated naturalized plants in China
Data files
Jan 03, 2025 version, files total 94.49 KB
- chklist.invasion2018_20230929_v24.csv (88.16 KB)
- README.md (6.32 KB)
Abstract
Plant invasions pose significant threats to native ecosystems, human health, and global economies. However, the complex and multidimensional nature of factors influencing plant invasions makes it challenging to predict and interpret their invasion success accurately. Using a robust machine learning algorithm, random forest, and an extensive suite of characteristics related to environmental niches, species traits, and propagule pressure, we developed a classification model to predict the invasion success of naturalized cultivated plants in China. Based on the final optimal model, we evaluated the relative importance of individual and grouped variables and their prediction performance. Our study identified key individual variables within each of the three groupings: climatic suitability and native range size (environmental niches), phylogenetic distance to the closest native taxon and vegetative propagation mode (species traits), and the number of botanical gardens and provinces where species were cultivated (propagule pressure). Remarkably, when variables were evaluated as groups, their relative importance increased dramatically, by 13.5 to 17.7 times, compared to the cumulative importance of individual variables within a category. However, the relative importance of a category was primarily due to the number of variables it contained rather than its inherent characteristics.
Synthesis and applications. Our findings emphasize the necessity of developing data-driven predictive tools for effective invasion risk assessment using large datasets. We also highlight the importance of grouped variables in enhancing model interpretability. For practical application in China, we recommend prioritizing surveillance of alien plant species with large native ranges and high climatic suitability. Implementing a tiered risk assessment system based on our random forest model can allow for a more effective allocation of resources for monitoring and managing invasive species. Ultimately, interdisciplinary collaboration is crucial for implementing and applying these predictive tools, thereby protecting biodiversity, ecosystem services, and economic interests.
README: Data and R code for "Predicting invasion success of cultivated naturalized plants in China"
Prepared by Bi-Cheng Dong
19 December 2024
This README file lists and describes the files used for the analyses involved in the manuscript, "Predicting invasion success of cultivated naturalized plants in China", by Bi-Cheng Dong, Ran Dong, Qiang Yang, Nicole L. Kinlock, Fei-Hai Yu, and Mark van Kleunen, published in Journal of Applied Ecology. Please see the manuscript itself and the Supporting Information for additional details regarding this analysis.
Files included
Dataset of cultivated naturalized plants in China
- chklist.invasion2018_20230929_v24.csv: this dataset is the primary source used in all analyses. Each row represents a naturalized seed plant taxon classified as either non-invasive or invasive in China, with its name standardized according to The Plant List (TPL). For each taxon, associated data on propagule pressure, environmental niches, and species traits are provided. The CSV file is encoded in UTF-8.
- Data description: below are detailed descriptions of each column in the dataset. Missing values are indicated by "NA" in the dataset.
  - Basic information
    - TPL_names: species names standardized by The Plant List
    - invasion.status.2022: invasion status of naturalized alien taxa (binary; 0: non-invasive; 1: invasive)
  - Propagule pressure
    - bg_num: number of Chinese botanical gardens where the taxon is cultivated (integer)
    - chklst_prov_num: number of Chinese provinces where the alien taxon is cultivated (integer)
    - wcup_eco_use_AF: use as animal food (binary; 0: no; 1: yes)
    - wcup_eco_use_EU: environmental uses (binary; 0: no; 1: yes)
    - wcup_eco_use_FU: use as fuels (binary; 0: no; 1: yes)
    - wcup_eco_use_GS: use as gene sources (binary; 0: no; 1: yes)
    - wcup_eco_use_HF: use as human food (binary; 0: no; 1: yes)
    - wcup_eco_use_IF: use as invertebrate food (binary; 0: no; 1: yes)
    - wcup_eco_use_MA: use for materials (binary; 0: no; 1: yes)
    - wcup_eco_use_ME: use as medicine (binary; 0: no; 1: yes)
    - wcup_eco_use_PO: use as poisons (binary; 0: no; 1: yes)
    - wcup_eco_use_SU: social uses (binary; 0: no; 1: yes)
    - wcup_eco_use_num: number of economic use categories provided by the WCUPS dataset (integer)
  - Environmental niches
    - hab_suit_mean: mean climatic suitability of the taxon in China estimated by SDMs (continuous; 0-1)
    - X1: native to Europe (binary; 0: no; 1: yes)
    - X2: native to Africa (binary; 0: no; 1: yes)
    - X3: native to Asia-Temp. (binary; 0: no; 1: yes)
    - X4: native to Asia-Trop. (binary; 0: no; 1: yes)
    - X5: native to Australasia (binary; 0: no; 1: yes)
    - X6: native to Pacific Isl. (binary; 0: no; 1: yes)
    - X7: native to North America (binary; 0: no; 1: yes)
    - X8: native to South America (binary; 0: no; 1: yes)
    - X9: native to Antarctic (binary; 0: no; 1: yes)
    - no.NativeRange.level03: native range size calculated by the number of TDWG level-3 regions (integer)
  - Species traits
    - lf_short.herb: short-lived herbs (binary; 0: no; 1: yes)
    - lf_long.herb: long-lived herbs (binary; 0: no; 1: yes)
    - lf_woody: woody species (binary; 0: no; 1: yes)
    - prop_type.seed.v2: propagation by seeds (binary; 0: no; 1: yes)
    - prop_type.veg.v2: propagation by vegetative means (binary; 0: no; 1: yes)
    - prop_type.both.v2: propagation by both means (binary; 0: no; 1: yes)
    - max.height.new2023: maximum height (continuous; unit: m)
    - pd.mean.to_native: the mean (pairwise) phylogenetic distance of the alien taxon to the 30,248 native taxa in China (continuous; unit: mya)
    - pd.min.to_native: the phylogenetic distance of the alien taxon to the most closely related native taxon (continuous; unit: mya)
    - pd.wmean.to_native: the weighted mean phylogenetic distance of the alien taxon to the 30,248 native taxa in China, weighted by occurrence in Chinese TDWG level-3 regions (n = 10) (continuous; unit: mya)
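As a minimal sketch of how the column conventions above could be handled when reading the file, the snippet below parses CSV text, maps "NA" to a missing value, and tallies invasion status. The analysis scripts described in this README are written in R; this Python example, the helper names `read_checklist` and `count_status`, and the three sample rows are illustrative assumptions, not part of the deposited code or data.

```python
import csv
import io
from collections import Counter

def read_checklist(handle):
    """Read checklist rows from an open text handle, mapping "NA" to None."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({key: (None if value == "NA" else value)
                     for key, value in record.items()})
    return rows

def count_status(rows, column="invasion.status.2022"):
    """Count taxa per invasion status code ("0": non-invasive; "1": invasive)."""
    return Counter(row[column] for row in rows if row[column] is not None)

# Made-up sample using the documented header names; these are not real records.
sample = io.StringIO(
    "TPL_names,invasion.status.2022,bg_num,max.height.new2023\n"
    "Taxon one,1,12,0.5\n"
    "Taxon two,0,3,NA\n"
    "Taxon three,0,NA,2.0\n"
)
rows = read_checklist(sample)
status_counts = count_status(rows)
print(status_counts)
```

For the real file, the handle would presumably come from `open("chklist.invasion2018_20230929_v24.csv", encoding="utf-8", newline="")`, since the README states that the CSV is UTF-8 encoded.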
R code used to conduct analyses
These R script files form a complete machine learning analysis pipeline, covering everything from data preparation to model training, optimization, evaluation, and interpretation. The workflow primarily utilizes the caret package in R and encompasses all key steps in a machine learning analysis.
Here is a brief explanation of these R script files:
- NEW001.data.preparation.r
  - Data preparation and preprocessing script
  - Format conversion and missing value handling
  - Feature engineering and data transformation
  - Code for Figure S5 in the supplementary file
- NEW002.caret.models.r
  - Dataset splitting (training/test sets)
  - Building various machine learning models using the caret package
- NEW003.caret.model.selection.r
  - Model selection and evaluation
  - Performance comparison of different models
  - Optimal model selection
  - Code for Figure S7 in the supplementary file
- NEW003a.caret.optimal.som.r
  - Construct Self-Organizing Map (SOM) model
  - SOM results visualization
  - Code for Figure S8 in the supplementary file
- NEW003b.caret.optimal.rf.r
  - Re-run the random forest model with case weights
  - Model evaluation metrics
  - Code for Figure S6 in the supplementary file
- NEW004.caret.combination.of.variable.importances.step01.r
  - Feature importance construction
- NEW004.caret.combination.of.variable.importances.step02v1.r
  - Feature importance visualization
  - Code for Figures 2 and 3 in the main text
  - Code for Figures S2 and S3 in the supplementary file
- NEW007.caret.pdp.plot.based.on.raw.data.v5.r
  - Partial Dependence Plots (PDP) based on raw data
  - Code for Figure S1 in the supplementary file
- NEW007.caret.pdp.plot.based.on.transformed.data.v5.r
  - Partial Dependence Plots (PDP) based on transformed data
  - Code for Figure 1 in the main text
- NEW008.caret.randomzation.text.step01.r
  - Randomization tests
  - Data shuffling/resampling
  - Caution: even with 30 cores running in parallel, the randomization tests take approximately five hours to run.
- NEW008.caret.randomzation.text.step02.r
  - Visualization of randomization test results
  - Code for Figure S4 in the supplementary file
Methods
Data compilation
We compiled a checklist of the 735 naturalized plant taxa introduced for cultivation in China, based on the Catalogue of Cultivated Plants in China (Lin, 2018) and The Checklist of the Naturalized Plants in China (Yan et al., 2019). The binomial names of these naturalized taxa were standardized according to The Plant List (TPL, version 1.1; http://www.theplantlist.org) using the R package 'Taxonstand' (Cayuela et al., 2021). Among these naturalized taxa, 435 were classified as non-invasive and 300 as invasive in China, based on information from Hao and Ma (2023). Non-invasive taxa are naturalized taxa that form self-sustaining populations outside cultivation but remain limited in spread. Invasive taxa are naturalized taxa that spread widely, causing economic, social, and ecological damage to the invaded ecosystems. These definitions are consistent with those used by Lin et al. (2021).
To identify the characteristics that could potentially drive the invasion success of naturalized taxa, we compiled data on 12 characteristics for each taxon, resulting in 34 variables. The 34 variables were grouped into three categories, including species traits and similarity to native species (life form, propagation mode, maximum height, phylogenetic mean pairwise distance [MPD], phylogenetic nearest pairwise distance [NNPD], and weighted mean pairwise distance [wMPD] to the native flora in China), propagule pressure (the number of botanical gardens, the number of provinces in which taxa were cultivated, economic use category and the number of economic use categories), and environmental niches (climatic suitability, native range size and continents of origin). Native range size and continents of origin were quantified using the Taxonomic Databases Working Group (TDWG) level-1 (continental) and level-3 (regional) geographical classifications. A phylogenetic tree was constructed using 735 naturalized plant taxa and 30,248 native plant taxa in China. Further details are provided in the Supplementary Material.
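The grouping of the 34 variables can be checked against the column names listed in the README's data description. The sketch below is one possible way to encode that mapping; the dictionary layout and the names `ECONOMIC_USES` and `VARIABLE_GROUPS` are illustrative assumptions (the original analysis is in R), while the column names themselves are taken from the README.

```python
# Two-letter codes of the economic use columns (wcup_eco_use_AF, ..., _SU).
ECONOMIC_USES = ["AF", "EU", "FU", "GS", "HF", "IF", "MA", "ME", "PO", "SU"]

# Hypothetical mapping from category to dataset columns, following the README.
VARIABLE_GROUPS = {
    "propagule_pressure": (
        ["bg_num", "chklst_prov_num"]
        + [f"wcup_eco_use_{code}" for code in ECONOMIC_USES]
        + ["wcup_eco_use_num"]
    ),
    "environmental_niches": (
        ["hab_suit_mean"]
        + [f"X{i}" for i in range(1, 10)]  # X1..X9: continents of origin
        + ["no.NativeRange.level03"]
    ),
    "species_traits": [
        "lf_short.herb", "lf_long.herb", "lf_woody",
        "prop_type.seed.v2", "prop_type.veg.v2", "prop_type.both.v2",
        "max.height.new2023",
        "pd.mean.to_native", "pd.min.to_native", "pd.wmean.to_native",
    ],
}

group_sizes = {name: len(cols) for name, cols in VARIABLE_GROUPS.items()}
total_variables = sum(group_sizes.values())
print(group_sizes, total_variables)
```

Counting the columns this way gives 13 propagule pressure variables, 11 environmental niche variables, and 10 species trait variables, which matches the total of 34 stated above.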
Data preparation
There were 38 naturalized taxa with missing values for some of the 34 variables. Following the guidance of Breiman (2003), missing data were imputed by weighting the frequency of the non-missing values with proximity values, using the 'rfImpute' function in the R package 'randomForest' (Feng et al., 2020).
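To make the proximity-weighting idea concrete, the sketch below shows only the update step for a single taxon with a missing value, assuming a proximity vector is already available (in `rfImpute` the proximities come from a fitted random forest, which is not reproduced here). It follows the wording of the paragraph above: a proximity-weighted mean for continuous variables and a proximity-weighted frequency for categorical ones. The function names and numbers are hypothetical, the example is in Python rather than the R used for the analysis, and the exact rule inside `rfImpute` may differ in detail.

```python
from collections import defaultdict

def impute_continuous(values, proximities):
    """Proximity-weighted mean of the non-missing values.

    values: observed values of one variable (None where missing).
    proximities: proximity of the target taxon to each taxon, same order.
    """
    pairs = [(v, w) for v, w in zip(values, proximities) if v is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        raise ValueError("no non-missing case has positive proximity")
    return sum(v * w for v, w in pairs) / total_weight

def impute_categorical(values, proximities):
    """Category with the largest proximity-weighted frequency."""
    weighted_freq = defaultdict(float)
    for v, w in zip(values, proximities):
        if v is not None:
            weighted_freq[v] += w
    if not weighted_freq:
        raise ValueError("all values are missing")
    return max(weighted_freq, key=weighted_freq.get)

# Hypothetical proximities of taxon 1 (the one with missing data) to all taxa.
proximities = [0.5, 1.0, 0.25, 0.25]
heights = [2.0, None, 4.0, 6.0]
modes = ["seed", None, "veg", "seed"]

imputed_height = impute_continuous(heights, proximities)
imputed_mode = impute_categorical(modes, proximities)
print(imputed_height, imputed_mode)
```

In the full procedure this update would be repeated over several iterations, refitting the forest on the newly completed data each time.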
To standardize the explanatory variables, we scaled all continuous variables to have a mean of zero and a standard deviation of one (Table S1). Before scaling, we applied a natural log(x + 1) transformation to the number of botanical gardens, the number of provinces, the number of economic use categories, and native range size; a natural log(x + 0.001) transformation to climatic suitability; and a natural log transformation to MPD, NNPD, wMPD, and maximum height, to achieve more regular distributions of these variables. As maximum height is associated with life form, we scaled maximum height separately within each life-form category.
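The transformations described above can be sketched as follows. This is an illustrative Python version, not the authors' R code: the helper names and the sample numbers are made up, and the use of the sample standard deviation (n - 1 denominator, as in R's `sd`) is an assumption, since the text does not state which denominator was used.

```python
import math
import statistics

def zscore(values):
    """Scale values to mean 0 and standard deviation 1 (sample SD, n - 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def log1p_then_scale(values):
    """Natural log(x + 1), then z-score (counts such as bg_num)."""
    return zscore([math.log(v + 1) for v in values])

def log_offset_then_scale(values, offset=0.001):
    """Natural log(x + offset), then z-score (climatic suitability in 0-1)."""
    return zscore([math.log(v + offset) for v in values])

def log_then_scale_within_groups(values, groups):
    """Natural log, then z-score separately within each group label."""
    logged = [math.log(v) for v in values]
    scaled = [0.0] * len(values)
    for label in set(groups):
        idx = [i for i, g in enumerate(groups) if g == label]
        for i, z in zip(idx, zscore([logged[j] for j in idx])):
            scaled[i] = z
    return scaled

# Made-up values for four taxa.
bg_scaled = log1p_then_scale([0, 3, 12, 40])
suit_scaled = log_offset_then_scale([0.0, 0.2, 0.6, 0.9])
height_scaled = log_then_scale_within_groups(
    [0.3, 1.2, 5.0, 20.0], ["herb", "herb", "woody", "woody"]
)
print(bg_scaled, suit_scaled, height_scaled)
```

The offsets keep zeros defined under the logarithm, and scaling height within life form means each value is expressed relative to other taxa of the same life form rather than to the whole dataset.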
As a preliminary check on whether the variable grouping was reasonable, we used Pearson correlation analysis to examine bivariate relationships and Self-Organizing Map (SOM) analysis to visualize clustering patterns in the high-dimensional data through dimensionality reduction (Kohonen, 2001). Further details are provided in the Supplementary Material.
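For reference, the Pearson coefficient between two variables $x$ and $y$ is $r = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$. A minimal Python sketch of the pairwise computation is given below; the function names and the toy columns are illustrative assumptions, and the SOM step is not sketched here.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }

# Toy columns with made-up values; names borrowed from the dataset for context.
columns = {
    "bg_num": [1.0, 2.0, 3.0, 4.0],
    "chklst_prov_num": [2.0, 4.0, 6.0, 8.0],
    "hab_suit_mean": [0.9, 0.7, 0.4, 0.1],
}
correlations = pairwise_correlations(columns)
print(correlations)
```

Strong correlations among variables within a category, and weaker ones across categories, would be the kind of pattern that supports treating the categories as groups.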