Biogeography of the world’s worst invasive species has spatially-biased knowledge gaps but is predictable

Jenkins, David 1 ; Bevan, Hannah1; Chen, Wei1; Hart, Jacob1; Lindsay, Amanda1; Macamo, Laura1; Negash, Mekail1; Ohyama, Leo2; Pandolfi, Alessandra1; Zaragoza, George1

Published Feb 29, 2024 on Dryad. https://doi.org/10.5061/dryad.zw3r228bh

Data files

Feb 29, 2024 version files 10.38 MB

datatoanalyze.csv

10.37 MB
README.md

3.25 KB
README.txt

1.58 KB

Abstract

The world’s “100 worst invasive species” were listed in 2000. The list is taxonomically diverse and often cited (typically for single-species studies), and its species are frequently reported in global biodiversity databases. We acted on the principle that these notorious species should be well-reported to help answer two questions about global biogeography of invasive species (i.e., not just their invaded ranges): (1) “how are data distributed globally?” and (2) “what predicts diversity?” We collected location data for each of the 100 species from multiple databases; 95 had sufficient data for analyses. For question (1), we mapped global species richness and cumulative occurrences since 2000 in (0.5 degree)² grids. For question (2) we compared alternative regression models representing non-exclusive hypotheses for geography (i.e., spatial autocorrelation), sampling effort, climate, and anthropocentric effects.

Reported locations of the invasive species were spatially-biased, leaving large gaps on multiple continents. Accordingly, species richness was best explained by both anthropocentric effects not often used in biogeographic models (Government Effectiveness, Voice & Accountability, human population size) and typical natural factors (climate, geography; R² = 0.87). Cumulative occurrence was strongly related to anthropocentric effects (R² = 0.62). We extract five lessons for invasive species biogeography; foremost is the importance of anthropocentric measures for understanding invasive species diversity patterns and large lacunae in their known global distributions. Despite those knowledge gaps, advanced models here predict well the biogeography of the world’s worst invasive species for much of the world.

https://doi.org/10.5061/dryad.zw3r228bh

The provided datatoanalyze.csv file represents data further processed in provided R code to include spatial autocorrelation for each of species richness and cumulative occurences analyses.

Description of the data and file structure

The data file includes 59586 rows and 19 columns. NAs indicate missing data. Columns include:

a row ID
lon: longitude (decimal degrees) for the center of a 0.5 degree grid cell
lat: latitude (decimal degrees) for the center of a 0.5 degree grid cell
UN: the UN code for the country
ISO3: the ISO3 code for the country
NAME: the country name
Country: may be identical to NAME, but some differences (e.g., The Republic of ...) occur via different data sources
corrupt: the corruption score (range = -2.5 to 2.5) for the country, from the World Bank (Kaufmann et al. 2011)
goveffect: the Government Effectiveness score (range = -2.5 to 2.5) for the country, from the World Bank (Kaufmann et al. 2011)
voiceacct: the Voice & Accountability score (range = -2.5 to 2.5) for the country, from the World Bank (Kaufmann et al. 2011)
gpwpop: the population estimate for the country, from the Socioeconomic Data and Applications Center (SEDAC; https://sedac.ciesin.columbia.edu/data/collection/gpw-v4)
medianGDPpc: representing per capita GDP ($) based on data from (Kummu et al. 2018) and using the median of 2000-2015.
Anthrome: categorical land uses from SEDAC: http://sedac.ciesin.columbia.edu/data/set/anthromes-anthropogenic-biomes-world-v2-2000.
Ann_mn_temp: annual mean temperature (degrees C), from WorldClim (Fick et al. 2017)
Ann_mn_precip: annual mean precipitation (mm), from WorldClim (Fick et al. 2017)
totrecords: cumulative total occurrence records
logtotrecs: log10 of totrecords
SR: species richness
fanth & fUN: factors of Anthrome and UN

Sharing/Access information

Socioeconomic Data and Applications Center (SEDAC) provided:
- population data https://sedac.ciesin.columbia.edu/data/collection/gpw-v4)
- anthromes http://sedac.ciesin.columbia.edu/data/set/anthromes-anthropogenic-biomes-world-v2-2000.

Governance indicators (Kaufmann et al. 2011) were provided by https://info.worldbank.org/governance/wgi
Climate data were provided by https://www.worldclim.org/data/bioclim.html

Code/Software

R code developed and used here is provided separately at Zenodo. Code was run using R version 4.3.2 (2023-10-31). R packages used in the code are: tidyverse, ggpointdensity, viridis, ggpubr, glmmTMB, GLMMadaptive, bbmle, performance, MASS, maps, wesanderson, RColorBrewer, ggmap, rworldmap, readxl.

Data Acquisition and Processing

Data were acquired from multiple data bases for the 100 invasive species in February 2022 using the spocc package in R (Chamberlain 2021). Data sources (in alphabetical order) included: the Atlas of Living Australia ('ALA'; https://www.ala.org.au); eBird (http://www.ebird.org/home; Sullivan et al. 2009); the Integrated Digitized Biocollections ('iDigBio'; https://www.idigbio.org; Matsunaga et al. 2013); the Global Biodiversity Information Facility (GBIF (https://www.gbif.org); Ocean 'Biogeographic' Information System ('OBIS'; https://portal.obis.org; Grassle and Stocks 1999); VertNet (https://vertnet.org; Constable et al. 2010); and the US Geological Survey’s Biodiversity Information Serving Our Nation ('BISON'; replaced December 2021 by GBIF). Several databases set limits to 100,000 initial point records (before cleaning, described below) when accessed using spocc. As a result, data for 19 species with >100,000 point records (e.g., the European starling (Sturnus vulgaris Linnaeus) had >23 million point records) were obtained directly from GBIF on 23–25 February 2022, which included records already contributed to GBIF from multiple databases.

All searches were based on genus and species epithets, where taxonomic changes in the historical records required decisions. Where an epithet changed since the 100 species were listed in 2000 (Lowe et al. 2000), both former and current names were searched and concatenated. For example, Lowe et al. (2000) listed the American bullfrog as Rana catesbeiana Shaw, 1802 which is now Lithobates catesbeianus (Shaw, 1802); both were included in searches, as well as synonyms. Taxonomic synonyms were resolved by referring to the Catalogue of Life (https://www.catalogueoflife.org/) Centre for Agriculture and Bioscience International (http://www.cabi.org) and World Flora Online (http://worldfloraonline.org). Listed synonyms and new combinations were included in data, whereas undocumented synonyms (i.e., provided in a database but not resolved above) were excluded. Database entries that lacked species epithets (i.e., genus only) were excluded and all identities were at the species level. Some taxa formerly identified in (Lowe et al. 2000) as a species are now subspecies (e.g., the red-ear slider Trachemys stricta (Thunberg in Schoepff, 1792) is now Trachemys stricta elegans (Wied-Neuwied, 1838)). For those taxa, data may be more inclusive in current taxonomy than the original intent. However, our use of species-level identities includes sub-specific hybrids (e.g., Parham et al. 2020). Overall, our approach: matched the taxonomic resolution of (Lowe et al. 2000); recognized variation through time and space; and included potential hybrids among subspecies.

We set a threshold for a species to be included in analyses at > 30 records because we judged distributions with fewer records to be inadequately represented. As a result, four species (notably disease agents or vectors) had too few data to be analyzed here: Aphanomyces astaci Schikora, 1906, Cinara cupressi (Buckton, 1881), Plasmodium relictum (Grassi & Feletti, 1891), and Trogoderma granarium Everts, 1898. Likewise, banana bunchy top virus was not present in databases, despite a reported global distribution (https://www.cabi.org/isc/datasheet/8161). As a potential alternative, we searched for its aphid vector (Pentalonia nigronervosa Coquerel, 1859) but obtained records that fully lacked Africa and Asia, despite the widespread tropical distribution of the virus. We thus treated banana bunchy top virus as an under-reported species and omitted it here. Finally, rinderpest was listed by Lowe et al. (2000) but since eradicated (Morens et al. 2011). Following Luque et al. (2014), we replaced rinderpest with Salvinia molesta D. S. Mitch, leaving 95 species to evaluate.

Species data were cleaned using two R packages. The scrubr package (https://github.com/ropensci/scrubr) was used with default settings to exclude records with geographic coordinates that were lacking, impossible, incomplete, imprecise, or unlikely. Data were further cleaned using the CoordinateCleaner package (Zizka et al. 2019), where records were excluded if geographic coordinates were zero (i.e., a flag for probable data error), near a country’s capital and geographic centroid, or near administrative locations (e.g., museums, GBIF headquarters). Data were then restricted to unique spatio-temporal records during the years 2000–2021 to exclude duplicate entries. This step also omitted older records that tend to have greater taxonomic and geographic uncertainty (e.g., GPS selective availability was removed in 2000). Finally, resulting maps were visually examined, where oddities (e.g., a tropical species located on Baffin Island or a terrestrial species in mid-ocean) were manually excluded from data. That last step removed a few locations per species, if any. As a result of the above process, data were cleaned to be conservative for errors in geographic distribution and consistent in taxonomy with Lowe et al. (2000).

Aggregation and Mapping

We spatially aggregated point data per species in (0.5 degree)² grid cells, using the World Geodetic System (WGS84); the basis for the geospatial positioning system. Thus analyses below and summarized data refer to (0.5 degree)² grid cells as units of study. Aggregation in space simplified variable coordinate accuracy in original records while retaining substantial resolution for global analyses. For two reasons that affected interpretations, we also aggregated data in time by pooling all records obtained for years 2000-2021. First, species richness is then based on presence/absence of reported species at any time during 22 consecutive years and should be sensitive to infrequent observations or occurrences. We reasoned that species absence maintained through two decades was either: (a) likely true or (b) due to lack of submitted records for that location, where the difference may be inferred from spatial patterns of records. Secondly, the difference between species richness (i.e., presence/absence) and cumulative occurrence was enhanced. Species richness is fully insensitive to commonality or rarity; a single record here obtains the same result as daily repeats for 22 years. In contrast, cumulative occurrences may range from 0 to thousands of (0.5 degree)² pixels during 22 years and could indicate commonality or rarity. Therefore, fundamental differences between species richness and occurrences were enhanced here by using data for years 2000–2021. We mapped species richness and cumulative records to address question 1 (“how are data distributed globally?”).

Potential Predictors of Invasive Species

We analyzed spatial autocorrelation (using longitude and latitude of 0.5^o grid cell centers) with local estimation (“loess”) regression to obtain a surface representing only geographic coordinate effects. Loess regression is a robust, nonparametric approach to represent a complex spatial surface (Ferrier et al. 2002, Helsel and Ryker 2002) and is not too computationally-intensive for fine-grained global data, unlike approaches based on covariance matrices or network meshes. The spatial texture of a loess regression surface is determined by its span, where values <1 have more texture and values >1 are smoother. We modeled species richness and cumulative records using the loess command in R, with degree = 2 (i.e., 2nd-order polynomial) and least-squares fitting. We iteratively adjusted span to minimize the residual standard error and maximize the correlation between predicted values and actual total records. Predicted values represented spatial autocorrelation alone.

Subsequent models using additional predictors (Table 1) included predictions from the loess model to evaluate those additional effects after already accounting for spatial autocorrelation. In addition, hierarchical structure (i.e., non-independence of grids) of grid cells within countries and anthropogenic biomes (anthromes) was handled using spatial GLMMs (Dormann et al. 2007).

All other predictors were matched to the 0.5° gridded species data using projectRaster in the R raster package. Climate conditions were obtained from WorldClim (Fick et al. 2017) per grid cell and represented by annual mean temperature (BIO1) and annual mean precipitation (BIO12). Initial scatter plots of data indicated potential curved relationships between invasive species measures (i.e., richness and total occurrence) and climate variables, consistent with expected general tolerance limits given global extremes. Accordingly, we included quadratic terms for those predictors in models (because a quadratic term is the most parsimonious means to include curvature in a linear model).

Anthropocentric effects were represented by four measures. Human population size in 2010 was used to represent the 2000-2021 interval based on the decadal census and obtained from the Socioeconomic Data and Applications Center (SEDAC; https://sedac.ciesin.columbia.edu/data/collection/gpw-v4). We expected human population size to represent multiple potential effects, including observer effects and/or demographic effects per se (e.g., more people releasing invasive species, and/or causing different land use intensities). We obtained per capita GDP (GDP_pc) data from (Kummu et al. 2018) and used the median of 2000–2015 per country to represent economic effects.

Potential governmental effects were evaluated using two indices of governmental systems from the World Bank (Kaufmann et al. 2011). We used median scores per country for 2000–2020 to represent a country for each index. Governmental Effectiveness scores the quality of public and civil services, governmental independence from political pressures, quality of policy formulation and implementation, and credibility of governmental commitment to policies. Governmental Effectiveness is strongly and inversely correlated with the Control of Corruption score, where greater corruption in a source nation predicted more invasive species intercepted in trade to New Zealand (Brenton-Rule et al. 2016). We expected greater Governmental Effectiveness would either negatively predict invasive species richness (Brenton-Rule et al. 2016) if effective governments also control invasive species well, or positively predict invasive species richness if they instead report them more often.

We also evaluated Voice & Accountability (Kaufmann et al. 2011) as a predictor, which indicates the extent that citizens can participate in selecting their government, an uncensored media and freedoms of expression and association. Voice & Accountability is a component of the more inclusive Governmental Effectiveness, and thus potentially collinear. We examined potential collinearity while including Voice & Accountability in analyses because we expected greater Voice and Accountability would predict greater invasive species if reporting was inversely related to censorship. To be clear, all human socioeconomic and governmental predictors may represent positive (e.g., sampling and reporting) and/or negative (e.g., extirpation) effects on invasive species distributions, where resulting signs of model coefficients would help infer net effects.

Statistical Analyses

To address question 2 (what predicts diversity?), we analyzed hypothesized drivers of patterns (Table 1) alone and in combinations using GLMMs with fixed and random effects. Estimated fixed effects represented mean effects for hypothesized predictors, whereas random effects represented the spatial nesting of grid cells within larger spatial categories of countries and anthromes. Both countries and anthromes were used here as random effects because we expected each to represent different effects and because preliminary modeling showed that using both terms was more effective than either alone. Models representing random effects only and a null model (i.e., intercept-only model) were also computed and compared to models of main interest.

Based on residual distributions and iterative model selection, we used zero-inflated models, which include two parts: conditional and zero-inflation models. The conditional model estimated linear effects of predictors after having accounted for “excess” zeroes in a logistic (i.e., binary) zero-inflation model. This detail was important to the results because models used the many apparent “zeroes” (due to true absences or sampling effects) in data to generate predicted values in those unreported locations. We compared model predictions to observed data to validate model results, especially for predicting invasive species in locations lacking observations.

We used cumulative occurrences per grid cell as a covariate in species richness models to represent direct sampling effects on recorded species (akin to rarefaction), based on the expectation that more occurrences represent more reported observations. Thus, species richness results here were statistically distinct from those for cumulative records and were adjusted for sampling effects. In those same models, we then treated Government Effectiveness and Voice & Accountability as representing indirect effects due to infrastructure to report invasive species and access to those data.