Data and code from: Assessing the transferability of species distribution models: A cross-continental evaluation
Data files
May 19, 2026 version files 27.76 KB
- 4_R_codes.zip (11.10 KB)
- README.md (16.66 KB)
Abstract
To identify an appropriate method for assessing the transferability of species distribution models, the distribution of three invasive plant species was predicted using Maxent across different continents with datasets sourced from various continental combinations.
The study confirmed that the conventional approach, namely using random holdout test datasets and the AUC (the area under the receiver operating characteristic curve), failed to reliably assess model transferability. Instead, using spatially independent test datasets and RWIP (correlation coefficient with the predictions from the model calibrated in the predicted region) provided a more robust evaluation approach.
This dataset contains the data and code underlying the study. It consists of four parts: distribution data, environmental data, Maxent predictions (Maxent output and Maxent values at point locations), and R codes used for thinning presence data and calculating the evaluation indices.
Model transferability was examined using three invasive plant species. For each species, three or four invaded regions were selected for distribution predictions, and the models were calibrated using data from various regional combinations. The specific geographical domains of each region are described in "5 Geographical extents of data".
The prediction results were evaluated using three indices: AUC, CBI, and RWIP.
AUC: the area under the receiver operating characteristic curve.
CBI: continuous Boyce index.
RWIP: correlation coefficient with the predictions from the model calibrated in the predicted region.
Additionally, 4-fold cross-validations and internal predictions were made using the same data.
In 1 to 4 below, the item numbers correspond to the zip file name numbers. The file names are long to indicate the content of each file. However, shorter file names were used in this study.
1 Distribution data
Filename: 1_Distribution_data.zip
The distribution data (global presence data) of the three species were downloaded from the Global Biodiversity Information Facility (GBIF) database (GBIF.org, 26 August 2023, 21 October 2023, and 15 December 2023; GBIF Occurrence Download). They can be accessed at https://doi.org/10.15468/dl.2gj4cq, https://doi.org/10.15468/dl.8u2cxw, https://doi.org/10.15468/dl.nxrrxp, https://doi.org/10.15468/dl.9c2rdh, https://doi.org/10.15468/dl.unn2pm, and https://doi.org/10.15468/dl.uwhnct. The data in 1.1 to 1.3 below were extracted from the downloaded data.
1.1 Presence data
Presence data with coordinate information from the regions of interest were extracted using QGIS version 3 and shapefiles from the global administrative areas (GADM) database (https://gadm.org).
1.2 Presence data used for model calibration
To eliminate the effect of area difference due to latitude on the amount of presence data, the data were thinned according to the latitude of the area. In the column header, "X" represents decimal longitude and "Y" represents decimal latitude.
1.3 Presence data up to 1970 and from 1971 onward
To compare the distribution up to 1970 and from 1971 onward, the presence data with coordinate and year information were divided according to year information. The "Distribution_maps" file displays these presence data on maps and shows their relationships to the model calibration regions and prediction target regions.
2 Environmental data
Filename: 2_Environmental_data.zip
The data of bioclimate variables were extracted from the CliMond database (Kriticos et al. 2012) using QGIS and shapefiles from the GADM database. Each subfolder name is a combination of the species' scientific name and the regional names.
Each subfolder contains the data of 8 variables: the annual mean temperature (Bio01), temperature seasonality (Bio04), mean temperature of the warmest quarter (Bio10), mean temperature of the coldest quarter (Bio11), precipitation seasonality (Bio15), precipitation of the wettest quarter (Bio16), annual mean moisture index (Bio28), and moisture index seasonality (Bio31).
3 Maxent predictions
Filename: 3_Maxent_predictions.zip
3.1 Maxent output
A Maxent (version 3.4.4; Phillips et al. 2017) run produces multiple output files. The directory "56 predictions" contains the output files from the distribution predictions in the invaded regions described above. In the region names of the folders, the part before the hyphen is the calibration region, and the part after the hyphen is the target region.
The directory "4-fold_cross-validation" contains the output files from 4-fold random holdout cross-validations using the calibration data. The directory "Internal prediction" contains the output files from the distribution predictions by models calibrated using data from the predicted regions.
3.2 Maxent values
Maxent values at point locations were extracted using QGIS and exported as CSV files. In the column header of each file, "X" represents decimal longitude, "Y" represents decimal latitude, and "SAMPLE_1" represents the output value of Maxent (cloglog value).
3.2.1 Maxent values for presence cells
Presence data were subjected to simple systematic sampling with a reference grid of the same resolution as environmental data, and used to sample Maxent values from the 56 predictions.
3.2.2 Maxent values for absence cells
The center points of environmental data cells without presence data were used to sample Maxent values from the 56 predictions. The "2" at the end of each file name is added to avoid having the same name as a file in other directories.
3.2.3 Maxent values to calculate RWIP
The center points of environmental data cells were used to sample Maxent values from the 56 predictions and internal predictions. The data in column "SAMPLE_1" are from the internal predictions in the target regions, and the data in column "SAMPLE_1_2" are from the 56 predictions. The "3" at the end of each file name is added to avoid having the same name as a file in other directories.
3.2.4 Maxent values for cross-validations
The center points of environmental data cells were used to sample Maxent values from the 4-fold cross validations.
4 R codes
Filename: 4_R_codes.zip
R version 4.2.2 was used.
4.1 Thinning applied to presence data
To eliminate the effect of area differences due to latitude on the amount of presence data, the data were thinned according to the latitude of the area.
4.1.1 Longitudinal length
A table was created showing the longitudinal length (the east-west distance spanned by one degree of longitude) at each latitude, in 1° latitude steps. The "deg" column gives the latitude, the "x" column gives the longitudinal length calculated by treating the Earth as a sphere, and the "z" column gives the longitudinal length calculated using the Earth's eccentricity (both in km). This table was exported as a CSV file. The file named "distance" is the output created by this code.
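The table described above can be sketched as follows. This is a minimal Python illustration, not the R code in the archive: the radius and ellipsoid constants (a mean radius of 6371 km and WGS84 parameters) are assumptions, since the README does not state which values were used, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0      # assumed mean radius for the spherical case
WGS84_A_KM = 6378.137         # assumed semi-major axis (WGS84)
WGS84_E2 = 0.00669437999014   # assumed squared first eccentricity (WGS84)


def length_sphere_km(lat_deg):
    """East-west length of one degree of longitude on a sphere."""
    return math.pi / 180.0 * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))


def length_ellipsoid_km(lat_deg):
    """East-west length of one degree of longitude on an ellipsoid."""
    phi = math.radians(lat_deg)
    return (math.pi / 180.0 * WGS84_A_KM * math.cos(phi)
            / math.sqrt(1.0 - WGS84_E2 * math.sin(phi) ** 2))


def build_distance_table(max_lat=90):
    """Rows using the README's column names: deg, x (sphere), z (ellipsoid)."""
    return [{"deg": d, "x": length_sphere_km(d), "z": length_ellipsoid_km(d)}
            for d in range(0, max_lat + 1)]


distance = build_distance_table()
```

Both columns shrink roughly with the cosine of latitude, which is why a fixed number of degrees covers less ground toward the poles.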
4.1.2 Frequency distribution
The data from "1.1 Presence data" were imported (the object names varied depending on the region; the name "D" is used here), and a table of the number of presence records per degree of latitude was created.
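A frequency table of this kind might look like the sketch below. It is written in Python for illustration only; labelling each 1° class by its lower bound and the sample coordinates are assumptions, not details taken from the archived R code.

```python
import math
from collections import Counter


def latitude_frequency(points):
    """Count presence records in each 1-degree latitude class.

    `points` is an iterable of (X, Y) pairs, where X is decimal longitude
    and Y is decimal latitude. Each class is labelled by its lower bound
    (the floor of Y), which is an assumed convention.
    """
    counts = Counter(math.floor(y) for _, y in points)
    return dict(sorted(counts.items()))


# Hypothetical records, for illustration only.
D = [(151.2, -33.9), (150.9, -33.2), (174.8, -36.8),
     (144.9, -37.8), (145.1, -37.1)]
freq = latitude_frequency(D)
```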
4.1.3 Thinning
This code was subsequently executed in the same project as "4.1.2 Frequency distribution".
This code divided the presence data into subsets at 1° latitude intervals, and reduced the quantity of presence data in accordance with the ratio of longitudinal length. Before the execution, the file "distance" was imported, and the ratio of the longitudinal length corresponding to the latitude of each subset to that of the highest latitude subset for each species was calculated (the values in the "z" column were used for the calculation, and the calculation results were stored in the "rate" column).
A column indicating the species name was added to the data frame "distribution3", which was then exported and used for model calibration. The data frame "check1" is a table that shows, for each latitude class, the number of records before thinning, the number of records removed, and the number of records after thinning.
4.2 AUC calculation
4.2.1 Data
4.2.2 Calculation
To calculate the AUC for the 56 predictions, these two codes were executed in the same project. Data from "3.2.1 Maxent values for presence cells" and "3.2.2 Maxent values for absence cells" were imported as the objects "presence" and "absence", respectively. "4.2.1 Data" organized these data and created a data frame "df". Subsequently, "4.2.2 Calculation" calculated the AUC based on "df".
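As an illustration of what such a calculation involves, the sketch below computes the AUC from Maxent values at presence and absence cells using the rank-sum (Mann-Whitney) formulation. It is a Python sketch under assumptions: the archived R code may use a different routine, and the function name and sample values are hypothetical.

```python
def auc_from_scores(presence, absence):
    """AUC as the probability that a presence cell outscores an absence cell.

    Tied values receive average ranks, so a tie between a presence cell and
    an absence cell counts as one half.
    """
    labelled = sorted([(v, 1) for v in presence] + [(v, 0) for v in absence])
    n = len(labelled)
    rank_sum_presence = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and labelled[j][0] == labelled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        rank_sum_presence += avg_rank * sum(lab for _, lab in labelled[i:j])
        i = j
    n_p, n_a = len(presence), len(absence)
    u = rank_sum_presence - n_p * (n_p + 1) / 2.0
    return u / (n_p * n_a)


# Hypothetical cloglog values, for illustration only.
presence = [0.9, 0.8, 0.4]
absence = [0.7, 0.4, 0.2, 0.1]
auc = auc_from_scores(presence, absence)
```

With these values, 10.5 of the 12 presence-absence pairs are ranked correctly (one tie counts as a half), giving an AUC of 0.875.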
4.3 CBI calculation
4.3.1 Cell distribution
4.3.2 Moving windows
These two codes were executed in the same project when calculating the CBI for the 56 predictions. The data frame "df" created by the code "4.2.1 Data" was imported as the object "D0", and the formats of "frame" and "frame2" (frame_format.csv and frame2_format.csv) were imported.
"4.3.1 Cell distribution" created a data frame showing two types of numbers: all cells and cells that contain presence data, for each 0.01 increment of the Maxent value ("frame"). In the format of "frame", each row corresponds to a numerical range, with the values from 0 to 1 divided into increments of 0.01. In the column header, "low" represents the lower limit of the numerical value for each row, "high" represents the upper limit, "D" represents the number of cells whose Maxent values are within the numerical range of each row, and "DP" represents the number of cells that contain presence data.
Subsequently, "4.3.2 Moving windows" created a data frame showing these numbers for each moving window ("frame2") and exported it. In the format of "frame2", the column header's "low" represents the lower limit of each moving window, "high" represents the upper limit, "D" represents the number of cells whose Maxent values fall within these numerical ranges, and "DP" represents the number of cells that contain presence data. Additionally, in the column header, "E" represents the ratio of "D" to the total number of cells, "P" represents the ratio of "DP" to the total number of cells that contain presence data, and "rate" represents the ratio of "P" to "E".
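The two steps above can be sketched in Python as follows. The column names follow the README, but the window width (0.1, moved in 0.01 steps), the treatment of bin boundaries, and the function names are assumptions, because the README does not state them. Windows without cells get a "rate" of None.

```python
import math

N_BINS = 100       # 0.01-wide bins over [0, 1], as in "frame"
WINDOW_BINS = 10   # assumed window width of 0.1 (not stated in the README)


def bin_index(value):
    """Index of the 0.01-wide bin holding `value`; 1.0 goes in the last bin."""
    return min(math.floor(value * N_BINS + 1e-9), N_BINS - 1)


def cell_distribution(all_values, presence_values):
    """Build "frame": per-bin counts of all cells (D) and presence cells (DP)."""
    frame = [{"low": i / N_BINS, "high": (i + 1) / N_BINS, "D": 0, "DP": 0}
             for i in range(N_BINS)]
    for v in all_values:
        frame[bin_index(v)]["D"] += 1
    for v in presence_values:
        frame[bin_index(v)]["DP"] += 1
    return frame


def moving_windows(frame, window_bins=WINDOW_BINS):
    """Build "frame2": D, DP, E, P and rate (P / E) for each moving window."""
    total_d = sum(r["D"] for r in frame)
    total_dp = sum(r["DP"] for r in frame)
    frame2 = []
    for start in range(len(frame) - window_bins + 1):
        rows = frame[start:start + window_bins]
        d = sum(r["D"] for r in rows)
        dp = sum(r["DP"] for r in rows)
        e = d / total_d
        p = dp / total_dp
        rate = p / e if d > 0 else None  # undefined for empty windows
        frame2.append({"low": rows[0]["low"], "high": rows[-1]["high"],
                       "D": d, "DP": dp, "E": e, "P": p, "rate": rate})
    return frame2


# Hypothetical Maxent values, for illustration only.
all_cells = [0.05, 0.15, 0.15, 0.95, 1.0]
presence_cells = [0.15, 0.95]
frame = cell_distribution(all_cells, presence_cells)
frame2 = moving_windows(frame)
```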
4.3.3 Rank correlation coefficient
The data frame "frame2" created by the code "4.3.2 Moving windows" and the format of "D0" (D0_format.csv) were imported, and this code was executed to calculate the CBI.
The column header items in the "D0" format represent the following variables.
x0: the ranking of the numerical values corresponding to each moving window.
x: the ranking of the numerical values corresponding to each moving window, for moving windows that contain at least one cell.
y: the "rate" of "frame2" created above.
ranky: the ranking of the values of "y".
nexty: when the data are sorted in order of the value of "y", the value of "y" of the next data. (The values in this column were used to check if there were other instances of the same value "y".)
dx: deviation of "x".
dy: deviation of "y".
dxy: product of deviations.
TF: Whether the moving window contains no cells. (If the moving window contains no cells, "FALSE" is replaced with "TRUE".)
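The columns listed above suggest a rank correlation computed from deviations and their products. A minimal Python sketch of that idea follows. It assumes a Spearman correlation in which tied "y" values receive average ranks and empty windows are dropped; the archived R code may handle ties differently, and all function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2.0
        i = j
    return ranks


def pearson(a, b):
    """Pearson correlation from deviations (dx, dy) and their products (dxy)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    dx = [v - mean_a for v in a]
    dy = [v - mean_b for v in b]
    dxy = sum(p * q for p, q in zip(dx, dy))
    return dxy / math.sqrt(sum(p * p for p in dx) * sum(q * q for q in dy))


def boyce_index(rates):
    """CBI from the per-window "rate" values, ordered by Maxent value.

    Windows with no cells are passed as None and skipped, so "x" ranks only
    the windows that contain at least one cell.
    """
    y = [r for r in rates if r is not None]
    x = [float(i) for i in range(1, len(y) + 1)]
    return pearson(x, average_ranks(y))


# Hypothetical rates, for illustration only.
cbi_increasing = boyce_index([0.0, 0.5, None, 1.0, 2.0])
cbi_decreasing = boyce_index([2.0, 1.0, 0.5])
```

A rate that rises steadily with the Maxent value gives a CBI of 1, and a steadily falling rate gives -1.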
4.3.4 Cross-validation cell distribution
When calculating the CBI for 4-fold cross-validations, this code was executed instead of "4.3.1 Cell distribution" in the same project as "4.3.2 Moving windows".
Before this execution, the test point data were extracted from the "samplePrediction" files in "3.1 Maxent output" (rows with "test" in the "Test.or.train" column) and imported as the object "DP0" (or DP1, DP2, DP3) to be used as the presence cell data. The data from "3.2.4 Maxent values for cross-validations" were imported as the object "DE0" (or DE1, DE2, DE3) to be used as the data for all cells. The results of a 4-fold cross-validation consist of four predictions, numbered 0 to 3 in the Maxent output, and the numbers after DP and DE correspond to these. Each project was run using the DP and DE objects with the same number. The code for DP0 and DE0 is shown here; the other codes are identical except for the numbers after DP and DE.
Additionally, the formats of "frame" and "frame2" were imported, and the project created and exported "frame2".
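The extraction of test points described in this section amounts to a simple row filter. The Python sketch below illustrates it on an in-memory CSV. Only the "Test.or.train" column name and the "test" label come from the README; the other column names and all values are hypothetical.

```python
import csv
import io


def extract_test_points(csv_text, column="Test.or.train"):
    """Return the rows whose `column` value is "test"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column] == "test"]


# Hypothetical file content, for illustration only.
sample = """X,Y,Test.or.train,prediction
10.5,45.2,train,0.81
11.0,46.1,test,0.64
12.3,44.8,test,0.22
"""
DP0 = extract_test_points(sample)
```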
4.4 RWIP calculation
Correlation coefficient
Data from "3.2.3 Maxent values to calculate RWIP" were imported, and this code was executed to calculate the RWIP. The object "r3" contains the RWIP value.
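For orientation, the calculation can be sketched as a correlation between the two value columns of the files in "3.2.3 Maxent values to calculate RWIP". The Python sketch below assumes a Pearson correlation, since the README says only "correlation coefficient". The column names "SAMPLE_1" and "SAMPLE_1_2" come from the README, while the function name and values are hypothetical.

```python
import csv
import io
import math


def rwip(rows, internal_col="SAMPLE_1", transferred_col="SAMPLE_1_2"):
    """Pearson correlation between internal and transferred predictions."""
    a = [float(r[internal_col]) for r in rows]
    b = [float(r[transferred_col]) for r in rows]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical file content, for illustration only.
sample = """X,Y,SAMPLE_1,SAMPLE_1_2
10.0,45.0,0.1,0.9
10.5,45.0,0.5,0.5
11.0,45.0,0.9,0.1
"""
rows = list(csv.DictReader(io.StringIO(sample)))
r3 = rwip(rows)
```

In this toy case the transferred model ranks the cells in exactly the opposite order to the internal model, so the value is -1.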
5 Geographical extents of data
For each region, the area boundaries were defined using administrative districts to ensure that they included the distribution areas of the species. The GBIF database contains presence data with and without coordinate information. The two were compared within the above areas, and if a country (or, in the case of the United States, Canada, Brazil, Russia, China, and Australia, an administrative division such as a state or province) had the latter but not the former, it was excluded from the area defined above.
The specific geographical domains of each region are as follows.
<Oxalis latifolia>
Native region: America
Invaded regions: Oceania, Africa, and Europe
America:
The geographical area was defined as follows:
The area of the Americas south of the United States-Canada border. This includes 34 countries (all countries as of November 2023), French Guiana, and four regions that had presence data with coordinate information (Bermuda, Guadeloupe, Martinique, and the Turks and Caicos Islands).
The following countries and regions were excluded from the above area because only presence data without coordinate information were available: Uruguay, three states of the United States (Georgia, Louisiana, and Washington), and the Bahia state of Brazil.
Oceania:
Australia, New Zealand, and New Caledonia, which had presence data with coordinate information and were in proximity to each other.
Africa:
The geographical area was defined as follows:
54 countries (all countries as of November 2023), Western Sahara, and two regions that had presence data with coordinate information (Canary Islands and Reunion).
Eritrea and Burundi were excluded from the above area because only presence data without coordinate information were available.
Europe: Excluding Cyprus, Iceland, and Russia
<Digitaria sanguinalis>
Native region: Europe
Invaded regions: Oceania, Africa, North America, and South America
Europe:
Includes European Russia (five Federal Districts: Central, South, North-Western, Volga, and North Caucasus, excluding Novaya Zemlya and the islands north) and three South Caucasus countries (Armenia, Azerbaijan, and Georgia), excluding Cyprus and Iceland.
The following countries and regions were excluded from the above area because only presence data without coordinate information were available: three countries (Albania, Armenia, and Bosnia and Herzegovina) and the Kirov Oblast of Russia.
Oceania:
Australia, New Zealand, and New Caledonia, which had presence data with coordinate information and were in proximity to each other.
Africa:
The geographical area was defined as follows: 54 countries (all countries as of November 2023), Western Sahara, and the Canary Islands (this region had presence data with coordinate information).
The following nine countries were excluded from the above area because only presence data without coordinate information were available: Burkina Faso, Ethiopia, Senegal, Seychelles, Somalia, South Sudan, Sudan, Uganda, and Zambia.
North America:
The area from southern Canada (the 10 provinces, excluding the three territories) in the north to Panama in the south.
The following seven countries were excluded from the above area because only presence data without coordinate information were available: Antigua and Barbuda, Belize, Dominican Republic, El Salvador, Jamaica, Saint Lucia, and Trinidad and Tobago. North Dakota in the United States was also excluded for the same reason.
South America:
Uruguay and six states of Brazil (Acre, Amazonas, Ceara, Rondonia, Roraima, and Sergipe) were excluded because only presence data without coordinate information were available.
<Amaranthus retroflexus>
Native region: North America
Invaded regions: Oceania, Europe, and East Asia
North America:
The Americas north of Panama. The Caribbean Islands were excluded because no presence data with coordinate information were available.
Oceania: Australia and New Zealand, which had presence data.
Europe:
Includes European Russia (five Federal Districts: Central, South, North-Western, Volga, and North Caucasus) and three South Caucasus countries, excluding Cyprus and Iceland.
East Asia: China, Japan, Mongolia, North Korea, South Korea, and Taiwan
References
Kriticos, D. J., B. L. Webber, A. Leriche, et al. 2012. "CliMond: Global High-Resolution Historical and Future Scenario Climate Surface for Bioclimatic Modelling." Methods in Ecology and Evolution 3, no. 1: 53-64.
Phillips, S. J., R. P. Anderson, M. Dudik, R. E. Schapire, and M. E. Blair. 2017. "Open the Black Box: An Open-Source Release of Maxent." Ecography 40, no. 7: 887-893.
