Supporting information for: Accounting for the topology of road networks to better explain human-mediated dispersal in terrestrial landscapes

Rocabert, Charles; Fenet, Serge; Kaufmann, Bernard; Gippet, Jérôme M. W.

Published Nov 13, 2023 on Dryad. https://doi.org/10.5061/dryad.tdz08kq5j

Abstract

Human trade and movements are central to biological invasions worldwide. Human activities not only transport species across biogeographical barriers but also accelerate their post-introduction spread in the landscape. Thus, by constraining human movements, the spatial structure of road networks might greatly affect the regional spread of invasive species. However, few invasion models have accounted for the topology of road networks so far, and its importance for explaining the regional distribution of invasive species remains mostly unexplored. To address this issue, we developed a spatially explicit and mechanistic human-mediated dispersal model that accounts and tests for the influence of transport networks on the regional spread of invasive species. Using as a model the spread of the invasive ant Lasius neglectus in the middle Rhône valley (France), we show that accounting for the topology of road networks improves our ability to explain the current distribution of the invasive ant. In contrast, we found that using human population density as a proxy for the frequency of transport events decreases models’ performance and might thus not be as appropriate as previously thought. Finally, by differentiating road networks into sub-networks, we show that national and regional roads are more important than smaller roads for explaining spread patterns. Overall, our results demonstrate that the topology of transport networks can strongly bias regional invasion patterns and highlight the importance of better incorporating it into future invasion models. The mechanistic modelling approach developed in this study should help invasion scientists explore how human-mediated dispersal and topography shape invasion dynamics in landscapes. Ultimately, our approach could be combined with demographic, natural dispersal and environmental suitability models to refine spread scenarios and improve invasive species monitoring and management at regional to national scales.


This Dryad repository contains all the supporting information, data and the simulation framework related to the publication "Accounting for the topology of road networks to better explain human-mediated dispersal in terrestrial landscapes" (Rocabert et al. 2023).

The reader will find here a complete description of the repository, as well as guidelines to re-run the pipeline used to produce the results presented in the main manuscript.

Table of contents

1) Repository content summary
2) MoRIS (Model of Routes of Invasive Spread) software
3) Description of the dataset
- 3.1) Supplementary figures and tables
- 3.2) Supporting information documents
- 3.3) Input files
4) Complete pipeline (Supporting Information 5)
- 4.1) Pipeline organization
- 4.2) Content description
5) Running the pipeline
- 5.1) Introduction
- 5.2) Supported platforms and dependencies
- 5.3) Dependencies
- 5.4) Compile the simulation executable
- 5.5) Run the validation of the CMA-ES outputs
- 5.6) Find and run the best parameters set of each scenario
- 5.7) Compute performance metrics distributions
- 5.8) Generate the figures of the manuscript
- 5.9) Convert figures

1) Repository content summary

Figure S1. General overview of the human-mediated dispersal algorithm.
Figure S2. Log-likelihood distribution of each calibrated model.
Table S1. Performance metrics of each calibrated model.
Animation S1. Animation of the spatial spread through time for each selected model.
Supporting Information 1. Input files for the simulation framework.
Supporting Information 2. Runtime information.
Supporting Information 3. Spatial distribution of experimental and simulated presences/absences of each calibrated model.
Supporting Information 4. Description of the invasion dynamics of the road network model.
Supporting Information 5. Complete pipeline (including the simulation framework) used to produce post-analyses, figures and gif animation.

2) MoRIS (Model of Routes of Invasive Spread) software

The software developed and used for this publication is freely available at https://github.com/charlesrocabert/MoRIS. Consult the MoRIS GitHub page for details about the software, a first usage tutorial, and guidelines for constructing input files. Please contact the authors if you plan to use MoRIS for scientific purposes.

3) Description of the dataset

The content of the repository is described below. Please consult the main manuscript for more context (Rocabert et al. 2023).

3.1) Supplementary figures and tables

Supplementary Figures S1 and S2 and supplementary Table S1 are available; Table S1 is also provided in tabular format.
A gif animation showing one simulated example of human-mediated dispersal for each tested scenario is available in Animation S1.

3.2) Supporting information documents

Three supplementary information documents are available:

Supporting Information 2 (SupportingInformation2.pdf): Runtime information.
Supporting Information 3 (SupportingInformation3.pdf): Spatial distribution of experimental and simulated presences/absences of each calibrated model.
Supporting Information 4 (SupportingInformation4.pdf): Description of the invasion dynamics of the road network model.

3.3) Input files (Supporting Information 1)

Three input files are necessary to run human-mediated dispersal (HMD) simulations. As described in the main manuscript, these files are:

The map file (map.txt), containing the area of interest, i.e. the Rhône valley around the Lyon urban area (France), discretized in 2 × 2 km square cells;
The network file (network.txt), containing a discretized version of the road network connecting cells on the map;
The sample file (sample.txt), containing the sampling effort of the invasive species of interest, cell by cell.

A guideline is available at https://github.com/charlesrocabert/MoRIS/blob/master/INPUT_FILES_TUTORIAL.md to build these files.

Files are structured as follows:

• `map.txt`:

This file describes the properties of each cell in the discretized map.

Column 1: Cell identifier;
Column 2: X coordinate (in meters, cell centroid);
Column 3: Y coordinate (in meters, cell centroid);
Column 4: Cell's area (square meters);
Column 5: Cell's suitable area (square meters, not used here);
Column 6: Population size (not used here);
Column 7: Population density;
Column 8: Road density (not used here);
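As an illustration, the column layout above can be read into Python along the following lines. This is a minimal sketch, assuming that `map.txt` is whitespace-delimited and has no header line (neither is stated here, so check the file and the MoRIS input files tutorial first). The names `MapCell` and `parse_map` are hypothetical and not part of MoRIS, and the example record is made up.

```python
from typing import NamedTuple


class MapCell(NamedTuple):
    """One record of map.txt, following the eight columns listed above."""
    cell_id: int
    x: float                   # cell centroid, meters
    y: float                   # cell centroid, meters
    area: float                # square meters
    suitable_area: float       # square meters (not used in this study)
    population: float          # not used in this study
    population_density: float
    road_density: float        # not used in this study


def parse_map(lines):
    """Parse map.txt records into a dict keyed by cell identifier.

    Assumption: whitespace-delimited fields, no header line.
    """
    cells = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or incomplete lines
        cell = MapCell(int(fields[0]), *(float(v) for v in fields[1:8]))
        cells[cell.cell_id] = cell
    return cells


# Made-up record (not taken from the dataset): one 2 x 2 km cell.
example = ["12 843000.0 6520000.0 4000000.0 4000000.0 1500 375.0 0.8"]
cells = parse_map(example)
print(cells[12].population_density)
```

With the real data, the same function can be fed an open file handle, for example `parse_map(open("input_files/map.txt"))`.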

• `network.txt`:

This file is an adjacency list describing road connectivity between cells. Identifier -1 symbolizes the map border.

Column 1: Cell 1 identifier;
Column 2: Cell 2 identifier;
Column 3: Number of category I roads connecting the two cells;
Column 4: Number of category II roads connecting the two cells;
Column 5: Number of category III roads connecting the two cells;
Column 6: Number of category IV roads connecting the two cells;
Column 7: Number of category V roads connecting the two cells (not used here);
Column 8: Number of category VI roads connecting the two cells (not used here);
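The adjacency list can be explored with a few lines of Python. The sketch below assumes whitespace-delimited fields, no header line and integer road counts, none of which is stated here. It stores each connection as listed, without assuming that the list is symmetric. `parse_network` and `used_road_count` are hypothetical helper names, and the two example records are made up.

```python
BORDER = -1  # identifier that symbolizes the map border


def parse_network(lines):
    """Return {(cell_1, cell_2): [n_I, n_II, n_III, n_IV, n_V, n_VI]}.

    Assumption: whitespace-delimited fields, no header line, integer counts.
    """
    edges = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or incomplete lines
        cell_1, cell_2 = int(fields[0]), int(fields[1])
        edges[(cell_1, cell_2)] = [int(v) for v in fields[2:8]]
    return edges


def used_road_count(counts):
    """Sum roads of categories I to IV (V and VI are not used in this study)."""
    return sum(counts[:4])


# Made-up records: cells 1 and 2 are connected, and cell 2 touches the border.
example = ["1 2 0 1 0 2 0 0", "2 -1 1 0 0 0 3 0"]
edges = parse_network(example)

# Cells connected to the map border, whichever column holds the -1.
border_cells = sorted(
    {a if b == BORDER else b for a, b in edges if BORDER in (a, b)}
)
print(used_road_count(edges[(1, 2)]), border_cells)
```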

• `sample.txt`:

This file contains the Lasius neglectus experimental sampling dataset.

Column 1: Cell identifier;
Column 2: Number of positive samples in the cell (presence of L. neglectus);
Column 3: Total number of samples in the cell;
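From these three columns, the observed proportion of positive samples per cell follows directly (columns 2 and 3 correspond to the y_obs and n_obs columns described for the simulation outputs). A minimal sketch, assuming whitespace-delimited fields and no header line; `observed_proportions` is a hypothetical name and the example records are made up.

```python
def observed_proportions(lines):
    """Return {cell_id: positives / total} for every cell with at least one sample.

    Assumption: whitespace-delimited fields, no header line.
    """
    proportions = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        cell_id, positives, total = (int(v) for v in fields[:3])
        if total > 0:  # unsampled cells carry no information
            proportions[cell_id] = positives / total
    return proportions


# Made-up records: cell 5 has 1 positive out of 4 samples, cell 7 is unsampled.
example = ["5 1 4", "6 0 3", "7 0 0"]
p_obs = observed_proportions(example)
print(p_obs)
```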

4) Complete pipeline (Supporting Information 5)

The compressed archive SupportingInformation5.tar.gz contains the complete pipeline to reproduce the data analysis, the figures and the animation presented in this work. It also includes the simulation framework (see the GitHub repository for the latest version of the software).

The pipeline is organized around several folders that are pre-filled with simulation and post-processed data, to avoid time-consuming calculations for end-users and readers. Several scripts are also available to run the pipeline step by step.

4.1) Pipeline organization

1_simulation_results:
This folder contains the raw CMA-ES optimization results. For each of the four scenarios (isotropic, human activity, road network, and combined; see main manuscript), two files are provided:
- Best optimization result (file suffix _best.txt): The best point ever found during the optimization run;
- Mean optimization result (file suffix _mean.txt, not used here): The center of the best multivariate normal distribution found by CMA-ES during the optimization process (see Hansen & Auger, 2011).
All files have the same structure, each line being the result of one optimization run. The score column corresponds to the log-likelihood; all other columns are optimized or pre-defined simulation parameters (please consult the main manuscript and the GitHub repository for detailed explanations).
2_cmaes_validation:
This folder contains the log-likelihood distributions re-calculated from the best parameter sets found by CMA-ES (see above). For each scenario, the mean and variance of the log-likelihood distribution are in the columns replay_mean and replay_var of the file XXX_replayed.txt (where XXX is the scenario).
3_best_models:
This folder contains the output of a single simulation executed with the best parameter set found for each scenario. The organization of this output is described below.
4_models_evaluation:
This folder contains the result of the calculation of performance metrics (AUC, TSS, etc). For each of the four scenarios, performance metrics are computed 100 times to obtain a distribution (see main manuscript). For each scenario, metrics are stored in the file score_distribution.txt.
5_models_complete_evaluation:
This folder contains the result of the calculation of some performance metrics at every time step during a simulation. For each scenario, metrics are stored in the file complete_evaluation_all.txt.
input_files:
This folder contains the three input files (see Supporting Information 1), and two data files (.shx and .shp) describing the area of interest and used to generate graphics.
figures, gif:
These folders will contain generated figures and gif animations;
src, cmake, build:
These folders contain all the material (source code, compilation scripts, binary folders) to run numerical simulations (see instructions below);
scripts:
This folder contains lower-level scripts used to run the pipeline;

4.2) Content description

• `1_simulation_results` files content:

Column exec: Relative path of MoRIS executable file;
Column map: Relative path of the map file;
Column network: Relative path of the network file;
Column sample: Relative path of the sample file;
Column typeofdata: Type of experimental data. In the case of this study, always "PRESENCE_ABSENCE";
Column optimfunc: Optimization function used to calculate the minimization score (here, always "LOG_LIKELIHOOD");
Column law: Law of the probability distribution of dispersal length (in number of cells; always "LOG_NORMAL");
Column seed: Seed of the pseudo-random numbers generator;
Column iters: Number of iterations (here, always 25 years);
Column reps: Number of repetitions (here, always 1,000);
Column wmin: Minimal connection weight between cells;
Column pintro: Probability of presence in the cell of introduction (always 1 in this study);
Column humanactivity: Boolean indicating if human activity metrics should be used in simulations;
Column cell_id: Identifier of the cell of introduction;
Column xintro: X-coordinate of the point of introduction;
Column yintro: Y-coordinate of the point of introduction;
Column mu: Best mu value found by CMA-ES;
Column lambda: Best lambda value found by CMA-ES;
Column sigma: Best sigma value found by CMA-ES;
Column gamma: Parameter never used here (always 0);
Column w1: Best category I road weight found by CMA-ES;
Column w2: Best category II road weight found by CMA-ES;
Column w3: Best category III road weight found by CMA-ES;
Column w4: Best category IV road weight found by CMA-ES;
Column w5: Category V road weight (always 0 in this study);
Column w6: Category VI road weight (always 0 in this study);
Column score: Best log-likelihood found by CMA-ES ("score" is a generic term because other scores are possible, see MoRIS software);

• `2_cmaes_validation` files content:

File content is similar to the above, apart from two columns of interest:

Column replay_mean: Mean of the distribution of re-calculated log-likelihoods;
Column replay_var: Variance of the distribution of re-calculated log-likelihoods;

• `3_best_models` files content:

For each scenario, the output folder contains the typical output of a simulation:

The final_state.txt file contains the final state of the simulation (here, after 25 simulated years). Similar files, with an identical structure, are also generated for each time step (from 0 to 24 years). These files are structured as follows:

Column id: Cell identifier;
Column x: Cell X-coordinate (centroid);
Column y: Cell Y-coordinate (centroid);
Column y_obs: Number of experimental presences;
Column n_obs: Total number of experimental observations;
Column p_obs: Proportion of positive observations;
Column total_nb_intros: Total number of simulated introductions;
Column mean_nb_intros: Mean number of simulated introductions per repetition;
Column var_nb_intros: Variance of the number of simulated introductions per repetition;
Column y_sim: Number of simulated presences;
Column n_sim: Number of simulated observations;
Column p_sim: Proportion of simulated presences;
Column mean_first_invasion: Mean time of the first invasion (in years);
Column var_first_invasion: Variance of the time of the first invasion (in years);
Column mean_last_invasion: Mean time of the last invasion (in years);
Column var_last_invasion: Variance of the time of the last invasion (in years);
Column L: Likelihood;
Column empty_L: Log-likelihood with no simulated invasion;
Column max_L: Log-likelihood when the simulation perfectly matches the experimental data;
Column empty_score: Score with no simulated invasion;
Column score: Score when the simulation perfectly matches the experimental data;

The lineage_tree.txt file contains the complete list of HMD events during a simulation (for all repetitions), which makes it possible to reconstruct the spread history. Each line describes one HMD event:

Column repetition: Repetition of the simulation;
Column start_node: Identifier of the starting cell of the HMD event;
Column end_node: Identifier of the ending cell of the HMD event;
Column geodesic_dist: "Geodesic" distance between cells (shortest distance between the two cells on the connectivity graph, in number of cells);
Column euclidean_dist: Euclidean distance (in meters) between cell centroids;
Column iteration: Iteration (here, current year of the simulation);
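As an illustration of how the spread history might be reconstructed from these events, the sketch below keeps, for one repetition, the earliest HMD event that reaches each cell. It works on tuples already extracted from the columns repetition, start_node, end_node and iteration; reading the file itself is left out because its delimiter is not described here. `first_arrivals` is a hypothetical name and the events are made up.

```python
def first_arrivals(events, repetition):
    """Map each reached cell to (year, source cell) of its earliest HMD event.

    `events` is an iterable of (repetition, start_node, end_node, iteration)
    tuples; only the requested repetition is considered.
    """
    arrivals = {}
    for rep, start, end, year in events:
        if rep != repetition:
            continue
        # Keep the earliest event that ends in this cell.
        if end not in arrivals or year < arrivals[end][0]:
            arrivals[end] = (year, start)
    return arrivals


# Made-up events: in repetition 1, cell 15 is reached twice (years 3 and 5).
events = [
    (1, 10, 11, 1),
    (1, 11, 15, 3),
    (1, 10, 15, 5),
    (2, 10, 12, 1),
]
history = first_arrivals(events, repetition=1)
print(history)
```

Following the source cells backwards from any cell in `history` gives one possible invasion route back to the cell of introduction.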

The parameters.txt file contains all the input parameters of the simulation (see MoRIS software for a full description).

• `4_models_evaluation` files content:

For each scenario, the file score_distribution.txt contains the result of the calculation of performance metrics:

Column REP: Repetition;
Column likelihood: Likelihood;
Column empty_likelihood: Likelihood when there is no simulated presence;
Column max_likelihood: Likelihood when the simulation perfectly matches experimental data;
Column empty_score: Log-likelihood when there is no simulated presence;
Column score: Log-likelihood;
Column AUC: Area under the (ROC) curve;
Column d_th: Partition threshold minimizing the Euclidean distance to the top-left corner (max(TPR) and min(FPR)) of the ROC curve (threshold used to split simulated data in presences/absences, see main manuscript);
Column d: Euclidean distance to the top-left corner (max(TPR) and min(FPR)) of the ROC (receiver operating characteristic) curve;
Column TPR: Sensitivity;
Column FPR: 1-Specificity;
Column ACC_th: Partition threshold maximizing the accuracy score;
Column ACC: Accuracy score;
Column F1_th: Partition threshold maximizing the F1 score;
Column F1: F1 score;
Column KAPPA_th: Partition threshold maximizing the standard Kappa score;
Column KAPPA: Standard Kappa score;
Column QDIS: Quantity disagreement;
Column ADIS: Allocation disagreement;
Column TSS_th: Partition threshold maximizing the true skill statistic;
Column TSS: True skill statistic;
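For readers less familiar with threshold-based metrics, the sketch below shows the standard textbook definitions of several of the columns above, computed from a confusion matrix at a given partition threshold. It is not the pipeline's evaluation.R code. Treating a cell as a simulated presence when its simulated proportion is greater than or equal to the threshold is an assumption, as are the function names; the example values are made up.

```python
import math


def confusion_counts(p_sim, observed, threshold):
    """Count TP, FP, FN, TN when cells with p_sim >= threshold are presences."""
    tp = fp = fn = tn = 0
    for p, obs in zip(p_sim, observed):
        predicted = p >= threshold  # assumption: inclusive threshold
        if predicted and obs:
            tp += 1
        elif predicted:
            fp += 1
        elif obs:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard definitions; requires at least one observed presence and absence."""
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return {
        "TPR": tpr,
        "FPR": fpr,
        "d": math.hypot(fpr, 1.0 - tpr),  # distance to the point TPR=1, FPR=0
        "ACC": (tp + tn) / (tp + fp + fn + tn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "TSS": tpr - fpr,  # sensitivity + specificity - 1
    }


# Made-up example: four cells, two of them with observed presences.
p_sim = [0.9, 0.6, 0.4, 0.1]
observed = [True, False, True, False]
result = metrics(*confusion_counts(p_sim, observed, threshold=0.3))
print(result)
```

A threshold column such as TSS_th would then be obtained by scanning candidate thresholds and keeping the one with the best value of the corresponding metric.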

• `5_models_complete_evaluation` files content:

For each scenario, the file complete_evaluation_all.txt contains the result of the calculation of some performance metrics at every time step of a simulation:

Column rep: Repetition;
Column t: Time step (in years);
Column logL: Log-likelihood;
Column AIC: Corresponding AIC;
Column nb_colonies: Total number of simulated colonies (i.e. presences);
Column AUC: Area under the curve;
Column BOYCE_index: Boyce index;
Column BOYCE_pvalue: Associated p-value;

• `scripts` folder content:

This folder contains all the low-level scripts needed to run the pipeline. This includes Python and R scripts to handle simulation outputs and run post-treatments, and R scripts to generate and convert figures and the gif animation. The reader does not need to call these scripts directly, as higher-level shell scripts are provided to run the pipeline (see below).

Script validate.py: This Python script runs the validation pipeline (related to 2_cmaes_validation folder);
Script best_model.py: This Python script runs simulation examples for the best parameters set of each scenario (related to 3_best_models folder);
Script evaluate.py: This Python script evaluates performance metrics of each best scenario (related to 4_models_evaluation folder);
Script evaluation.R: R-script associated with the evaluate.py script;
Script complete_evaluation.py: This Python script evaluates performance metrics at every time step of each best scenario (related to 5_models_complete_evaluation folder);
Script complete_evaluation.R: R-script associated with the complete_evaluation.py script;
Script Print_LogLikelihood_AIC_metrics.R: This R-script displays log-likelihood and AIC for each best scenario;
Script Print_Evaluation_metrics.R: This R-script displays performance metrics for each best scenario;
Script Figure3.R: This R-script generates Figure 3 of the main manuscript;
Script Figure4.R: This R-script generates Figure 4 of the main manuscript;
Script Figure5.R: This R-script generates Figure 5 of the main manuscript;
Script FigureS2.R: This R-script generates the supplementary Figure S2;
Script Figures_SupportingInformation3.R: This R-script generates the figures of the Supporting Information 3 document;
Script Figure1_SupportingInformation4.R: This R-script generates Figure 1 of the Supporting Information 4 document;
Script Figure3_SupportingInformation4.R: This R-script generates Figure 3 of the Supporting Information 4 document;
Script Figure4_SupportingInformation4.R: This R-script generates Figure 4 of the Supporting Information 4 document;
Script AnimationS1.R: This R-script generates the components of the Animation S1 gif;

• Simulation framework:

The src, cmake and build folders contain the source code (C++), compilation scripts and executables of the simulation framework. Please consult the MoRIS GitHub page for more details. The folders are organized as follows:

src folder:
- HMD_model_run.cpp: Main simulation executable;
- lib folder:
  - Enums.h: Enumerations;
  - Prng.h: Pseudo-random numbers generator class declaration;
  - Prng.cpp: Pseudo-random numbers generator class definition;
  - Node.h: Node (here, corresponding to map cells) class declaration;
  - Node.cpp: Node class definition;
  - Graph.h: Graph (here, corresponding to the road network) class declaration;
  - Graph.cpp: Graph class definition;
  - Parameters.h: Simulation parameters class declaration;
  - Parameters.cpp: Simulation parameters class definition;
  - Simulation.h: Simulation class declaration;
  - Simulation.cpp: Simulation class definition;
cmake folder:
- make_clean.sh: Make clean script;
- make_debug.sh: Compilation script in debug mode;
- make_release.sh: Compilation script in optimized mode;
- modules folder:
  - Config.h.in: Header file containing the versioning of the software;
  - FindGSL.cmake: Module used by CMake to find the GSL library;
build folder:
- bin folder: will contain the binary executable HMD_model_run after compilation;

5) Running the pipeline

5.2) Supported platforms and dependencies

Successfully tested on Unix/Linux and macOS platforms.

5.3) Dependencies

A C++ compiler (GCC, LLVM, ...);
CMake (command line version);
GSL for C/C++;
CBLAS for C/C++;
Python ≥ 3 (packages CMA-ES and numpy are required);
R (packages ggplot2, cowplot, ggpubr, sf, viridis and scales are required);
ImageMagick;
poppler;
pdf2svg;
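Before running the pipeline, it can be convenient to check that the command-line tools from the list above are installed. The sketch below looks for one typical executable per dependency on the PATH. The executable names are assumptions (for instance, which poppler or ImageMagick command the pipeline scripts actually call is not stated here), and library-only dependencies (CBLAS) as well as the Python and R packages are not checked.

```python
import shutil

# Typical executable names per dependency. These are assumptions for
# illustration; the pipeline scripts may call different commands.
CANDIDATES = {
    "C++ compiler": ["g++", "clang++", "c++"],
    "CMake": ["cmake"],
    "GSL": ["gsl-config"],
    "Python": ["python3"],
    "R": ["Rscript"],
    "ImageMagick": ["magick", "convert"],
    "poppler": ["pdftoppm", "pdftocairo"],
    "pdf2svg": ["pdf2svg"],
}


def check_dependencies(candidates=CANDIDATES):
    """Return {dependency: True/False}, True if any candidate is on the PATH."""
    return {
        name: any(shutil.which(exe) is not None for exe in executables)
        for name, executables in candidates.items()
    }


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'not found'}")
```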

5.4) Compile the simulation executable

To compile the executable, navigate to the folder cmake, and run the following command line in a terminal:

sh make_release.sh

5.5) Run the validation of the CMA-ES outputs

To compute the log-likelihood distribution of the parameter sets found by the optimization algorithm (100 repetitions, see main manuscript), run the following command line in a terminal:

sh A_run_validation.sh

Resulting files will be saved in the folder 2_cmaes_validation.
This script may take several hours to complete.

5.6) Find and run the best parameters set of each scenario

The next script finds the best parameter set of each model by comparing the average log-likelihoods and selecting the lowest one (see main manuscript). The script then launches a simulation with N=1,000 repetitions. Run the following command line in a terminal:

sh B_run_best_models.sh

Resulting files will be saved in the folder 3_best_models.
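The selection step can be pictured as follows. This is only a sketch of the idea, not the code of best_model.py: it assumes that the validation files are whitespace-delimited with a header line naming the columns, and that the "average log-likelihood" is the replay_mean column. `read_table` and `select_best` are hypothetical names and the values are made up.

```python
def read_table(lines):
    """Read a whitespace-delimited table with a header line into a list of dicts.

    Assumption: the first line holds the column names.
    """
    rows = iter(lines)
    header = next(rows).split()
    return [dict(zip(header, line.split())) for line in rows if line.strip()]


def select_best(rows):
    """Return the row with the lowest replay_mean, as described above."""
    return min(rows, key=lambda row: float(row["replay_mean"]))


# Made-up validation table with two optimization runs.
example = [
    "mu sigma replay_mean replay_var",
    "1.2 0.5 310.4 2.1",
    "0.9 0.7 295.8 1.7",
]
best = select_best(read_table(example))
print(best)
```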

5.7) Compute performance metrics distributions

To compute the various performance metrics associated with each calibrated model (see main manuscript), run the following scripts:

sh C_compute_evaluation_distributions.sh

And:

sh D_compute_complete_evaluation_distributions.sh

This operation could also take some time. Resulting files will be saved in the folders 4_models_evaluation and 5_models_complete_evaluation.

5.8) Generate the figures of the manuscript

To generate the figures of this manuscript, simply execute the following script (the Unix libraries poppler, pdf2svg and ImageMagick are needed, as well as the R-packages ggplot2, cowplot, sf, ggpubr, viridis and scales):

sh E_generate_figures.sh

All the figures are saved in the folder figures. The AnimationS1 gif is saved in the folder gif.

5.9) Convert figures

To convert figures to PNG and SVG formats, run:

sh F_convert_figures.sh

Converted figures are saved in the folder figures.
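For convenience, the steps of section 5 can be chained in a small driver script. The sketch below is not part of the repository. It assumes that the shell scripts A to F sit at the root of the extracted pipeline (only the location of make_release.sh, in the cmake folder, is stated above), and it defaults to a dry run that only lists the commands, since the validation step may take several hours.

```python
import subprocess
from pathlib import Path

# (working directory relative to the pipeline root, script), in the order of
# section 5. Assumption: scripts A to F are located at the pipeline root.
STEPS = [
    ("cmake", "make_release.sh"),
    (".", "A_run_validation.sh"),
    (".", "B_run_best_models.sh"),
    (".", "C_compute_evaluation_distributions.sh"),
    (".", "D_compute_complete_evaluation_distributions.sh"),
    (".", "E_generate_figures.sh"),
    (".", "F_convert_figures.sh"),
]


def run_pipeline(root=".", dry_run=True):
    """Run each step in order and stop at the first failure.

    Returns the list of (working directory, command) pairs. With
    dry_run=True nothing is executed.
    """
    planned = []
    for folder, script in STEPS:
        cwd = Path(root) / folder
        command = ["sh", script]
        planned.append((str(cwd), command))
        if not dry_run:
            # check=True raises CalledProcessError if the script fails.
            subprocess.run(command, cwd=cwd, check=True)
    return planned


plan = run_pipeline(dry_run=True)
for cwd, command in plan:
    print(cwd, " ".join(command))
```

Calling `run_pipeline(dry_run=False)` from the pipeline root would execute the steps one after the other.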