Data from: An assessment of sampling strategies for surveying introgression in watersheds
Data files
Mar 10, 2026 version files (767.92 MB)
- assessmentSampStrat_CJFAS2026.zip (767.91 MB)
- README.md (12.46 KB)
Abstract
Extensive translocation and stocking of non-native fish have resulted in widespread patterns of hybridization and genomic introgression. Unfortunately, current methodologies used to survey the disruption of natural patterns of genetic variation often involve sampling from single, accessible sites in a watercourse. This leads to an overestimation of non-native admixture, as modern sampling sites are correlated with historical stocking locations where trail, road, or rail access was available. These strategies can lead to inflated, imprecise estimates of admixture and, subsequently, inappropriate conservation measures. Here, we use an individual-based demographic and genetic model to simulate introgression. Then, we use iterative resampling of simulations to determine the relative importance of spatial site arrangement, number of sites, reach length along a watercourse, and number of fish sampled. Results reinforce and demonstrate expectations of statistical theory: that the most representative sampling strategies are those that use systematically distributed sampling locations and multiple sites per watershed. This study provides recommendations about robust sampling design and outlines the risks of poor sampling decisions that can lead to negative conservation outcomes. The dataset is separated into the three watersheds that were simulated. All files required to run the simulations and output datasets are contained within the respective folders. Technical descriptions of the steps and parameterization for the spatial modelling and demographic modelling are included in two separate PDFs. Data and scripts for the final analysis of all watersheds are contained in the Analysis folder. Note that the spatial and demographic modelling was built for the sampling strategy analysis and so may not reflect actual hydrological or biological conditions in these watersheds.
Dataset DOI: 10.5061/dryad.djh9w0wcr
Description of the data and file structure
Dataset Overview
This dataset was generated from spatial modelling (completed with UNICOR and QGIS), spatial demographic modelling (completed with CDMetaPop), sampling algorithms (completed in R and Stan), and a final synthesis of outputs (completed in R). All input files needed to run the models on Windows or Linux are included. In addition, output files are included from UNICOR, the sampling algorithms, and the final synthesis.
Directory Structure
Within the assessmentSampStrat_CJFAS2026.zip folder, the directory structure is as follows:
Folder: Analysis
Folder: BowRiver
Folder: SprayReservoir
Folder: StarCreek
Software
These simulations require QGIS 3.22 and UNICOR v3.0 (https://github.com/ComputationalEcologyLab/UNICOR) to create the spatial input files. CDMetaPop v2.74 (https://github.com/ComputationalEcologyLab/CDMetaPOP) was then used to produce the landscape simulation data. All data processing and analysis were completed in R 4.4.1 (https://www.r-project.org/) and Stan (https://mc-stan.org/).
Folder: Analysis
This folder contains the R code (within the Scripts folder) that was used to process output from CDMetaPop and the final data files (.rds) that were created.
Subfolder: Scripts
Code is provided within the Analysis/scripts folder to complete each step:
- CDMPdataReformatting.R: Process each year of data outputs from CDMetaPop into dataframes. This script produces all_sample_individuals.rds files within the cdmp_outputs folders.
- bow_sampling.R, spray_sampling.R, star_sampling.R: Run sampling algorithm that selects individual fish based on the four variables of sampling strategy (spatial site arrangement, number of sites, reach length, and number of fish sampled). This produces samplingResults.rds files.
- calculate_metrics.R: Combine all watersheds into a single dataframe and calculate metrics that will be used to assess the sampling strategies. This produces the accuracy_ALL.RDS file, which is also included in the Analysis folder.
- results.R: Calculate summary statistics and models for the manuscript. This includes the overall mean error, the random forest analysis, root mean square errors, the linear model, and the contingency table for classification accuracy.
- linearModel.stan: Called by results.R to build a linear model in RStan.
- figures.R: Create figures used in the manuscript. Some minor figure compilation and label adjustments were completed in Inkscape software.
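As a rough illustration of one summary statistic named in the list above, the sketch below computes a root mean square error of the meanError column, grouped by a sampling-strategy variable. It is written in Python rather than R so it can run standalone, and it is not the authors' results.R code: the grouping by spat.strat, the function name rmse_by_group, and the toy rows are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def rmse_by_group(rows, group_key="spat.strat", error_key="meanError"):
    """Return {group value: root mean square of the error column}."""
    squared = defaultdict(list)
    for row in rows:
        squared[row[group_key]].append(row[error_key] ** 2)
    return {group: math.sqrt(sum(vals) / len(vals)) for group, vals in squared.items()}


# Made-up rows for illustration only; these are NOT values from the dataset.
toy_rows = [
    {"spat.strat": "conv", "meanError": 0.03},
    {"spat.strat": "conv", "meanError": 0.04},
    {"spat.strat": "syst", "meanError": -0.01},
    {"spat.strat": "syst", "meanError": 0.01},
]

if __name__ == "__main__":
    for strategy, rmse in sorted(rmse_by_group(toy_rows).items()):
        print(f"{strategy}: RMSE = {rmse:.4f}")
```

Squaring before averaging means that over- and underestimates do not cancel, which is why RMSE can be non-zero even when the mean error of a strategy is close to zero.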
File: accuracy_ALL.rds
This file contains the final data output from the study. Variables are as follows:
- trueMEAN: The mean admixture coefficient for all fish (census) within a simulated watershed, at the end of the simulation period. This is the "true" value against which other estimates of admixture are compared. A value of 0 is a native Westslope Cutthroat Trout, and 1 is an introduced Rainbow Trout. This is referred to in the manuscript as the "true admixture coefficient."
- trueCAT: The conservation category that the censused population would fall into based on its mean admixture coefficient. Values are "core" (less than 1% non-native admixture), "cons" (1-5% non-native admixture), or "sport" (over 5% non-native admixture).
- trueMEDIAN: The median admixture coefficient for all fish (census) within a simulated watershed, at the end of the simulation period.
- trueMDCAT: The conservation category that the censused population would fall into based on its median admixture coefficient. Values are "core" (less than 1% non-native admixture), "cons" (1-5% non-native admixture), or "sport" (over 5% non-native admixture).
- spat.strat: The spatial site arrangement, i.e. how sampling locations are chosen on the landscape. Values are "conv" (convenience sites), "syst" (systematic sites), "rand" (random sites), or "grts" (stratified random sites).
- num.site: The number of sampling sites per watershed (1, 5, 10, or 15 sites).
- reach.len: The length of stream sampled per site (100 m, 300 m, or 500 m).
- n.fish: The number of age-2+ fish that were planned to be sampled per watershed (20, 60, 120, or 240 fish).
- true.nfish: The number of fish that were actually sampled from the watershed, when the planned "n.fish" number was not attainable. This occurred if fewer fish occupied the sampling sites than needed to be sampled.
- year: The year of the simulation when sampling occurred (year 100 or year 120).
- experiment: The simulation number. Used for simulation version control.
- resample: The iteration of the sampling algorithm for each simulation background (each iteration selects different individual fish under the same sampling strategy). Values are iteration X1 to X5.
- sampMEAN: The mean admixture of the fish collected from a sampling strategy. This is referred to in the manuscript as the "sample admixture estimate."
- sampCAT: The conservation category that the mean admixture estimate would indicate. Values are "core" (less than 1% non-native admixture), "cons" (1-5% non-native admixture), or "sport" (over 5% non-native admixture).
- sampACC: A numeric score for how well the conservation category from the sample matches the true population category. Values are 0 (sample category and true population category match), 1 (sample overestimates by one category), 2 (sample overestimates by two categories), -1 (sample underestimates by one category), or -2 (sample underestimates by two categories).
- sampMEDIAN: The median admixture of the fish collected from a sampling strategy.
- sampMDCAT: The conservation category that the median admixture estimate would indicate. Values are "core" (less than 1% non-native admixture), "cons" (1-5% non-native admixture), or "sport" (over 5% non-native admixture).
- sampMDACC: A numeric score for how well the conservation category from the sample matches the true population category (category indicated by the median). Values are 0 (sample category and true population category match), 1 (sample overestimates by one category), 2 (sample overestimates by two categories), -1 (sample underestimates by one category), or -2 (sample underestimates by two categories).
- meanError: The difference between the mean sample admixture estimate and the mean population admixture.
- medianError: The difference between the median sample admixture estimate and the median population admixture.
- patchvars: The patchvars file that was used to generate the simulation background. Values describe the full stocking regime used.
- stockLocation: Locations in the watershed where non-native fish were stocked in the simulation background. Values are "fiveXing" (five road crossing locations), "main" (mainstem), "main+Xing" (mainstem and a road crossing), "Xing" (one road crossing location).
- stockFreq: How often the watershed was stocked with non-native fish in the simulation background. Values are "ann" (annually), "bi" (every two years), "ten" (every 10 years), "rare" (every 25 years).
- stockNumber: The number of non-native fish stocked during each stocking event in the simulation background. Values are 2000, 4000, 10000, 20000, 50000 or 100000 fish.
- watershed: Which watershed was modelled in the simulation background.
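The category and accuracy variables above follow simple rules that can be sketched in a few lines. The example below is a minimal Python illustration, not the authors' calculate_metrics.R code. It uses the documented thresholds ("core" below 1%, "cons" 1-5%, "sport" above 5%) and the documented sampACC coding. The function names are hypothetical, and the sign of meanError (sample minus true) is an assumption, since the description only says "difference".

```python
# Categories ordered from least to most non-native admixture.
CATEGORY_ORDER = ["core", "cons", "sport"]


def conservation_category(admixture):
    """Map an admixture coefficient in [0, 1] to a conservation category."""
    if not 0.0 <= admixture <= 1.0:
        raise ValueError("admixture coefficient must lie in [0, 1]")
    if admixture < 0.01:       # less than 1% non-native admixture
        return "core"
    if admixture <= 0.05:      # 1-5% non-native admixture
        return "cons"
    return "sport"             # over 5% non-native admixture


def category_accuracy(sample_cat, true_cat):
    """Signed category offset: positive means the sample overestimates."""
    return CATEGORY_ORDER.index(sample_cat) - CATEGORY_ORDER.index(true_cat)


def summarise(samp_mean, true_mean):
    """Derive sampCAT, trueCAT, sampACC and meanError for one sampling event."""
    samp_cat = conservation_category(samp_mean)
    true_cat = conservation_category(true_mean)
    return {
        "sampCAT": samp_cat,
        "trueCAT": true_cat,
        "sampACC": category_accuracy(samp_cat, true_cat),
        # Assumption: error is sample minus true, so positive = overestimate.
        "meanError": samp_mean - true_mean,
    }


if __name__ == "__main__":
    print(summarise(samp_mean=0.08, true_mean=0.03))
```

The same functions apply to the median-based variables (sampMDCAT, trueMDCAT, sampMDACC, medianError) by passing medians instead of means.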
File: randomForestModel.rds
This file contains the random forest model that was used to assess the relative importance of the different aspects of the sampling strategy.
Folders: BowRiver, SprayReservoir, StarCreek
These three folders contain the files used to set up the simulation models, first in UNICOR, then in CDMetaPop. Each folder represents a watershed that was modelled and contains the same subfolders. Please see the user manuals for CDMetaPop and UNICOR for further details on setting up the individual software.
Subfolder: bow/spray/star_unicor
This folder contains all files needed to generate cost-distance surfaces in UNICOR. The .rip files (_completeBarrier.rip, _costDistance.rip, and _partialBarrier.rip) are the parameterization files for running UNICOR. Each of the .rip files calls on the .asc files (_conductance.asc, _flowAcc.asc, _resistance.asc, and _streamGrid.asc), as well as a .csv file of XY coordinates for each patch, to generate the cost-distance surfaces. The cost-distance surfaces are saved as .cdmatrix.csv and .condmatrix.csv. Finally, processing through R (with script unicor_processing.R, contained in the Analysis/scripts folder) yields the _weibullXbarriers_probmatrix.csv file, which is used as input to CDMetaPop.
Subfolder: cdmats
Cost-Distance Matrices
This folder is called on by CDMetaPop and should only contain the bow/spray/star_weibullXbarriers_probmatrix.csv file that was generated by UNICOR and R.
Subfolder: classvars
Age Class Variables
This folder contains the age class variable files (ClassVars_RNTR.csv for Rainbow Trout, and ClassVars_WSCT.csv for Westslope Cutthroat Trout) that are used by CDMetaPop. Variable names, descriptions, values, and sources are described in Supplementary Materials B.
Subfolder: genes
Genetics
This folder contains the allele frequency files for Rainbow Trout (allelefrequencyRNTR.csv) and Westslope Cutthroat Trout (allelefrequencyWSCT.csv) that are used by CDMetaPop. Variables are "Allele List" (describes the loci and alleles present) and "Frequency" (describes the frequency of each allele at model initiation).
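A quick sanity check on files of this shape is that the allele frequencies at each locus sum to 1. The sketch below shows one way to do that in Python. The column names "Allele List" and "Frequency" come from the description above, but the label pattern (locus and allele encoded as, for example, "L0A1") is an assumption about the file contents, and the miniature table is made up for illustration rather than copied from the dataset.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical miniature table; the values are made up, not from the dataset.
EXAMPLE_CSV = """Allele List,Frequency
L0A0,1.0
L0A1,0.0
L1A0,0.25
L1A1,0.75
"""

# Assumed label pattern: L<locus index>A<allele index>.
LABEL = re.compile(r"^L(\d+)A(\d+)$")


def locus_totals(csv_text):
    """Sum the Frequency column per locus; returns {locus index: total}."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["Allele List"].strip()
        match = LABEL.match(label)
        if match is None:
            raise ValueError(f"unexpected allele label: {label!r}")
        totals[int(match.group(1))] += float(row["Frequency"])
    return dict(totals)


def loci_not_summing_to_one(csv_text, tol=1e-6):
    """Return the sorted locus indices whose frequencies do not sum to 1."""
    return sorted(
        locus for locus, total in locus_totals(csv_text).items()
        if abs(total - 1.0) > tol
    )


if __name__ == "__main__":
    print(locus_totals(EXAMPLE_CSV))
    print(loci_not_summing_to_one(EXAMPLE_CSV))
```

To run the check on a real file, read its contents into a string first and pass that string to loci_not_summing_to_one; if the real labels follow a different pattern, the regular expression needs to be adjusted.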
Subfolder: patchvars
Patch Variables
This folder contains the patch-level variables that are used by CDMetaPop. Each file contains variables about each patch in the watershed, which differ depending on the stocking regime of Rainbow Trout (48 different stocking regimes). File naming convention is "Patchvars_" followed by the stocking location ("fiveXing" = five road crossing locations, "main" = mainstem, "main+Xing" = mainstem and a road crossing, "Xing" = one road crossing location), the frequency of stocking ("ann" = annually, "bi" = every two years, "ten" = every 10 years, "rare" = every 25 years), and the number of fish stocked at each event (2000, 4000, 10000, 20000, 50000 or 100000). Variable names, descriptions, values, and sources are described in Supplementary Materials B.
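The three components of the naming convention can be decoded with small lookup tables, as in the Python sketch below. The codes and their meanings are taken from the description above; the function name describe_regime and the wording of its output are hypothetical. Because the separators between the filename components are not stated here, the sketch takes the three components as separate arguments rather than parsing a filename.

```python
# Codes and meanings as documented for the Patchvars naming convention.
STOCK_LOCATION = {
    "fiveXing": "five road crossing locations",
    "main": "mainstem",
    "main+Xing": "mainstem and a road crossing",
    "Xing": "one road crossing location",
}
# Stocking interval in years for each frequency code.
STOCK_FREQ_YEARS = {"ann": 1, "bi": 2, "ten": 10, "rare": 25}
STOCK_NUMBERS = {2000, 4000, 10000, 20000, 50000, 100000}


def describe_regime(location, freq, number):
    """Validate the three regime codes and return a readable description."""
    if location not in STOCK_LOCATION:
        raise ValueError(f"unknown stocking location code: {location!r}")
    if freq not in STOCK_FREQ_YEARS:
        raise ValueError(f"unknown stocking frequency code: {freq!r}")
    if number not in STOCK_NUMBERS:
        raise ValueError(f"unexpected number of fish stocked: {number!r}")
    years = STOCK_FREQ_YEARS[freq]
    interval = "every year" if years == 1 else f"every {years} years"
    return f"{number} fish per event, {interval}, location: {STOCK_LOCATION[location]}"


if __name__ == "__main__":
    print(describe_regime("main+Xing", "rare", 4000))
```

Note that the documented codes allow 4 x 4 x 6 = 96 combinations while the dataset contains 48 stocking regimes, so this function only checks that each code is valid, not that a particular combination was actually simulated.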
Subfolder: popvars
Population Variables
This folder contains the population-level variables that are used by CDMetaPop. Each file differs only in which PatchVars file it calls (variable: xyfilename), depending on the stocking regime. File naming convention is "PopVars_", followed by the experiment and run number (important only for simulation version control), followed by 0-47, which corresponds to the stocking regime (specified in the first cell of the file). Variable names, descriptions, values, and sources are described in Supplementary Materials B.
Subfolder: runvars
Model Run Variables
This folder contains the model-run variables that are used by CDMetaPop. Each file differs only in which PopVars file it calls (variable: Popvars) and in which years stocking occurred (variable: cdclimgentime). File naming convention is "RunVars_", followed by the experiment and run number (important only for simulation version control), followed by 0-47, which corresponds to the PopVars file that is called. Variable names, descriptions, values, and sources are described in Supplementary Materials B.
Subfolder: sampling_results
Sampling Results
This folder contains all the results that were generated by the sampling strategy algorithm (in scripts folder, under bow/spray/star_sampling.R). File naming convention is "samplingResults_", followed by the experiment and run number (important only for simulation version control), followed by 0-47, which corresponds to the respective RunVars and PopVars files. These files were then processed with the calculate_metrics.R script to generate the accuracy_ALL.RDS file.
