Using isolation-by-distance to jointly estimate effective population density and dispersal distance: a practical evaluation using bumble bees
Data files
Aug 21, 2025 version files (16.05 MB)
- data_and_code.zip (16.04 MB)
- README.md (10.90 KB)
Abstract
Effective population density and intergenerational dispersal distance are key aspects of population biology, but obtaining empirical estimates of these parameters can be difficult. This is especially true for my study taxa, wild bees. In this paper, I apply and evaluate an existing but underutilized method to estimate the effective density and dispersal distance of bumble bees (Bombus, Apidae). Specifically, using 10 datasets of bumble bees in North America, I use the relationship between genetic isolation-by-distance and Wright’s neighborhood size to define a density-dispersal isocline, that is, a curve describing pairs of density and dispersal values consistent with observed rates of isolation-by-distance. These parameters are inversely related; as one increases, the other decreases. I then use outside estimates of bumble bee dispersal distances to make more specific estimates of effective colony density. Compared to some prior estimates of census density (100s to 1000s colonies/km²), my estimated effective colony densities were very low (1–41 effective colonies/km²). I also hypothesize, however, that these estimates are affected by the spatial extent of sampling, due to scale-dependent patterns in the distribution of individuals. To test this hypothesis, I subsampled each dataset to simulate varying study extent, and repeated my analysis. Within populations, effective densities tended to decrease when measured across larger spatial extents. Altogether, I demonstrate a useful and under-appreciated tool for studying population biology, especially of small, mobile animals like bees, but also show that researchers must interpret their results carefully within the context of their study design.
https://doi.org/10.5061/dryad.fj6q5744r
Description of the data and file structure
Data and code to recreate analysis in “Using isolation-by-distance to jointly estimate effective population density and dispersal distance: a practical evaluation using bumble bees” by DT Simpson, published in Oecologia, 2025: https://doi.org/10.1007/s00442-025-05721-4.
A note on data, copyright, and attribution: The analyses in this paper use data that I (DTS) collected as well as data made available by other authors. Those authors' data have been previously published on Dryad, and links to those repositories can be found at the end of this readme. Given that all of these data (mine and theirs) are published under the CC0 license, no attribution is strictly necessary. I would, however, encourage anyone using these data for scientific publication to cite the original authors of each dataset.
The zip folder here is a working folder in which the analyses and figures can be reproduced. Except where otherwise noted, R scripts assume your working directory is that folder. All data needed to reproduce the work is in ‘data/in’, but I included intermediate products and outputs from my run of the analysis, so an interested person can start anywhere. Code to run the analysis is in ‘code/’ and scripts are numbered in the order in which they should be run. The 'figures/' directory is currently empty and will be populated as the code is run.
Detailed descriptions of folders and files are below.
Data/in/
- This folder contains genotype data from my own work and other publicly available datasets. These are the raw inputs for all analyses here.
- The files ending in “_specs.csv” are genotypes in a standard format that are the inputs for the first R script. File names are formatted as species_author_[suffix].csv.
- Critically, only one of these datasets was collected by me (impatiens_simpson_specs.csv), while the others are not my original work. I only include them here so that my analyses can be easily recreated.
- After obtaining these external datasets, I processed them to put them in this standard format. For simplicity, I included them here in the standard format only. Processing steps included renaming columns, giving specimens unique ID numbers, converting geographic coordinates to the same system (Albers Equal Area projection), and getting alleles into the proper format for Genepop (specifically, making sure each allele number had the proper number of digits and pasting the two alleles into a single column, rather than each allele being its own column).
- Columns in these files are unique specimen ID (ind); site name (site); easting (x) and northing (y) coordinates in Albers projection using WGS84 datum, but converted from meters to kilometers; and each of the genotyped loci. Values in these last columns are allele IDs; the first half of digits are the first allele, the second half are the second.
- The folder newJersey_impatiens_data contains more data on the bees I collected in southern New Jersey. These data are:
  - Colony_bestFSfamilyAssignment.tsv is the best (maximum likelihood) family assignment from the program Colony (Jones and Wang 2010). It includes the colony assignments for each of my bee specimens.
  - Genotypes_Simpson_noSibs.csv is a table of genotypes of my Bombus impatiens specimens from southern New Jersey, filtered to one (arbitrary) individual per colony. Columns are ‘uid’, which is a unique identifier for each specimen, and then two columns per locus, one for each allele. The difference between this table and the input file in the parent folder is just formatting.
  - Genotypes_Simpson_withSibs.csv is my full dataset of all bee genotypes. These were the data that went into the Colony analysis to determine siblingship. Columns are the same as in the noSibs version, above.
  - Simpson_specs.csv is location data for all my B. impatiens specimens. It also includes sex information because I inadvertently collected some male bees; some of these were genotyped, but none were included in later analyses. These can be safely ignored.
    - Columns are unique identifier (uid); sex of specimen (sex); site name (site); date of collection in M/D/Y format (date); and longitude (long) and latitude (lat) in decimal degree format using the WGS84 datum.
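The allele encoding used in the “_specs.csv” files (both alleles of a locus pasted into one string, with the first half of the digits being the first allele and the second half the second) can be unpacked as in the minimal sketch below. It is written in Python purely for illustration and is not part of the archived R code; the function name `split_genotype` and the example values are hypothetical.

```python
def split_genotype(code):
    """Split a pasted genotype string into its two allele IDs.

    Assumes the convention described in this readme: the first half of
    the digits is the first allele, the second half is the second.
    """
    code = str(code).strip()
    if len(code) == 0 or len(code) % 2 != 0 or not code.isdigit():
        raise ValueError(f"unexpected genotype code: {code!r}")
    half = len(code) // 2
    return code[:half], code[half:]


# Hypothetical example values, not taken from the archived data.
print(split_genotype("101105"))  # three digits per allele
print(split_genotype("0204"))    # two digits per allele
```

When loading these files, it is safest to read the locus columns as text: a numeric parser would drop leading zeros and break the half-and-half split.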
Data/intermediate/
- This folder holds intermediate products from the analysis. It currently includes:
  - Genepop_in/ holds genotype data formatted for use in the Genepop package in R (Rousset 2008). Each file here is made by running the script 1.1_DataFormatting.R, which itself applies the genepop_converter() function found in code/functions/.
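For readers unfamiliar with Genepop input files, the sketch below writes a small file in the general Genepop layout: a title line, one locus name per line, then a `Pop` line before each sample, with each individual given as `name , genotype genotype ...`. It is a Python illustration under stated assumptions, not the archived `genepop_converter()` function. The function name `to_genepop` and all example values are hypothetical, and the choice to treat each individual as its own sample named by its "x y" coordinates reflects one common setup for individual-based spatial analyses that should be checked against the Genepop documentation.

```python
def to_genepop(title, loci, samples):
    """Return the text of a Genepop-format input file.

    samples: list of samples (one "Pop" block each). Each sample is a list
    of (name, genotypes) tuples, where genotypes holds one pasted allele
    code per locus, in the same order as `loci`.
    """
    lines = [title]
    lines.extend(loci)  # one locus name per line
    for sample in samples:
        lines.append("Pop")
        for name, genotypes in sample:
            if len(genotypes) != len(loci):
                raise ValueError(f"{name}: expected {len(loci)} genotypes")
            lines.append(f"{name} , {' '.join(genotypes)}")
    return "\n".join(lines) + "\n"


# Hypothetical example: two individuals, each in its own sample and
# named by its x y coordinates (km).
samples = [
    [("12.5 40.1", ["101105", "204204"])],
    [("13.0 41.7", ["101101", "204208"])],
]
text = to_genepop("example dataset", ["locA", "locB"], samples)
print(text)
```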
Data/out/
- This holds the result outputs.
- /genepop_out/ holds the output files from running the ibd() function in the genepop package on the genotype data found in intermediate/genepop_in/
  - Files ending in ‘.txt’ contain pairwise genetic and geographic distance matrices and summary tables of the IBD regression. They should be viewed in a text editor. These are processed by the genepop_extractor() function in ~/code/functions to extract the IBD slope.
  - Files ending in ‘.txt.GRA’ are space-delimited tables containing pairwise geographic distances (natural-log transformed, left column) and genetic distances (right column). View these like you would a spreadsheet.
  - Files ending in ‘.txt.MIG’ contain pairwise genetic distance and raw geographic distance matrices. View these with a text editor.
- Ibd_estimates.csv contains the IBD regression results for each of the 10 datasets, as calculated by Genepop. Confidence intervals are calculated by Genepop's ABC (approximate bootstrap confidence) bootstrap procedure.
  - Columns are: dataset, which contains the species (and author, if > 1 dataset for that species) analyzed; n, which is the number of sites/subpopulations or individuals analyzed; metric, which denotes whether genetic distance was measured between populations using Fst (metric = pop) or between individuals using Rousset’s a (metric = a); estimate, which is the IBD slope estimate; and ci.hi and ci.low, which are the upper and lower bounds of the 95% confidence interval on the slope.
  - This table is produced by the R script 2.2_jointEstimates_Results.R.
- Neighborhood_estimates.csv is the same as the above table, but with additional columns to hold neighborhood size estimates and their confidence intervals. These estimates are simply the inverse (reciprocal) of the IBD slope and its confidence interval bounds. One unusual column: “Nb.hi.alt” is the “alternate” upper limit of the confidence bound on neighborhood size used for plotting. It is needed because four datasets have upper confidence bounds of infinity, but plots need real numbers.
- Scale_test.rds is an R object that contains the results from my subsampling analysis to test the effect of spatial scale on IBD estimates. It is a list of seven elements. Each element is the result for one species/dataset, and is itself a 1000 x 21 matrix (saved as a data frame). Each column is one spatial extent, and each row is the observed IBD from one random subsample containing 50% of the original observations. Although it is saved as a data frame, it should not be interpreted as ‘tidy data’ in which each row is an observation and, thus, data in the same row but in different columns are somehow linked. Iterations of the subsampling routine were not linked across scales. In this dataset, each column is an independent test.
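The conversion from IBD slope to neighborhood size described for Neighborhood_estimates.csv can be sketched as follows. This is a Python illustration, not the archived R code, and the function name `neighborhood_size` and the example numbers are hypothetical. Two points are worth making explicit: taking the reciprocal reverses the order of the interval bounds, and the sketch assumes that an infinite upper bound on neighborhood size arises when the lower bound of the slope interval is zero or negative.

```python
import math


def neighborhood_size(slope, ci_low, ci_hi):
    """Convert an IBD slope and its CI bounds to neighborhood size (Nb).

    Nb is the reciprocal of the slope. Inversion reverses order, so the
    upper slope bound gives the lower Nb bound and vice versa.
    Assumption: a slope bound (or estimate) <= 0 maps to infinity.
    """
    nb = 1.0 / slope if slope > 0 else math.inf
    nb_low = 1.0 / ci_hi if ci_hi > 0 else math.inf
    nb_hi = 1.0 / ci_low if ci_low > 0 else math.inf
    return nb, nb_low, nb_hi


# Hypothetical slopes, not values from Ibd_estimates.csv.
print(neighborhood_size(0.01, 0.005, 0.02))
print(neighborhood_size(0.01, -0.001, 0.02))  # unbounded upper limit
```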
Code/
- This folder contains all the code to run the analyses described in the main text and appendices.
- 1.1_DataFormatting.R takes the genotype data in ~/data/in/ and converts it to Genepop format
- 1.2_outsideEstimateOfSigma.R takes data presented in (Lepais et al. 2010) to empirically estimate the dispersal parameter sigma (σ)
- 2.1_jointEstimates_Analysis.R uses Genepop to run the IBD regression on each dataset. Results are saved to ~/data/out/genepop_out/
- 2.2_jointEstimates_Results.R pulls the results out of the Genepop output files, summarizes them, and visualizes them. Main text Figure 1 is made here.
- 3.1_scaleTest_analysis.R runs the subsampling routine to test the effect of study extent on IBD, De, and sigma estimates.
- 3.2_scaleTest_results.R compiles, summarizes, and reports the results of that analysis. Main text Figure 2 is made here.
- 4_dataSummary.R pulls data together to report some summary statistics.
- 5.1_neighborhoodSize.R makes Figure B1 to illustrate the meaning of neighborhood size
- 5.2_SpatialScale_Box.R runs the analysis and makes the figures in Box 2 and Appendix A1, illustrating the way in which scale-variant patterns in individual distributions might affect IBD estimates.
- 5.3_visualizingSigma.R demonstrates the relationship between the dispersal parameter sigma (σ) and observed Euclidean distance between parent and offspring, and makes Figure A5.1.
- Functions/ contains some R functions I wrote to help my analysis along.
  - Genepop_converter.R takes genotypes in a standard (but readable by humans) format and creates Genepop input files, which are hard-to-read-by-humans text files.
  - Genepop_extractor.R extracts IBD slopes and confidence intervals from Genepop output files.
  - Joint_estimates.R contains functions to calculate effective density (De) or dispersal (σ) from neighborhood size estimates and a putative value of the other parameter. This is simple algebra, as presented in the main text.
- Genepop_temp/ is a temporary/working folder to contain files generated by running Genepop. In the analysis R script, the working directory is changed to /genepop_temp/ so these temporary files can be summarily deleted.
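The algebra behind the joint estimates can be sketched as below, assuming the standard two-dimensional form of Wright's neighborhood size, Nb = 4πDeσ². This is a Python illustration rather than the contents of Joint_estimates.R; the function names and example values are hypothetical, and the exact form used in the paper (for example, any adjustment for haplodiploidy) should be taken from the main text.

```python
import math


def effective_density(nb, sigma):
    """Effective density De implied by neighborhood size and dispersal sigma.

    Assumes Nb = 4 * pi * De * sigma**2 (standard two-dimensional form).
    """
    return nb / (4 * math.pi * sigma ** 2)


def dispersal_sigma(nb, density):
    """Dispersal sigma implied by neighborhood size and effective density."""
    return math.sqrt(nb / (4 * math.pi * density))


# Hypothetical values: Nb = 100 with sigma = 1 km.
de = effective_density(100, 1.0)
print(round(de, 2))  # 25 / pi, about 7.96 effective colonies per km^2

# Points along the density-dispersal isocline for the same Nb.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, effective_density(100, sigma))
```

Because the coordinates in the input files are in kilometers, σ here is in km and De is per km². Holding Nb fixed and varying σ traces the isocline: doubling σ cuts the implied density to one quarter.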
Figures/ is a folder to contain figures printed by R scripts.
Access information
Other publicly accessible locations of the data:
- Data from Lozier et al 2011 can be found at: https://datadryad.org/stash/dataset/doi:10.5061/dryad.d403s
- Data from Jha and Kremen 2013 can be found at: https://datadryad.org/stash/dataset/doi:10.5061/dryad.n1922
- Data from Jha 2015 can be found at: https://datadryad.org/stash/dataset/doi:10.5061/dryad.hr4g0
- Data from Jackson et al 2018 can be found at: https://datadryad.org/stash/dataset/doi:10.5061/dryad.st7gm24
The data on *Bombus impatiens* workers from New Jersey, USA, were collected by me in the summers of 2020 and 2021. I extracted DNA using a bead-based extraction and amplified 11 microsatellite loci. These PCR products were analyzed using fragment analysis by the Rutgers Medical School Genomics Center. I assigned genotypes by first manually defining allele bins, then having software automatically assign allele IDs, and then manually checking allele IDs. Approximately 100 specimens were re-analyzed to assess error rates, which were all < 1%. Using these genotypes, I used the program Colony to determine siblingship among workers. This dataset includes the full table of all genotypes and another that is filtered to only one worker from each colony. The family assignment output file from Colony is also included. There is also a table that includes geographic location and date of collection for each specimen.
The other nine datasets were collected and processed by other authors. The original works in which these data were published are cited in the Related Works section. In short, these data include microsatellite or SNP genotypes for individual Bombus workers. These have been previously filtered to only include one worker per colony (i.e., no siblings).
