Optimizing sampling design for landscape genomics
Data files
Nov 18, 2024 version, files total 13.75 GB
- LGS_simulation_archive.tar.gz (13.75 GB)
- README.md (1.98 KB)
Abstract
Landscape genomic approaches for detecting genotype-environment associations (GEA), isolation by distance (IBD), and isolation by environment (IBE) have seen a dramatic increase in use, but there have been few thorough analyses of the influence of sampling strategy on their performance under realistic genomic and environmental conditions. We simulated 24,000 datasets across a range of scenarios with complex population dynamics and realistic landscape structure to evaluate the effects of the spatial distribution and number of samples on common landscape genomics methods. Our results show that common analyses are relatively robust to sampling scheme as long as sampling covers enough environmental and geographic space. We found that for detecting adaptive loci and estimating IBE, sampling schemes that were explicitly designed to increase coverage of available environmental space matched or outperformed sampling schemes that only considered geographic space. When sampling does not cover adequate geographic and environmental space, such as with transect-based sampling, we detected fewer adaptive loci and had higher error when estimating IBD and IBE. We found that IBD could be detected with as few as nine sampling sites, while large sample sizes (e.g., greater than 100 individuals) were crucial for detecting adaptive loci and IBE. We also demonstrate that, even with optimal sampling strategies, landscape genomic analyses are highly sensitive to landscape structure and migration — when spatial autocorrelation and migration are weak, common GEA methods fail to detect adaptive loci.
https://doi.org/10.5061/dryad.63xsj3v8s
This dataset contains a compressed tarball (.tar.gz) with the simulation data used in “Optimizing sampling design for landscape genomics”.
Description of the data and file structure
Unpack the tarball before use, for example with this bash command:
tar -xzvf LGS_simulation_archive.tar.gz
The tarball contains 960 pairs of files, one CSV file and one Variant Call Format (VCF) file for each of the 960 simulations. The VCF files hold the genomic data. The tarball also contains CSV files ending in NONNEUTS, which provide the indices of the adaptive loci corresponding to each simulated trait.
Each file is named according to this pattern:
mod-K[1 or 2]_phi[50 or 100]_m[25 or 100]_H[5 or 50]_r[30 or 60]_it-[1-10]_t-6000_spp-spp_0
The values within brackets are the low and high levels of each parameter (e.g., K1 = small population and K2 = large population) or, for it, the iteration number (1 through 10).
File name abbreviations are:
K = population size
phi = selection strength
m = migration rate
H = spatial autocorrelation
r = environmental correlation
it = iteration
See the original paper for more information on these parameters.
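The naming scheme above can be read back into parameter values programmatically. The following is a minimal sketch, not code from the original repository: the regular expression and the function name `parse_simulation_name` are illustrative assumptions, as is the `.csv` extension on the example name.

```python
import re

# Regex mirroring the documented naming scheme (an illustrative assumption,
# not taken from the LandGenSamp code base).
FILENAME_PATTERN = re.compile(
    r"mod-K(?P<K>\d+)_phi(?P<phi>\d+)_m(?P<m>\d+)_H(?P<H>\d+)"
    r"_r(?P<r>\d+)_it-(?P<it>\d+)_t-(?P<t>\d+)_spp-spp_0"
)


def parse_simulation_name(name):
    """Return the integer parameter codes embedded in a simulation file name."""
    # search() rather than fullmatch() so that prefixes (directories) and
    # suffixes (extensions, NONNEUTS markers) do not prevent a match.
    match = FILENAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    return {key: int(value) for key, value in match.groupdict().items()}


# Hypothetical file name built from the low level of every parameter.
params = parse_simulation_name("mod-K1_phi50_m25_H5_r30_it-1_t-6000_spp-spp_0.csv")
print(params)
```

The returned codes are the values as written in the file name; how they map onto the actual simulation parameters is described in the paper.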
The CSV files contain geospatial data for the simulated individuals. The columns are:
idx = individual ID (matches with VCF)
z = individual phenotypes ([trait1, trait2])
e = individual environments ([environment 0, environment 1, environment 2])
age = individual age (in model timesteps)
sex = individual sex
x = individual x coordinate
y = individual y coordinate
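As a rough illustration of working with these columns, the sketch below reads a CSV in this layout using only the Python standard library. It assumes that the z and e columns are stored as bracketed list strings (as the column descriptions suggest); the two rows, their values, and the function name `read_individuals` are made up for the example and are not drawn from the dataset.

```python
import ast
import csv
import io

# Two made-up rows in the documented column layout (all values hypothetical).
example_csv = io.StringIO(
    'idx,z,e,age,sex,x,y\n'
    '0,"[0.2, 0.8]","[0.1, 0.5, 0.9]",3,0,12.5,7.25\n'
    '1,"[0.6, 0.4]","[0.3, 0.2, 0.7]",1,1,30.0,18.75\n'
)


def read_individuals(handle):
    """Parse rows, converting the list-valued columns z and e to Python lists."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "idx": row["idx"],                # kept as text to compare with VCF sample IDs
            "z": ast.literal_eval(row["z"]),  # assumed format: "[trait1, trait2]"
            "e": ast.literal_eval(row["e"]),  # assumed format: "[env0, env1, env2]"
            "age": int(row["age"]),
            "sex": row["sex"],
            "x": float(row["x"]),
            "y": float(row["y"]),
        })
    return rows


individuals = read_individuals(example_csv)
print(individuals[0]["z"], individuals[0]["e"])
```

For a real file, pass an open file handle instead of the in-memory example. Columns are looked up by name, so any extra columns are ignored.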
The VCF files follow standard VCF formatting and have the same IDs as the CSV files.
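Because the IDs are shared, the two files of a pair can be cross-checked. The sketch below pulls sample IDs from the `#CHROM` header line, where standard VCF places sample names after the nine fixed columns, and compares them with a list of CSV idx values. The tiny in-memory VCF, the ID values, and the function name `vcf_sample_ids` are hypothetical.

```python
import io


def vcf_sample_ids(handle):
    """Return the sample IDs listed on the #CHROM header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # Columns 0-8 are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT.
            return fields[9:]
    raise ValueError("no #CHROM header line found")


# Minimal made-up VCF with two samples named "0" and "1".
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t0\t1\n"
    "1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t0|1\t1|1\n"
)

csv_ids = ["0", "1"]  # hypothetical idx values taken from the paired CSV
vcf_ids = vcf_sample_ids(example_vcf)
missing = set(csv_ids) - set(vcf_ids)
print(vcf_ids, missing)
```

An empty `missing` set indicates that every individual in the CSV has a genotype column in the VCF.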
Code/Software
The code and additional files used to generate this data are archived on Zenodo (DOI 10.5281/zenodo.14009717).
The most recent version of this code can be found on GitHub: https://github.com/AnushaPB/LandGenSamp
This dataset was generated from simulations run in Python version 3.9.7 (Van Rossum & Drake, 2009) using Geonomics version 1.3.9 (Terasaki Hart et al., 2021). We ran simulations varying population size, migration rate, selection strength, spatial autocorrelation, and environmental correlation, each at a “low” and “high” level, giving 32 unique parametrizations. We ran 10 replicates of each parametrization to capture variation in results due to stochasticity. Crossed with three sets of simulated landscapes, this gave 30 replicates of each parametrization, for a total of 960 simulations. A complete description of the methods used to collect and process this dataset is available in the corresponding paper (Bishop et al., 2024). The corresponding code used to create these simulations is archived on Zenodo (DOI 10.5281/zenodo.14009716).
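The simulation count can be checked with a few lines of arithmetic. The sketch below crosses the two levels of each of the five parameters, using the level codes from the file naming scheme, and multiplies by the 10 replicates and three landscape sets. The variable names are illustrative only.

```python
from itertools import product

# Low/high level codes for each parameter, as they appear in the file names.
levels = {
    "K": (1, 2),
    "phi": (50, 100),
    "m": (25, 100),
    "H": (5, 50),
    "r": (30, 60),
}

# Full factorial crossing of the five two-level parameters: 2**5 combinations.
parametrizations = list(product(*levels.values()))

replicates = 10      # iterations per parametrization
landscape_sets = 3   # sets of simulated landscapes

total_simulations = len(parametrizations) * replicates * landscape_sets
print(len(parametrizations), total_simulations)  # 32 960
```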