Data from: Population structure, ancestral admixture, gene flow, and landscape association of blacklegged ticks during range expansion in the Midwestern U.S.
Data files
Jan 09, 2025 version files 348.25 MB
- 03_LD_IMPUTED.vcf.gz (4.12 MB)
- 03_MISS20.vcf.gz (190 MB)
- 03.pruned.vcf.gz (12.86 MB)
- 2024.10.04_03_LD_IMPUTED.geno (41.34 MB)
- ADZE_paramfile.txt (1.98 KB)
- Analyses_script.txt (77.26 KB)
- cohort.sample_map_wotbi.txt (200 B)
- combined_vif_new_agg10_narm.rds (475.87 KB)
- combined_vif_new_agg10.rds (375.07 KB)
- combined_vif_new.rds (22.29 MB)
- concordance_calculation.xlsx (31.17 KB)
- contig.list (14.26 KB)
- coord_03 (11.17 KB)
- Data_prep_script.txt (22.53 KB)
- Dgeo.rds (9.60 KB)
- dps_distance_mat.csv (31.53 KB)
- fitted_resistance_est_dps_plus1.asc (240.38 KB)
- fitted_resistance_est_linfst_plus1.asc (240.38 KB)
- fitted_resistance_est_PG_plus1.asc (240.34 KB)
- GD.pop.new_03.rds (59.78 KB)
- genind_03.rds (6.61 MB)
- genpop_03.rds (3.52 MB)
- imputation_exclude.txt (85 B)
- input_files_QC.txt (38 B)
- linfst_distance_mat.csv (31.31 KB)
- long_intervals.list (209 B)
- Metadata.xlsx (32.98 KB)
- Midwest_triangles.shp (465.70 KB)
- omniscape_dps.ini (376 B)
- omniscape_linfst.ini (382 B)
- omniscape_PG.ini (374 B)
- PG_distance_mat.csv (31.34 KB)
- polygon_outer (723 B)
- population_stats.xlsx (11.03 KB)
- populationmap_03_ind.txt (5.41 KB)
- populationmap_03.txt (14.97 KB)
- private_allele_rich.csv (2.34 KB)
- README.md (14.19 KB)
- short_intervals.list (9.51 KB)
- sites.sf.rds (2.50 KB)
- sites.sp.rds (2.52 KB)
- smaller.interval_list (286.60 KB)
- surface_dir8_agg10.rds (3.46 MB)
- tick_private_popremoved (33.87 KB)
- ticks.eigenval (80 B)
- ticks.eigenvec (62.02 KB)
- wildareas-v3-2009-human-footprint_geo.tif (61.19 MB)
Abstract
A total of 517 Ixodes scapularis individuals from 42 populations at different spatial locations were sampled and sequenced to study neutral variation, population structure, ancestral admixture, genetic connectivity, and landscape influences on gene flow. We began with genomic data preprocessing, variant calling, variant filtering, and a concordance check. We then used the finalized dataset in variant call format (VCF), together with the spatial locations, to calculate genetic distance statistics, model isolation by distance, and calculate summary statistics. Next, we used the VCF and sample metadata to conduct principal component analysis and clustering analysis to understand population structure and ancestral admixture. To understand region-wide gene flow connectivity, we conducted effective migration surface analysis and graph network analyses to visualize dispersal routes and their extent. Lastly, we processed landscape and ecological data to conduct landscape genomic analyses of the impact of landscape on gene flow, and visualized routes of dispersal across favorable environmental conditions.
README: Population structure, ancestral admixture, gene flow, and landscape association of Blacklegged ticks during range expansion in the Midwestern U.S.
Main Author Information Name: Dahn-young Dong ORCID: 0000-0001-6284-2738 Institution: University of Wisconsin - Madison Email: ddong22@wisc.edu
Co-Author Information Name: Sean Schoville ORCID: 0000-0001-7364-434X Institution: University of Wisconsin - Madison Email: sean.schoville@wisc.edu
Date of data collection: from 2021 to 2023
Geographic location of data collection: See Metadata.xlsx
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: NA
Links to publications that cite or use the data: Genetic and landscape connectivity of Blacklegged ticks during range expansion in the Midwestern U.S. In Review at Molecular Ecology
Links to other publicly accessible locations of the data: NA
Links/relationships to ancillary data sets: NA
Was data derived from another source? No
Recommended citation for this dataset: The publication mentioned above
DATA & FILE OVERVIEW
Metadata file: Metadata.xlsx
Genomic data processing files: 03.pruned.vcf.gz cohort.sample_map_wotbi.txt concordance_calculation.xlsx contig.list imputation_exclude.txt input_files_QC.txt long_intervals.list short_intervals.list smaller.interval_list
Downstream Analysis data files: 03_LD_IMPUTED.vcf.gz 03_MISS20.vcf.gz 2024.10.04_03_LD_IMPUTED.geno ADZE_paramfile.txt Dgeo.rds GD.pop.new_03.rds Midwest_triangles.shp PG_distance_mat.csv combined_vif_new.rds combined_vif_new_agg10.rds combined_vif_new_agg10_narm.rds coord_03 dps_distance_mat.csv fitted_resistance_est_PG_plus1.asc fitted_resistance_est_dps_plus1.asc fitted_resistance_est_linfst_plus1.asc genind_03.rds genpop_03.rds linfst_distance_mat.csv omniscape_PG.ini omniscape_dps.ini omniscape_linfst.ini polygon_outer population_stats.xlsx populationmap_03.txt populationmap_03_ind.txt private_allele_rich.csv sites.sf.rds sites.sp.rds surface_dir8_agg10.rds tick_private_popremoved ticks.eigenval ticks.eigenvec wildareas-v3-2009-human-footprint_geo.tif
The use of these files is shown in the script files (Data_prep_script.txt and Analyses_script.txt).
Description of each file: the file name, what the file does, and where in the scripts its use can be found
Metadata file:
- Metadata.xlsx
- Metadata for all samples, including the 517 samples used for downstream genetic and landscape analyses. It is commonly imported in the R scripts to obtain information such as sample ID, site names, acronyms, region names, and lat-lon coordinates. It also contains information about tissue source and year of collection. Some data wrangling is needed to fit the input requirements of the R scripts.
Genomic data processing files:
- 03.pruned.vcf.gz
- 03 is an internal version code. Pruned means the variants were controlled for linkage disequilibrium (LD). VCF means variant call format, and gz means gzip compression. This is the variant call set that was pruned for LD, after filtering highly missing variants, but not yet imputed for missingness. This file is used in 1J. Genotype imputation in BEAGLE v5.4
- cohort.sample_map_wotbi.txt
- A mapping file linking each sample name to its file path, allowing the GATK datastore to be built. Used in #######1F. Variant calling - GATK 4.4.0.0
- concordance_calculation.xlsx
- A concordance check is needed to ensure that whole-genome amplified DNA can be used in the same analyses as non-amplified DNA. The concordance counts for the various categories were taken from the GATK output, and genotype concordance was calculated manually. #1K. Genotype concordance between non-amplified and amplified ticks - GATK
- contig.list
- contig.list provides the genomic contig names for the vcf.gz files, sorted in increasing order and conforming to the contig order of the original reference genome, to keep the scatter-gather process aligned while allowing for parallelism. #######1F. Variant calling - GATK 4.4.0.0
- imputation_exclude.txt
- Missing genotypes were imputed using BEAGLE, which needs a minimum number of variants per contig to allow imputation. Some contigs have too few variants and needed to be excluded; this file lists them. ######1J. Genotype imputation in BEAGLE v5.4
- input_files_QC.txt
- A list of quality control tick names to be included in the GATK variant calling pipeline. ######1K. Genotype concordance between non-amplified and amplified ticks - GATK
- long_intervals.list
- A list of long contigs, used to optimize parallelization during the RAM-intensive step of GATK variant calling. #######1F. Variant calling - GATK 4.4.0.0
- short_intervals.list
- A list of short contigs, used to optimize parallelization during the RAM-intensive step of GATK variant calling. #######1F. Variant calling - GATK 4.4.0.0
- smaller.interval_list
- An initial interval list that allows the creation of the GATK datastore. #######1F. Variant calling - GATK 4.4.0.0
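The long and short interval lists above split the contigs by length so that variant calling jobs can be batched sensibly. The sketch below shows one way such a split could be made; it is not the code used for this dataset. The contig names, lengths, and the 1 Mb threshold are hypothetical, and the actual criteria are in the script files.

```python
def split_contigs(contigs, min_long_length):
    """Partition (name, length) pairs into long and short contig names.

    Input order is preserved, so the output stays consistent with the
    reference contig order (as in contig.list).
    """
    long_names = [name for name, length in contigs if length >= min_long_length]
    short_names = [name for name, length in contigs if length < min_long_length]
    return long_names, short_names


def format_interval_list(names):
    """Render one contig name per line, the plain-text layout of a .list file."""
    return "".join(f"{name}\n" for name in names)


# Hypothetical contigs and threshold, for illustration only.
example_contigs = [
    ("contig_1", 5_000_000),
    ("contig_2", 12_000),
    ("contig_3", 2_500_000),
    ("contig_4", 800),
]
long_names, short_names = split_contigs(example_contigs, min_long_length=1_000_000)
long_text = format_interval_list(long_names)
short_text = format_interval_list(short_names)
```

Writing `long_text` and `short_text` to files would give two lists that can be passed to separate jobs, with more memory allotted to the long-contig jobs.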
Downstream Analysis data files:
- 03_LD_IMPUTED.vcf.gz
- Linkage disequilibrium controlled, missingness imputed set of variants that are ready for some of the downstream analyses.
- 03_MISS20.vcf.gz
- A set of variants for all tick samples that was controlled only for variant missingness, by removing the variants that are missing in more than 80% of the biological samples. This file is generated in ######1I. Variant missingness filtering - VCFTOOLS 0.1.17, and is used for linkage disequilibrium pruning of variants in ######3A. Principal component analysis
- 2024.10.04_03_LD_IMPUTED.geno
- A genotype file in the format readable by the SNMF analysis, used to calculate ancestral admixture. I generated this file and read it in the R script found in ######3B. SNMF via LEA 3.12.2
- ADZE_paramfile.txt
- Parameter file that allows calculation of rarefied allele richness for each population. #######2E. Rarefied private alleles via ADZE-1.0
- Dgeo.rds
- R data object that stores the great-circle geographic distances among all populations. Used in many of the landscape genetics R scripts.
- GD.pop.new_03.rds
- R data object of population-level genetic distances. It contains a list of genetic distances that are accessed as needed. Used in many of the genetic distance, isolation by distance, and landscape association analyses in the R scripts.
- Midwest_triangles.shp
- The tessellation grid shapefile needed to generate the edge values of the migration surface in the gene flow analyses. #4A. Effective Migration surface via FEEMS
- PG_distance_mat.csv
- A matrix of resistance-optimized Pop Graph (PG) genetic distance, used in the combined modeling of isolation by environment and resistance. ######6C. Isolation by Environment modeling, and co-estimating with Isolation by Distance and Isolation by Resistance via MMRR implemented in ALGATR
- combined_vif_new.rds
- Landscape rasters in SpatRaster format at 1 km resolution, ready for landscape genetic analyses after variable selection.
- combined_vif_new_agg10.rds
- Landscape rasters in SpatRaster format, ready for landscape genetic analyses after variable selection. Aggregated to 10 km resolution.
- combined_vif_new_agg10_narm.rds
- Landscape rasters in SpatRaster format, ready for landscape genetic analyses after variable selection. Aggregated to 10 km resolution, with missing-value cells removed after aggregation.
- coord_03
- population coordinates used in ######4A. Effective Migration surface via FEEMS
- dps_distance_mat.csv
- A matrix of resistance-optimized genetic distance based on 1 minus the proportion of shared alleles (dps), used in the combined modeling of isolation by environment and resistance. ######6C. Isolation by Environment modeling, and co-estimating with Isolation by Distance and Isolation by Resistance via MMRR implemented in ALGATR
- fitted_resistance_est_PG_plus1.asc
- Raster values across the study region after fitting the isolation by resistance model using PG genetic distance. This file is used in the current flow analysis in ######6B. Current Flow map via CIRCUITSCAPE and visualization
- fitted_resistance_est_dps_plus1.asc
- Raster values across the study region after fitting the isolation by resistance model using dps genetic distance. This file is used in the current flow analysis in ######6B. Current Flow map via CIRCUITSCAPE and visualization
- fitted_resistance_est_linfst_plus1.asc
- Raster values across the study region after fitting the isolation by resistance model using linearized Fst genetic distance. This file is used in the current flow analysis in ######6B. Current Flow map via CIRCUITSCAPE and visualization
- genind_03.rds
- Genetic information of individuals in an R data object. It is used in calculating some of the genetic distances in ######2A. Genetic distance via ADAGENET 2.1.10, GRAPH4LG 1.8.0, and MMOD 1.3.3, and in some plotting functions in #4B. Graph Networks via GRAPH4LG
- genpop_03.rds
- Genetic information aggregated by population in an R data object. It is used in calculating some of the genetic distances in ######2A. Genetic distance via ADAGENET 2.1.10, GRAPH4LG 1.8.0, and MMOD 1.3.3. It is also used in calculating population allele frequencies in #######3C. CONSTRUCT 1.0.5, code adapted from tutorial.
- linfst_distance_mat.csv
- A matrix of resistance-optimized linearized Fst genetic distance, used in the combined modeling of isolation by environment and resistance. ######6C. Isolation by Environment modeling, and co-estimating with Isolation by Distance and Isolation by Resistance via MMRR implemented in ALGATR
- omniscape_PG.ini
- The INI file that provides the parameters for running Omniscape to generate a gene flow current map of the study area, using the landscape resistance map fitted with Pop Graph genetic distance. ######6B. Current Flow map via CIRCUITSCAPE and visualization
- omniscape_dps.ini
- The INI file that provides the parameters for running Omniscape to generate a gene flow current map of the study area, using the landscape resistance map fitted with dps genetic distance. ######6B. Current Flow map via CIRCUITSCAPE and visualization
- omniscape_linfst.ini
- The INI file that provides the parameters for running Omniscape to generate a gene flow current map of the study area, using the landscape resistance map fitted with linearized Fst genetic distance. ######6B. Current Flow map via CIRCUITSCAPE and visualization
- polygon_outer
- A closed set of coordinates that forms the spatial boundary of the study region. Used in #######5A. Landscape data download, reformat, and variable selection and ######4A. Effective Migration surface via FEEMS
- population_stats.xlsx
- Genomic summary statistics of populations: expected heterozygosity, nucleotide diversity, and Tajima's D, with sample sizes. It is used in mapping and plotting the results. ######2D. Tajima's D via DADI ######2C. Expected heterozygosity and nucleotide diversity via POPULATIONS in STACKS 2.64
- populationmap_03.txt
- A file linking each sample name to its population sampling location. Used in ######2C. Expected heterozygosity and nucleotide diversity via POPULATIONS in STACKS 2.64, ######2D. Tajima's D via DADI, #######2E. Rarefied private alleles via ADZE-1.0, and ancestral clustering in #######3C. CONSTRUCT 1.0.5, code adapted from tutorial
- populationmap_03_ind.txt
- A list of sample name IDs. Used in ######3B. SNMF via LEA 3.12.2
- private_allele_rich.csv
- Rarefied private allelic richness generated by ADZE. The file includes site name, rarefied sample size (loci), mean, variance, and standard error. The mean values are used for visualization. Used in #######2E. Rarefied private alleles via ADZE-1.0
- sites.sf.rds
- R data object of sampling sites/population coordinates transformed into SF objects useful for downstream spatial analyses.
- sites.sp.rds
- R data object of sampling sites/population coordinates transformed into SP objects useful for downstream spatial analyses.
- surface_dir8_agg10.rds
- R data object of the conductance surface generated by the RADISH R package in preparation for landscape genetic regression analyses. It was generated with the number of neighbor directions set to 8 and aggregation to 10 km resolution. It is used in ######6A. Isolation by resistance modeling, model selection, and resistance mapping via RADISH
- tick_private_popremoved
- The raw ADZE output, which was later manually transferred to private_allele_rich.csv for visualization. The file is mentioned in #######2E. Rarefied private alleles via ADZE-1.0
- ticks.eigenval
- The eigenvalues, an input component for generating the principal component plot. This is used in ######3A. Principal component analysis
- ticks.eigenvec
- The eigenvectors, an input component for generating the principal component plot. This is used in ######3A. Principal component analysis
- wildareas-v3-2009-human-footprint_geo.tif
- The tif file (Venter et al., 2016) that I downloaded from https://www.earthdata.nasa.gov/data/catalog/sedac-ciesin-sedac-lwp3-hf-2009-2018.00 to generate the human footprint raster. The file is used in #######5A. Landscape data download, reformat, and variable selection. This dataset is not my own and is included here for the sake of reproducibility. It is under "Creative Commons Zero (CC0). There are no restrictions on the use of these data." https://www.earthdata.nasa.gov/engage/open-data-services-software-policies/data-use-policy
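Dgeo.rds above stores great-circle distances among populations. As a rough illustration of what such a matrix contains, the sketch below computes pairwise great-circle distances with the haversine formula. It is not the code used to build Dgeo.rds (that is in the R scripts); the site names, coordinates, and Earth radius value are assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two lat-lon points (degrees)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_matrix(sites):
    """Return a nested dict of pairwise distances keyed by site name."""
    names = list(sites)
    return {
        a: {b: great_circle_km(*sites[a], *sites[b]) for b in names}
        for a in names
    }


# Hypothetical site coordinates (latitude, longitude), for illustration only.
example_sites = {
    "site_A": (43.07, -89.40),
    "site_B": (44.98, -93.27),
    "site_C": (41.88, -87.63),
}
dgeo = distance_matrix(example_sites)
```

The result is symmetric with zeros on the diagonal, which is the layout expected of a population-by-population distance matrix.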
Relationship between files, if important: N/A
Additional related data collected that was not included in the current data package: N/A
Are there multiple versions of the dataset? No
Read quality was assessed using FastQC and MultiQC. Quality control of whole-genome amplification is described in Data_prep_script.txt, ######1K. Genotype concordance between non-amplified and amplified ticks - GATK
People involved with sample collection, processing, analysis and/or submission: See manuscript acknowledgement
DATA-SPECIFIC INFORMATION: see the comments within the code in the script files
Reference
Venter, O., Sanderson, E. W., Magrach, A., Allan, J. R., Beher, J., Jones, K. R., . . . Watson, J. E. M. (2016). Global terrestrial Human Footprint maps for 1993 and 2009. Scientific Data, 3(1), 160067. doi:10.1038/sdata.2016.67
Methods
Table of Contents
Part 1. Genomic Data Preprocessing, Variant Calling, Variant filtering and concordance
1A. Demultiplex and assign ID - PROCESS_RADTAGS in STACKS 2.64
1B. Adapter removal using CUTADAPT 3.5
1C. Read trimming with Trimmomatic 0.39
1D. Read mapping with reference genome using bwa-mem 0.7.17-r1188
1E. Alignment file conversion, and then resequencing bam merge - SAMTOOLS 1.16.1
1F. Variant calling - GATK 4.4.0.0
1G. Sample missingness filtering - PLINK2 v2.00a3 SSE4.2
1H. Variant filtering - GATK 4.4.0.0
1I. Variant missingness filtering - VCFTOOLS 0.1.17
1J. Genotype imputation in BEAGLE v5.4
1K. Genotype concordance between non-amplified and amplified ticks - GATK
Part 2. Genetic distance statistics, Isolation by distance and Summary stats
2A. Genetic distance via ADAGENET 2.1.10, GRAPH4LG 1.8.0, and MMOD 1.3.3
2B. Isolation by distance and Mantel correlogram
2C. Expected heterozygosity and nucleotide diversity via POPULATIONS in STACKS 2.64
2D. Tajima's D via DADI
2E. Rarefied private alleles via ADZE-1.0
Part 3. Population structure via PCA, SNMF, and CONSTRUCT
3A. Principal component analysis
3B. SNMF via LEA 3.12.2
3C. CONSTRUCT 1.0.5
Part 4. Effective Migration surface and Graph Networks
4A. Effective Migration surface via FEEMS
4B. Graph Networks via GRAPH4LG
Part 5. Landscape ecological data collection and processing
5A. Landscape data download, reformat, variable selection, NA cell treatment, and aggregation
Part 6. Isolation by resistance (IBR) and environment (IBE) analyses
6A. Isolation by resistance modeling, model selection, and resistance mapping via RADISH
6B. Current Flow map via CIRCUITSCAPE and visualization
6C. Isolation by Environment modeling, and co-estimating with Isolation by Distance and Isolation by Resistance via MMRR implemented in ALGATR
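Section 2B lists isolation by distance and a Mantel correlogram. As a rough illustration of the idea behind those analyses, the sketch below runs a simple one-tailed Mantel permutation test between a geographic and a genetic distance matrix. It is not the implementation used in the scripts (the analyses were run in R); the function names, the example matrices, and the permutation count are all assumptions made for the example.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Flatten the above-diagonal entries of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(d1, d2, permutations=999, seed=0):
    """One-tailed Mantel test: returns (observed r, permutation p-value)."""
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of d2 together to keep it a distance matrix.
        permuted = [[d2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Hypothetical sites on a line; genetic distance grows linearly with distance.
positions = [0.0, 1.0, 3.0, 6.0, 10.0]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.002 * d for d in row] for row in geo]
r_obs, p_value = mantel(geo, gen)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent, which is why an ordinary correlation test is not used here.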