Data for: Molecular footprints of Quaternary climate fluctuations in the circumpolar tundra shrub dwarf birch

Name: Data for: Molecular footprints of Quaternary climate fluctuations in the circumpolar tundra shrub dwarf birch
Creator: Maria Dance

Dance, Maria 1 2

Research facility: University Centre in Svalbard

Published Sep 05, 2025 on Dryad. https://doi.org/10.5061/dryad.3xsj3txtm

Data files

Sep 05, 2025 version files 156.43 MB

code.zip

92.04 KB
data.zip

156.33 MB
README.md

14.77 KB

Abstract

Data and code to accompany the manuscript "Molecular footprints of Quaternary climate fluctuations in the circumpolar tundra shrub dwarf birch".

We tested scenarios of the Quaternary demographic history of dwarf birch species (Betula nana L. and Betula Glandulosa Michx.) using Single Nucleotide Polymorphism (SNP) markers obtained from RAD sequencing and approximate Bayesian computation. We compared the timings of modelled population events with ice sheet reconstructions and other paleoenvironmental information to untangle the impacts of alternating cold and warm periods on the phylogeography of dwarf birch.

Details of the datasets and methods are available in the published manuscript and supplementary information: https://doi.org/10.1111/mec.70082

The raw genetic data for the study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB67938. The data and code deposited in Dryad will help enable replication of the methods of the paper.

Code.zip

Author: Maria Dance
Date: 2024-10-01

Most code is in .r or .rmd files, which are numbered in the order in which they should be run. .rmd files are best opened in RStudio. Most scripts should run in R version 4.4.1 and RStudio version 2024.04.2. Scripts number 10 and 11 were run in a Unix environment with R version 4.5.1 and relied on the packages admixr v0.9.1 (https://cran.r-project.org/web/packages/admixr/index.html) and ADMIXTOOLS v7.0.2 (doi.org/10.1534/genetics.112.145037), which was installed via Bioconda.

There is also a folder of bash scripts that were run in a unix environment.

------------ R scripts ------------

1.0_plot_process_radtags_output

This script makes plots for the Stacks process_radtags per-sample statistics. This R script is best run in a Linux environment using the syntax: $ Rscript process_radtags_stats_radchecks.r.

1.1_gstacks_stats_part1.r

This script plots the per-sample statistics output from the Stacks gstacks module.

1.2_gstacks_stats_part2.rmd

This script calculates stacks-gstacks dataset-wide statistics for the number of loci, mean coverage, and phasing.

2_calc_LD_decay.rmd

This script plots the average LD across different intervals in the genome (calculated using PLINK) to identify the window size where LD decays to a background rate of r=0.1.

3_plot_admixture_output_sampling_loc.rmd

This script plots ADMIXTURE results in a bar plot using linkage disequilibrium-pruned SNPS, which were filtered using sampling sites as populations, and excluding missing populations.

4_making_popmaps_using_admixture_output.rmd

This script makes a new popmap (Stacks populations input file) for the SNP filtering to retain more SNPs in a second stage of SNP filtering (populations run). It uses ADMIXTURE ancestry proportions to assign individuals to a genetic group, which are used as populations in the new populations run, instead of the sample sites.

5.1_plot_admixture_output_subsp.rmd

This script plots new ADMIXTURE results in a bar plot using linkage disequilibrium-pruned SNPS, which were filtered using the first run of ADMIXTURE genetic groups as populations, and excluding missing populations.

5.2_plot_radpainter_output_subsp.rmd

This script plots the output from fineRADstructure output, making coancestry matrices with dendrograms of population clustering. Script adapted from Milan Malinsky (millanek@gmail.com), in turn adapted from a Finestructure R Example by Daniel Lawson.

5.3_isolation_by_distance.rmd

This script tests for isolation by distance (IBD) in the dwarf birch genetic data, implemented using the function dartR::gl.ibd(). This test compares pairwise geographic and genetic distance matrices, using permutations to test for significance. It also tests the association between the matrix of genetic distances and a model matrix of genetic cluster membership, with the matrix of geographical distances as a covariate.

6.1_making_diyabc_input.rmd

This script makes the input genetic data files for DIYABC RF using the output of the populations module of Stacks.

6.2_plotting_LDA.rmd

This script plots the LDA ordination output from DIYABC-RF to help find the best scenario and fit to the data.

6.3_plotting_diyabc_params.rmd

This script plots the posterior distributions of DIYABC RF parameters from the best model (Model set 22).

7.1_Making metadata_for_NMDS.rmd

This script makes metadata for NMDS, and it uses admixture proportions from ADMIXTURE output to assign individuals to a genetic group. This data is then matched with other metadata to make a metadata output file that is useful for e.g. NMDS. It is the same approach as in script #4.

7.2_NMMDS.rmd

This script makes NMDS plots for individuals based on their SNPs.

8_plotting_admixed_groups_ice_sheet_map.rmd

This script makes figure 1 in the manuscript and also other map figures (S24, S33).

9_genetic_summary_stats_for_paper.rmd

This script makes genetic diversity statistics for each sampling site.

FinestructureLibrary.r

This script contains functions that need to be sourced by the 5.2_plot_radpainter_output_subsp.rmd script.

The bash scripts were run in a Linux environment.

------------Folder "bash_script_backups"------------

All these shell files can be opened in a basic text editor. For details on the parameters for each software and Unix command, see the online manuals.

2_process_radtags.sh

This script runs the Stacks module process_radtags, which demultiplexes reads on the plates and groups them into individual-level reads. This version includes the option to run radchecks, checking that the rad cut sites are intact.

2b_merge_duplicate_indivs_radcheck.sh

This script merges the duplicated individuals from the sequencing.

3_bowtie2_align_radcheck.sh

This script aligns the reads to a reference Betula nana genome using bowtie2 and sorts the output with samtools.

4_gstacks_new_apr_23.sh

This script runs the Stacks module gstacks, excluding the four populations that were identified as hybrids with another Betula species.

4_gstacks_new_apr_23.sh

This script runs the Stacks module gstacks, excluding the populations with putative Betula hybrids.

5_populations*.sh

The scripts 5_populations_* not only run the populations module of Stacks, but they then prune the SNPs in linkage disequilibrium (LD), then run ADMIXTURE on those LD-pruned SNPs. For output in phased haplotypes (version "B"), the scripts also run the programs RADpainter and finestructure.

Guide to naming convention:

v1 = used popmaps where populations represented sampling locations
v2 - used popmaps where populations represented genetic groups as calculated from admixture proportions from an ADMIXTURE analysis conducted on the v1 output.
A = single SNP per locus output (For ADMIXTURE and DIYABC RF)
B = phased haplotypes with multiple SNPs per locus output (for fineRADstructure/RADpainter only)

"exl_hybrids" excludes the four populations that were identified as hybrids with another Betula species that were not studied.
"w_hybrids" includes those hybrid populations

There are population runs for America, Eurasia, the western Eurasian nana group, and global (not specified in file name).

There was also a population run for calculating D-statistics specifically (5_populations_glandulosa_east_outgroup).

6_process_files_for_admixtools_gland

This script was run to convert population output (vcf) to an input file that was compatible with admixr v0.9.1 (https://cran.r-project.org/web/packages/admixr/index.html) and ADMIXTOOLS v7.0.2 (doi.org/10.1534/genetics.112.145037) which was installed via Bioconda.

ldPruning_relaxed.sh

This script was used for pruning SNPS that were in Linkage Disequilibrium. For details on the parameters, see the manuscript supplementary information.

Example_of_diyabc_script.sh

This is an example of the bash scripts used to run the software DIYABC RF (Collins et al. 2021). It lists the settings used in all DIYABC runs.

data.zip

Author: Maria Dance

Date: 2024-10-01

Guide to data folders:

barcodes

Contains .txt and .xlsx files with the unique DNA read barcodes and the corresponding individual ID. The files with _no_cut_site suffix contain the barcodes without the sfbl restriction enzyme cut site. These files are used in the pipeline to demultiplex the raw sequence reads from plate-level to individual-level reads.

bowtie2_logs

Contains .txt and .xlsx files with logs (summaries) taken from the output of the alignment software Bowtie2. Includes, on a per-individual basis: the number of reads pre-alignment, the number of reads post-alignment, and the number of reads lost in the alignment process. The full_ prefix refers to the dataset containing duplicates from the same individual, the merged_ prefix refers to the dataset once all the duplicate reads have been merged for each individual.

calc_LD_decay

Contains files related to the identification of sliding windows for pruning SNPs in linkage disequilibrium (LD).

Average LD was calculated across intervals in the genome in 100 kb bins using PLINK v1.9 and haplotype variant call format files (VCFs). The LD is represented by an R2 value, and the files in this folder contain R2 values for each locus, which, when visually inspected with a graph, identify a window size of 33 kb where LD decayed to a background rate of 0.1. VCF files were then filtered in windows of 33 kb, sliding by 10 kb, with a threshold of correlation coefficient (r2) > 0.2.

d_stats

Contains some text files that contain the results of D statistics and F4 ratios to test the ancestral origins of the Tundrarum population, and to formally test for admixture. These data were created using the scripts 10_statistics_gland_east_outgroup.r and 11_f4_statistics_gland_east_outgroup.r. The scripts were run in a Unix environment and relied on the packages admixr v0.9.1 (https://cran.r-project.org/web/packages/admixr/index.html) and ADMIXTOOLS v7.0.2 (doi.org/10.1534/genetics.112.145037), which was installed via Bioconda.

diyabc

Contains the input SNP files for DIYABC-RF and the output. Note that only the output for model set 22, the best-supported model, is provided due to large reftable.bin file sizes. The folder with the suffix _ratios contains the output for parameter estimation using ratios of time divided by effective population size. We estimated both types of parameters and found local posterior RMSE estimates to be lower in raw parameters.

The modelset22 folder contains model choice results (cmdline_modelchoice), parameter estimates using 100,000 simulated SNP datasets (param_est_100k), and files created using the DIYABC-RF GUI, which are useful for generating images of the various scenarios.

gstacks_logs

Contains .txt log files (summaries) on gstacks output (gstacks.txt), including per-individual: number of loci, number of reads used to genotype to loci, and mean effective coverage (efective_coverages_per_sample). The per-individual number and fraction of aligned reads and mapping quality (frac_aligned_reads_per_sample). A log file summarising the results of haplotype phasing (gstacks_phasing_results). Also includes a PDF of plots showing coverage, and number/fraction of kept reads on an individual and population basis. Here, log files are only included for the most recent gstacks run, which was used in the paper (new_april_23).

ice_sheet_data

The ice sheet data were used for plotting and comparisons with DIYABC-RF time parameter estimates. Note that due to third-party restrictions, this data is not stored here, but details on how to access the data are in the main manuscript. The shapefile with the world outline used in plotting was also stored here (data from Natural Earth, Free vector and raster map data @ naturalearthdata.com.).

metadata

Contains population metadata, including geographic coordinates (2022_final_betula_pop_coordinates.csv) and collection metadata (Betula_nana_metadata_MED.csv), and other metadata generated during the course of the study.

popmaps

Contains the .txt popmap files (individual id and corresponding population id) used as input for the populations module of stacks for filtering and generating output file formats. "pop_map_merged_w_humilis_gstacks" includes the Betula humilis population. "pop_map_exl_hybrids_" excludes the Betula humilis population, and the four populations which were identified as hybrid with another Betula species were was not studied ("pop_map_w_hybrids_sites" includes those hybrid populations). There are popmaps where the original sampling site is the population "sites", and popmaps where the population was defined using the ADMIXTURE proportions, or genetic groups ("admixture"). There are popmaps for America, Eurasia, the western Eurasian Nana group, and global (not specified in file name). Some popmaps were used for a population run specifically to calculate D-statistics, where Glandulosa East was specified as an outgroup (sample_pops_gland_east and pop_map_gland_east), and the Atlantic nana populations were specified as separate (atlantic_nana_group).

populations_output

Guide to naming convention:

v1 = used popmaps where populations represented sampling locations

v2 - used popmaps where populations represented genetic groups as calculated from admixture proportions from an ADMIXTURE analysis conducted on the v1 output.

A = single SNP per locus output (For ADMIXTURE and DIYABC RF)

B = phased haplotypes with multiple SNPs per locus output (for fineRADstructure/RADpainter only)

There is also population output from the run, which was used to calculate D statistics: gland_east_outgroup.

radchecks

This folder contains the log files, summaries, and plots generated from the process_radtags module of Stacks.

Process_radtags checks that the barcode and RAD cutsite are intact, then demultiplexes the data (grouping reads from plates to certain individuals). Includes the top lines (head) of the reads on each plate (e.g. plate1_head), the log file of the process_radtags process (plate1_process_radtags.radseq_download), a list of the unrecognised RAD sites (plate1_rad_sites_unrecog), and plots summarising the proportion of the library each individual's reads are, the number and proprtion of retained reads per individual and per sampling site. There are plots for the individuals, including duplicates (process_radtags_stats_plate1.pdf) and plots where the duplicates have been merged (process_radtags_stats_plate1_merged.pdf).

rangemaps

This folder contains shapefiles used to create Figure 1 in the manuscript. The shapefiles are species ranges which were digitised in QGIS from range maps in The Flora of North America (Furlow 2004) and the Atlas of North European vascular plants north of the Tropic of Cancer (Hultén & Fries, 1986). ------------------------------------------------------------------------------------------------------------------------------------------------------

There are also several files not in a folder, as they were created during the processing pipeline and accessed multiple times throughout the process:

2024_09_16_genetic_sum_stats.csv Genetic summary stats for each sampling site ( calculated in the script 9_genetic_summary_stats_for_paper.rmd).

Data for: Molecular footprints of Quaternary climate fluctuations in the circumpolar tundra shrub dwarf birch

Data files

Abstract

README: Data for: Molecular footprints of Quaternary climate fluctuations in the circumpolar tundra shrub dwarf birch

Code.zip

1.0_plot_process_radtags_output

1.1_gstacks_stats_part1.r

1.2_gstacks_stats_part2.rmd

2_calc_LD_decay.rmd

3_plot_admixture_output_sampling_loc.rmd

4_making_popmaps_using_admixture_output.rmd

5.1_plot_admixture_output_subsp.rmd

5.2_plot_radpainter_output_subsp.rmd

5.3_isolation_by_distance.rmd

6.1_making_diyabc_input.rmd

6.2_plotting_LDA.rmd

6.3_plotting_diyabc_params.rmd

7.1_Making metadata_for_NMDS.rmd

7.2_NMMDS.rmd

8_plotting_admixed_groups_ice_sheet_map.rmd

9_genetic_summary_stats_for_paper.rmd

FinestructureLibrary.r

2_process_radtags.sh

2b_merge_duplicate_indivs_radcheck.sh

3_bowtie2_align_radcheck.sh

4_gstacks_new_apr_23.sh

4_gstacks_new_apr_23.sh

5_populations*.sh

6_process_files_for_admixtools_gland

ldPruning_relaxed.sh

Example_of_diyabc_script.sh

data.zip

barcodes

bowtie2_logs

calc_LD_decay

d_stats

diyabc

gstacks_logs

ice_sheet_data

metadata

popmaps

populations_output

radchecks

rangemaps