Data for: Molecular footprints of Quaternary climate fluctuations in the circumpolar tundra shrub dwarf birch
Data files
Abstract
Data and code to accompany the manuscript "Molecular footprints of Quaternary climate fluctuations in the circumpolar tundra shrub dwarf birch".
We tested scenarios of the Quaternary demographic history of dwarf birch species (Betula nana L. and Betula Glandulosa Michx.) using Single Nucleotide Polymorphism (SNP) markers obtained from RAD sequencing and approximate Bayesian computation. We compared the timings of modelled population events with ice sheet reconstructions and other paleoenvironmental information to untangle the impacts of alternating cold and warm periods on the phylogeography of dwarf birch.
Details of the datasets and methods are available in the published manuscript and supplementary information: https://doi.org/10.1111/mec.70082
The raw genetic data for the study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB67938. The data and code deposited in Dryad will help enable replication of the methods of the paper.
Code.zip
Author: Maria Dance
Date: 2024-10-01
Most code is in .r or .rmd files, which are numbered in the order in which they should be run. .rmd files are best opened in RStudio. Most scripts should run in R version 4.4.1 and RStudio version 2024.04.2. Scripts number 10 and 11 were run in a Unix environment with R version 4.5.1 and relied on the packages admixr v0.9.1 (https://cran.r-project.org/web/packages/admixr/index.html) and ADMIXTOOLS v7.0.2 (doi.org/10.1534/genetics.112.145037), which was installed via Bioconda.
There is also a folder of bash scripts that were run in a unix environment.
------------ R scripts ------------
1.0_plot_process_radtags_output
This script makes plots for the Stacks process_radtags per-sample statistics. This R script is best run in a Linux environment using the syntax: $ Rscript process_radtags_stats_radchecks.r.
1.1_gstacks_stats_part1.r
This script plots the per-sample statistics output from the Stacks gstacks module.
1.2_gstacks_stats_part2.rmd
This script calculates stacks-gstacks dataset-wide statistics for the number of loci, mean coverage, and phasing.
2_calc_LD_decay.rmd
This script plots the average LD across different intervals in the genome (calculated using PLINK) to identify the window size where LD decays to a background rate of r=0.1.
3_plot_admixture_output_sampling_loc.rmd
This script plots ADMIXTURE results in a bar plot using linkage disequilibrium-pruned SNPS, which were filtered using sampling sites as populations, and excluding missing populations.
4_making_popmaps_using_admixture_output.rmd
This script makes a new popmap (Stacks populations input file) for the SNP filtering to retain more SNPs in a second stage of SNP filtering (populations run). It uses ADMIXTURE ancestry proportions to assign individuals to a genetic group, which are used as populations in the new populations run, instead of the sample sites.
5.1_plot_admixture_output_subsp.rmd
This script plots new ADMIXTURE results in a bar plot using linkage disequilibrium-pruned SNPS, which were filtered using the first run of ADMIXTURE genetic groups as populations, and excluding missing populations.
5.2_plot_radpainter_output_subsp.rmd
This script plots the output from fineRADstructure output, making coancestry matrices with dendrograms of population clustering. Script adapted from Milan Malinsky (millanek@gmail.com), in turn adapted from a Finestructure R Example by Daniel Lawson.
5.3_isolation_by_distance.rmd
This script tests for isolation by distance (IBD) in the dwarf birch genetic data, implemented using the function dartR::gl.ibd(). This test compares pairwise geographic and genetic distance matrices, using permutations to test for significance. It also tests the association between the matrix of genetic distances and a model matrix of genetic cluster membership, with the matrix of geographical distances as a covariate.
6.1_making_diyabc_input.rmd
This script makes the input genetic data files for DIYABC RF using the output of the populations module of Stacks.
6.2_plotting_LDA.rmd
This script plots the LDA ordination output from DIYABC-RF to help find the best scenario and fit to the data.
6.3_plotting_diyabc_params.rmd
This script plots the posterior distributions of DIYABC RF parameters from the best model (Model set 22).
7.1_Making metadata_for_NMDS.rmd
This script makes metadata for NMDS, and it uses admixture proportions from ADMIXTURE output to assign individuals to a genetic group. This data is then matched with other metadata to make a metadata output file that is useful for e.g. NMDS. It is the same approach as in script #4.
7.2_NMMDS.rmd
This script makes NMDS plots for individuals based on their SNPs.
8_plotting_admixed_groups_ice_sheet_map.rmd
This script makes figure 1 in the manuscript and also other map figures (S24, S33).
9_genetic_summary_stats_for_paper.rmd
This script makes genetic diversity statistics for each sampling site.
FinestructureLibrary.r
This script contains functions that need to be sourced by the 5.2_plot_radpainter_output_subsp.rmd script.
The bash scripts were run in a Linux environment.
------------Folder "bash_script_backups"------------
All these shell files can be opened in a basic text editor. For details on the parameters for each software and Unix command, see the online manuals.
2_process_radtags.sh
This script runs the Stacks module process_radtags, which demultiplexes reads on the plates and groups them into individual-level reads. This version includes the option to run radchecks, checking that the rad cut sites are intact.
2b_merge_duplicate_indivs_radcheck.sh
This script merges the duplicated individuals from the sequencing.
3_bowtie2_align_radcheck.sh
This script aligns the reads to a reference Betula nana genome using bowtie2 and sorts the output with samtools.
4_gstacks_new_apr_23.sh
This script runs the Stacks module gstacks, excluding the four populations that were identified as hybrids with another Betula species.
4_gstacks_new_apr_23.sh
This script runs the Stacks module gstacks, excluding the populations with putative Betula hybrids.
5_populations*.sh
The scripts 5_populations_* not only run the populations module of Stacks, but they then prune the SNPs in linkage disequilibrium (LD), then run ADMIXTURE on those LD-pruned SNPs. For output in phased haplotypes (version "B"), the scripts also run the programs RADpainter and finestructure.
Guide to naming convention:
v1 = used popmaps where populations represented sampling locations
v2 - used popmaps where populations represented genetic groups as calculated from admixture proportions from an ADMIXTURE analysis conducted on the v1 output.
A = single SNP per locus output (For ADMIXTURE and DIYABC RF)
B = phased haplotypes with multiple SNPs per locus output (for fineRADstructure/RADpainter only)
"exl_hybrids" excludes the four populations that were identified as hybrids with another Betula species that were not studied.
"w_hybrids" includes those hybrid populations
There are population runs for America, Eurasia, the western Eurasian nana group, and global (not specified in file name).
There was also a population run for calculating D-statistics specifically (5_populations_glandulosa_east_outgroup).
6_process_files_for_admixtools_gland
This script was run to convert population output (vcf) to an input file that was compatible with admixr v0.9.1 (https://cran.r-project.org/web/packages/admixr/index.html) and ADMIXTOOLS v7.0.2 (doi.org/10.1534/genetics.112.145037) which was installed via Bioconda.
ldPruning_relaxed.sh
This script was used for pruning SNPS that were in Linkage Disequilibrium. For details on the parameters, see the manuscript supplementary information.
Example_of_diyabc_script.sh
This is an example of the bash scripts used to run the software DIYABC RF (Collins et al. 2021). It lists the settings used in all DIYABC runs.
data.zip
Author: Maria Dance
Date: 2024-10-01
Guide to data folders:
barcodes
Contains .txt and .xlsx files with the unique DNA read barcodes and the corresponding individual ID. The files with _no_cut_site suffix contain the barcodes without the sfbl restriction enzyme cut site. These files are used in the pipeline to demultiplex the raw sequence reads from plate-level to individual-level reads.
bowtie2_logs
Contains .txt and .xlsx files with logs (summaries) taken from the output of the alignment software Bowtie2. Includes, on a per-individual basis: the number of reads pre-alignment, the number of reads post-alignment, and the number of reads lost in the alignment process. The full_ prefix refers to the dataset containing duplicates from the same individual, the merged_ prefix refers to the dataset once all the duplicate reads have been merged for each individual.
calc_LD_decay
Contains files related to the identification of sliding windows for pruning SNPs in linkage disequilibrium (LD).
Average LD was calculated across intervals in the genome in 100 kb bins using PLINK v1.9 and haplotype variant call format files (VCFs). The LD is represented by an R2 value, and the files in this folder contain R2 values for each locus, which, when visually inspected with a graph, identify a window size of 33 kb where LD decayed to a background rate of 0.1. VCF files were then filtered in windows of 33 kb, sliding by 10 kb, with a threshold of correlation coefficient (r2) > 0.2.
d_stats
Contains some text files that contain the results of D statistics and F4 ratios to test the ancestral origins of the Tundrarum population, and to formally test for admixture. These data were created using the scripts 10_statistics_gland_east_outgroup.r and 11_f4_statistics_gland_east_outgroup.r. The scripts were run in a Unix environment and relied on the packages admixr v0.9.1 (https://cran.r-project.org/web/packages/admixr/index.html) and ADMIXTOOLS v7.0.2 (doi.org/10.1534/genetics.112.145037), which was installed via Bioconda.
diyabc
Contains the input SNP files for DIYABC-RF and the output. Note that only the output for model set 22, the best-supported model, is provided due to large reftable.bin file sizes. The folder with the suffix _ratios contains the output for parameter estimation using ratios of time divided by effective population size. We estimated both types of parameters and found local posterior RMSE estimates to be lower in raw parameters.
The modelset22 folder contains model choice results (cmdline_modelchoice), parameter estimates using 100,000 simulated SNP datasets (param_est_100k), and files created using the DIYABC-RF GUI, which are useful for generating images of the various scenarios.
gstacks_logs
Contains .txt log files (summaries) on gstacks output (gstacks.txt), including per-individual: number of loci, number of reads used to genotype to loci, and mean effective coverage (efective_coverages_per_sample). The per-individual number and fraction of aligned reads and mapping quality (frac_aligned_reads_per_sample). A log file summarising the results of haplotype phasing (gstacks_phasing_results). Also includes a PDF of plots showing coverage, and number/fraction of kept reads on an individual and population basis. Here, log files are only included for the most recent gstacks run, which was used in the paper (new_april_23).
ice_sheet_data
The ice sheet data were used for plotting and comparisons with DIYABC-RF time parameter estimates. Note that due to third-party restrictions, this data is not stored here, but details on how to access the data are in the main manuscript. The shapefile with the world outline used in plotting was also stored here (data from Natural Earth, Free vector and raster map data @ naturalearthdata.com.).
metadata
Contains population metadata, including geographic coordinates (2022_final_betula_pop_coordinates.csv) and collection metadata (Betula_nana_metadata_MED.csv), and other metadata generated during the course of the study.
popmaps
Contains the .txt popmap files (individual id and corresponding population id) used as input for the populations module of stacks for filtering and generating output file formats. "pop_map_merged_w_humilis_gstacks" includes the Betula humilis population. "pop_map_exl_hybrids_" excludes the Betula humilis population, and the four populations which were identified as hybrid with another Betula species were was not studied ("pop_map_w_hybrids_sites" includes those hybrid populations). There are popmaps where the original sampling site is the population "sites", and popmaps where the population was defined using the ADMIXTURE proportions, or genetic groups ("admixture"). There are popmaps for America, Eurasia, the western Eurasian Nana group, and global (not specified in file name). Some popmaps were used for a population run specifically to calculate D-statistics, where Glandulosa East was specified as an outgroup (sample_pops_gland_east and pop_map_gland_east), and the Atlantic nana populations were specified as separate (atlantic_nana_group).
populations_output
Guide to naming convention:
v1 = used popmaps where populations represented sampling locations
v2 - used popmaps where populations represented genetic groups as calculated from admixture proportions from an ADMIXTURE analysis conducted on the v1 output.
A = single SNP per locus output (For ADMIXTURE and DIYABC RF)
B = phased haplotypes with multiple SNPs per locus output (for fineRADstructure/RADpainter only)
There is also population output from the run, which was used to calculate D statistics: gland_east_outgroup.
radchecks
This folder contains the log files, summaries, and plots generated from the process_radtags module of Stacks.
Process_radtags checks that the barcode and RAD cutsite are intact, then demultiplexes the data (grouping reads from plates to certain individuals). Includes the top lines (head) of the reads on each plate (e.g. plate1_head), the log file of the process_radtags process (plate1_process_radtags.radseq_download), a list of the unrecognised RAD sites (plate1_rad_sites_unrecog), and plots summarising the proportion of the library each individual's reads are, the number and proprtion of retained reads per individual and per sampling site. There are plots for the individuals, including duplicates (process_radtags_stats_plate1.pdf) and plots where the duplicates have been merged (process_radtags_stats_plate1_merged.pdf).
rangemaps
This folder contains shapefiles used to create Figure 1 in the manuscript. The shapefiles are species ranges which were digitised in QGIS from range maps in The Flora of North America (Furlow 2004) and the Atlas of North European vascular plants north of the Tropic of Cancer (Hultén & Fries, 1986). ------------------------------------------------------------------------------------------------------------------------------------------------------
There are also several files not in a folder, as they were created during the processing pipeline and accessed multiple times throughout the process:
- 2024_09_16_genetic_sum_stats.csv Genetic summary stats for each sampling site ( calculated in the script 9_genetic_summary_stats_for_paper.rmd).
