# Title of Dataset: Divergent effective population size histories shape contemporary genetic diversity gradients in a montane bumble bee --- ## Description of the Data and file structure This data set pertains to the paper "Whole genome demographic models indicate divergent effective population size histories shape contemporary genetic diversity gradients in a montane bumble bee" by Lozier et al. published in Ecology and Evolution (https://doi.org/10.1002/ece3.9778). Please see main text for references and citations to software and detailed methods. All analyses require deduplicated and sorted alignment (BAM) files generated from the raw sequence data (FASTQ) files (see bottom table below or original manuscipt for SRA accessions for genetic data). Scripts used for making alignment BAM files from FASTq data are as follows. Please see main manuscript text for references to bioinformatics software. ### Step1: Generate the alignment using BWA MEM and convert to BAM with SAMTOOLS. Readgroup information is taken from sample name and flowcell information in the FASTQ files. bwa mem -M -t 10 \ -R "@RG\tID:${RG_ID}\tSM:${SAMPLE_ID}\tPL:ILLUMINA\tLB:${BARCODE}\tPU:${RG_ID}" \ ${REFERENCE_SEQ} ${FILENAME}_1.fastq.gz ${FILENAME}_2.fastq.gz \ | samtools view -Shb - -o Vanc_BAMs/${FILENAME}.bam Step2: use PICARD to sort and deduplicate and index the BAM files prior to variant calling java -jar picard.jar SortSam I=${FILENAME}.bam O=${FILENAME}.sorted.bam SORT_ORDER=coordinate TMP_DIR=/scratch/ java -jar picard.jar MarkDuplicates I=${FILENAME}.sorted.bam O=${FILENAME}.markdup.bam METRICS_FILE=${FILENAME}_metrics.txt REMOVE_DUPLICATES=true TMP_DIR=/scratch/ java -jar picard.jar BuildBamIndex INPUT=${FILENAME}.markdup.bam REFERENCE GENOME USED FOR ANALYSES: NCBI RefSeq GCF_011952275.1 Files: Summary_Ho_Ne_PSMCstats_rangestats_Figure3data.txt: Data table for Figure 3 of main text. Includes "Sample" = sample identifier, "Site" = locality identifier, "Long" = longitude in decimal degrees, "Lat" = latitude in decimal degrees, "Elev_m" = elevation in meters, "Region" = regional group population identifier (1_CA-S = Southern California, 2_CA_N = Northern California, 3_OR_S = southern Oregon, 4_OR_N = Northern Oregon, 5-WA = Washington), "dist_to_nearest_best_cell_forallpresencecoords_m" = the calculated distance to the nearest pixel at the LGM with suitability equivalent to the mean suitability of genome sample sites (in m), "dist_to_nearest_best_cell_forsamplecoords_m" = contains the calculated distance to the nearest pixel at the LGM with suitability equivalent to the mean suitability or all spatially filtered GBIF coordinates, "rangeloss" = decline in suitability from the last glacial maximum to today (difference in pixel value between the contemporary model and the LGM maxent projection), "meanpsmcNex104" = the mean effective population size Ne from PSMC over all time (in 10^4 individuals), "max" = the maximum PSMC expansion size (10^4 individuals), "min" = the minumum last glacial maximum bottleneck (10^4 individuals), "max_min" = the PSMC bottleneck size (max - min), and "indhet" = the individual heterozygosity from ANGSD (see below). vanc_scafs_morethan500kb.bed: A mask BED format file listing the scaffolds in the B. vanvouverensis genome (GCF_011952275.1) of greater than or equal to 500kb in length. Used for masking BAM files in generating inputs for ANGSD, PSMC, and MSMC2. File contains columns for scaffold identifier, starting nucleotide position, and ending nucleotide position. GCF_011952275.1_Bvanc_JDL1245_genomic_catall500kbscaffolds.mask.bed: BED file for mappable regions used for masking (same format as "vanc_scafs_morethan500kb.bed"). SMC_Neplots.xlsx: a multisheet excel file summaring information for plotting and analyzing effective population size (Ne) trajectories using PSMC and MSMC2. -PSMC_ex_code: example of code for conducting PSMC analyses -500kbscaffoldBED: the scaffolds used in analysis with start and end coords: see "vanc_scafs_morethan500kb.bed" -PSMC_moderatecovgenomes: PSMC outputs for all 18x coverage ("moderate coverage") samples in a single dataframe (used for Figure 1B in paper). A line of R code illustrates how to plot. The columns of data include "T" = the time point in years (0 = present), "Ne" = effective population size in 10^4 individuals, "File" = PSMC output file from which data were obtained (in PSMC/poplevel_psmcoutputs in the PSMC.tar.gz directory described below), "Sample" = sample identifier (described at end of this file), "Site" = locality identifier (described at end of this file), "Lat" = latitude in decimal degrees, "Long" = longitude in decimal degrees, "Elev_m" = elevation (meters), "Region" = regional population grouping (see Summary_Ho_Ne_PSMCstats_rangestats_Figure3data.txt above) -PSMC_3highcov_bootstrap: PSMC output for the three high coverage genomes , including bootstraps in a dataframe (for Figure 1C). Columns as descrbied in the previous tab, except "File" points to data that is summarized in "PSMC/highcov_psmcoutputs/combined_3genome_bootstrapdf_result.txt" in the PSMC.tar.gz directory described below), "Replicate" = bootstrap replicate, with 0 being the actual data inference for each sample, "DataBoot" = a value of Data if the row reflects the inference or Boot if the row indicated a bootstraps in 1-100. Example code for plotting shown. Conversions from PSMC into "real world" times and populaiton sizes assume Theta=3Nu for haplodiploids where N is the effective population size and u is the mutation rate. -PSMC_mappabilitymasked_3highcov: PSMC output for analyses repeated using mappability mask (see main text for details). Conversions for T=time and Ne = effective population size columns assume Theta=3Nu for haplodiploids as above. -msmc2res_unphased_convertedscal: MSMC2 outputs run in unphased mode. Columns are "time_index" reflect the temporal bin, "left_time_boundary" = scaled time at the start of the bin (0 = present), "right_time_boundary" = scaled time at the end of the bin, "lambda" contains information on effective population size, and "sample" reflects the population grouping (see Figure S3). These columns are converted to Real World values "Time" (Time in years before present) and "Ne" (effective population size in number of individuals) for each population sample using provided mutation rate "mu" and generation time "gen" following https://github.com/jessicarick/msmc2_scripts. Conversions assume Theta=3Nu for haplodiploids. MSMC2 was run from BAM files using the scripts from https://github.com/jessicarick/msmc2_scripts, with minor modifications to the paths, and sample names. Bvanc_ENM_StabilityCalcs.xlsx: excel file containing analysis tables resulting from Maxent modeling and niche stability analyses -Coordinates were obtained from GBIF.org. Please refer to GBIF.org (14 November 2021) GBIF Occurrence Download https://doi.org/10.15468/dl.cqq8rb for the search used to obtain relevant coordinates. Please see the main Lozier et al. 2022 Methods text for the process of filtering coordinates spatially. Briefly, We removed localities likely to represent the sister species B. bifarius based on expected distributions by limiting the spatial extent to Longitude −150 to −110 decimal degrees and latitude 35 to 65 decimal degrees. We only retained observations that had specific descriptive locality information for the record (e.g., at a finer resolution than county) to minimize georeferencing inaccuracy. We filtered the data to reduce spatial bias by randomly removing points within two 10 arc-minute raster cells of each other. -Niche modeling for stability analysis used Maxent. Bioclimatic (BIOCLIM) variables were obtains from the rpaleoclim 0.9 package for R. See methods in main text for details. -stability_summaries_1 tab. This tab contains a table on stability calculations for each genome sequenced. The "sample" column contains the individual ideantifier for the sequenced genome (see table below with SRA accession numbers for FASTQ files). The "dist_to_nearest_best_cell_forallpresencecoords_m" column contains the calculated distance to the nearest pixel at the LGM with suitability equivalent to the mean suitability of genome sample sites (in m). The "dist_to_nearest_best_cell_forsamplecoords_m" column contains the calculated distance to the nearest pixel at the LGM with suitability equivalent to the mean suitability or all spatially filtered GBIF coordinates. -stability_summaries_2 tab. This tab contains a table that represents the change in habitat suitability between the last glacial maximum (LGM) Maxent model and the contemporary Maxent distribution Model for each sampling site. The first column "Longitude" is Longitude (decimal degrees), the second "Latitude" column is latitude (decimal degrees), the third "Region" column represent the geographic categorization of the site 1_CA-S = Southern California, 2_CA_N = Northern California, 3_OR_S = southern Oregon, 4_OR_N = Northern Oregon, 5-WA = Washington, 1_CA-S-HighCovGenome = Southern California for the high coverage genome sample, 4_OR-N-HighCovGenome = northern Oregon for the high coverage genome sample, SanJuanIsl-HighCovGenome = the San Juan Island high coverage genome sample), and the fourth "LGMsuitabilityloss" reflcting the the change in suitability between LGM and today (subtracted difference in pixel value between the contemporary model and the LGM maxent projection). Directories (compressed as tar.gz): ANGSD.tar.gz: genetic diversity results generated by the software ANGSD. -The main directory contains diversity statistics for populations (.pestG files), individual heterozygosities (all_indsfs.txt) and example code (ExampleANGSDscripts.txt) used to generate results. -The ExampleANGSDscripts.txt contains code lines that would be used to generate the site frequency spectra (SFS), the individual heterozygosity estimates, and for outputting population genetic diversity estimates from ANGSD. See ANGSD manual available at http://www.popgen.dk/angsd/index.php/ANGSD for detailed description of command options. -The .pestPG files are raw data exported by the ANGSD software with genetic diversity estimated per chromosome. Refer to the ANGSD manual available at http://www.popgen.dk/angsd/index.php/ANGSD for file formats and column information. -The all_indsfs.txt file is a summarizatin of ANGSD results for each genome used for Figure 2 of main text used for plotting results. The columns are "Homozygous" = estimated number of homozygous positions in the genome, "Heterozygous" = estimated number of heterozygous positions in the genome, "Ho"=estimated heterozygosity for the individual, "file" = the SFS file used for analysis, "sample" = sample identity, "site" = locality descriptor, "lat" = latitude in decimal degrees, "long" = longitude in decimal degrees, "elev" = elevation in meters, "region" = region (see Bvanc_ENM_StabilityCalcs.xlsx description), "color" = the color code used for plotting in R, "LGMsuitabilityloss" = the loss in habitat suitability for the locality taken from Bvanc_ENM_StabilityCalcs.xlsx. -Subdirectory "sfs" contains population (see Table Below) folded 1D site frequency spectra (SFS) for each region defined in manuscript, and 2D SFS (2D.sfs) for region pairs. Also included are the downsampled SFS used for Supplementary Figure S4 and analyses that considered evenly sampled populations (N=9 per region). Refer to the ANGSD manual available at http://www.popgen.dk/angsd/index.php/ANGSD for file formats. -Subdirectory "R_diversitystats" comntains R code and a summary of chromosome level population statistics from ANGSD (SMN_Angsd_divstats.txt) used for analysis amd plotting results from .pestPG files. The column "Chr" = the scaffold in the genome, "WinCenter" = position of the scaffold midpoint, "nSites" = the number of nucleotides, "Region" = regional sample grouping ("CA_south" = southern California, "Mid" = mid-range, and "North" = northern), "TajimaD" = mean Tajima's D, "Theta_pi_persite" = estimate of nucleotide diveristy per base pair for each scaffold. See ANGSD manual available at http://www.popgen.dk/angsd/index.php/ANGSD for detailed description of statistcs used in the calculation. -The "PCANGSD" subdirectory contains the covariances among samples from PCANGSD (allvancpcangsd.cov), with rows and columns being defined as in the bampops.txt file, alongside R code for plotting the principle components analysis. bampops.txt has the following columns in order: individual sample identifier, the sample locality identifier, latitude in decimal degrees, longitude in decimal degrees, population region grouping, color code used for plotting. PSMC.tar.gz: Files associated with PSMC runs. Most files reflect raw inputs and outputs of the software PSMC and are only of use for documentation purposes, please see the original publication (https://doi.org/10.1038/nature10231) and documentation (https://github.com/lh3/psmc) for details on preparation of input files and interpreting outputs if necessary. The file "SMC_Neplotxt.xlsx" described above has necessary summaries of these files for recreating figures and statistical analyses in the manuscript. All filenames utilize the sample identifiers (JDLXXXX identification number) described at the end of this file. Subdirectory "poplevel_psmcoutputs" contains inputs (psmcfa files) raw output of PSMC runs for Fig 1B (psmc and par files), graphical outputs (eps files) and .txt files for each sample detailing the converted output from psmc_plot.pl (converted using Theta=3Nu). Subdirectory "highcov_psmcoutputs" contains inputs and outputs for Fig 1C, including psmcfa, psmc, par outputs for actual data, conversion to real world times and Ne for each bootstrap replicate. Please refer to SMC_Neplots.xlsx for more human readable versions of the results. SDM.tar.gz: Files for niche stability analyses. The directory includes SDMcode.R = R code for generating species distribution modeling and niche stability analysis, requires a file coord.txt = with GBIF specimen coordinate data (Latitude and Longitude in decimal degrees) obtained from GBIF https://doi.org/10.15468/dl.cqq8rb. sites_for_mapping.txt is a text file with sample site information needed for suitability analyses, which includes Longitude and Latitude in decimal degrees and the RegionGroup containg the population grouping as defined in Bvanc_ENM_StabilityCalcs.xlsx. ## Sharing/access Information Sequences used for this study are available on NCBI Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) with Accessions in the table below. This table is also available in the paper's supplemental information Appendix 1, Table S2 SampleID SiteID RegionPopulationGroup SequencingCoverageCategory MeanSequencingDepth NCBIAccession JDL1245 CA14.2015 Southern: Southern CA High 81.94 SRX7986195 JDL3160 OR06.2016 Northern: Northern OR High 76.71 SAMN29751581 JDL3191 JS.SJ01.2018 San Juan Island High 79.66 SRX7270433 JDL1303 CA17.2015 Southern: Southern CA Moderate 19.31 SAMN29751564 JDL1305 CA17.2015 Southern: Southern CA Moderate 21.97 SAMN29751483 JDL1308 CA17.2015 Southern: Southern CA Moderate 22 SAMN29751504 JDL1293 CA16.2015 Southern: Southern CA Moderate 20.76 SAMN29751511 JDL1294 CA16.2015 Southern: Southern CA Moderate 18.42 SAMN29751541 JDL1388 CA24.2015 Southern: Southern CA Moderate 18 SAMN29751498 JDL1389 CA24.2015 Southern: Southern CA Moderate 18.56 SAMN29751546 JDL1396 CA24.2015 Southern: Southern CA Moderate 18.05 SAMN29751501 JDL1400 CA24.2015 Southern: Southern CA Moderate 21.97 SAMN29751527 JDL1406 CA25.2015 Southern: Southern CA Moderate 18.09 SAMN29751510 JDL1417 CA25.2015 Southern: Southern CA Moderate 23.78 SAMN29751503 JDL1418 CA25.2015 Southern: Southern CA Moderate 19.94 SAMN29751461 JDL1419 CA25.2015 Southern: Southern CA Moderate 28.45 SAMN29751492 JDL695 CA02.2014 Mid-range: Northern CA Moderate 25.6 SAMN29751522 JDL696 CA02.2014 Mid-range: Northern CA Moderate 20.77 SAMN29751550 JDL697 CA02.2014 Mid-range: Northern CA Moderate 27.94 SAMN29751496 JDL698 CA02.2014 Mid-range: Northern CA Moderate 20.49 SAMN29751573 JDL700 CA02.2014 Mid-range: Northern CA Moderate 21.13 SAMN29751547 JDL702 CA02.2014 Mid-range: Northern CA Moderate 29.49 SAMN29751484 JDL913 OR11.2014 Mid-range: Southern OR Moderate 22.53 SAMN29751464 JDL915 OR11.2014 Mid-range: Southern OR Moderate 25.07 SAMN29751514 JDL923 OR11.2014 Mid-range: Southern OR Moderate 25.49 SAMN29751566 JDL1539 OR09.2015 Northern: Northern OR Moderate 19.56 SAMN29751555 JDL3116 OR03.2016 Northern: Northern OR Moderate 26.44 SAMN29751576 JDL3117 OR03.2016 Northern: Northern OR Moderate 18.64 SAMN29751544 JDL3121 OR03.2016 Northern: Northern OR Moderate 21.78 SAMN29751532 JDL2921 WA10.2016 Northern: WA Moderate 20.45 SAMN29751476 JDL3049 WA15.2016 Northern: WA Moderate 22.44 SAMN29751466 JDL3060 WA16.2016 Northern: WA Moderate 22.98 SAMN29751519 JDL3071 WA16.2016 Northern: WA Moderate 19.82 SAMN29751557 JDL3072 WA16.2016 Northern: WA Moderate 20.42 SAMN29751507 JDL3077 WA17.2016 Northern: WA Moderate 25.94 SAMN29751554 JDL3078 WA17.2016 Northern: WA Moderate 19.45 SAMN29751488 JDL3085 WA17.2016 Northern: WA Moderate 20.62 SAMN29751536 JDL3018 WA12.2016 Northern: WA Moderate 32.7 SAMN29751469 JDL3022 WA12.2016 Northern: WA Moderate 22.35 SAMN29751490 JDL3030 WA12.2016 Northern: WA Moderate 22.5 SAMN29751568 JDL3031 WA12.2016 Northern: WA Moderate 22.34 SAMN29751520 JDL3001 WA11.2016 Northern: WA Moderate 24.54 SAMN29751460 JDL3006 WA11.2016 Northern: WA Moderate 25.82 SAMN29751559 JDL3035 WA14.2016 Northern: WA Moderate 27.15 SAMN29751470 JDL3046 WA14.2016 Northern: WA Moderate 22.47 SAMN29751558 Associated Site information for these localities is in the table below (reproduced from manuscript Appendix 1, Table S1) Site State Region: Population Group NumberGenomesPerSite SequencingCoverageCategory Latitude(decimal degrees) Longitude(decimal degrees) Elevation(m) CA17.2015 CA Southern: Southern CA 3 Moderate Coverage 37.217 -119.196 2223 CA16.2015 CA Southern: Southern CA 2 Moderate Coverage 37.288 -119.103 2747 CA24.2015 CA Southern: Southern CA 4 Moderate Coverage 38.333 -119.643 2902 CA25.2015 CA Southern: Southern CA 4 Moderate Coverage 39.230 -120.142 2361 CA02.2014 CA Mid-range: Northern CA 6 Moderate Coverage 41.362 -122.201 2377 OR11.2014 OR Mid-range: Southern OR 3 Moderate Coverage 42.067 -122.683 1754 OR09.2015 OR Northern: Northern OR 1 Moderate Coverage 45.256 -121.712 1047 OR03.2016 OR Northern: Northern OR 3 Moderate Coverage 45.334 -121.708 1835 WA10.2016 WA Northern: WA 1 Moderate Coverage 47.535 -120.678 1228 WA15.2016 WA Northern: WA 1 Moderate Coverage 47.801 -121.058 1390 WA16.2016 WA Northern: WA 3 Moderate Coverage 47.804 -121.068 1299 WA17.2016 WA Northern: WA 3 Moderate Coverage 47.809 -120.714 584 WA12.2016 WA Northern: WA 4 Moderate Coverage 48.282 -117.582 1549 WA11.2016 WA Northern: WA 2 Moderate Coverage 48.299 -117.413 840 WA14.2016 WA Northern: WA 2 Moderate Coverage 48.642 -120.404 1892 CA14.2015 CA Southern: Southern CA 1 High Coverage Genome 36.597 -118.736 2214 OR06.2016 OR Northern: Northern OR 1 High Coverage Genome 45.256 -121.712 1047 JS.SJ01.2018 WA San Juan Island 1 High Coverage Genome 48.535 -123.044 49