Evidence for admixture and rapid evolution during glacial climate change in an alpine specialist
Data files
Apr 23, 2024 version files 11.39 GB
Abstract
The pace of current climate change is expected to be problematic for alpine flora and fauna, as their adaptive capacity may be limited by small population size. Yet despite substantial genetic drift following post-glacial recolonization of alpine habitats, alpine species are notable for their success in surviving highly heterogeneous environments. Population genomic analyses demonstrating how alpine species have adapted to novel environments with limited genetic diversity remain rare, yet are important in understanding the potential for species to respond to contemporary climate change. In this study, we explored the evolutionary history of alpine ground beetles in the Nebria ingens complex, including the demographic and adaptive changes that followed the last glacier retreat. We first tested alternative models of evolutionary divergence in the species complex. Using the millions of genome-wide SNP markers from hundreds of beetles, we found evidence that the Nebria ingens complex has been formed by past admixture of lineages responding to glacial cycles. Recolonization of alpine sites involved a distributional range shift to higher elevation, which was accompanied by a reduction in suitable habitat and the emergence of complex spatial genetic structure. We tested several possible genetic pathways involved in the adaptation to heterogeneous local environments using whole genome scan and genotype-environment association approach. From the identified genes, we found enriched functions broadly associated with abiotic stress responses, with strong evidence for adaptation to hypoxia-related pathways. The results demonstrate that despite rapid environmental changes, alpine beetles in the N. ingens complex have shown rapid physiological evolution.
README: README: Evidence for admixture and rapid evolution during glacial climate change in an alpine specialist
https://doi.org/10.5061/dryad.0gb5mkm6j
This readme file contains catalog and instructions about how we prepared and analyzed the data in our study entitled "Evidence for admixture and rapid evolution during glacial climate change in an alpine specialist". In this study, we collected genomic data from 382 Nebria ingens complex beetles throughout the Sierra Nevada in California. We first reconstructed the evolutionary history of this species complex including the formation of admixture population and more recent post-glacial recolonization to the high elevation habitats. We also performed genome scan of selection and genotype environment association to detect molecular adaptation and genes associated with the alpine environments.
This readme file includes two parts: data and scripts. Data section shows a list of files in this dryad repository that are essential to repeat the analyses, and scripts are the packed commands to perform the analyses. All raw data files can be opened and viewed with a standard text editor.
Data
1. Genotype_data
- vqsr_raw_snps.vcf.gz: The raw vcf output of GATK variant calling pipeline from 382 N. ingens complex samples using Variant Quality Score Recalibration (VQSR) filters.
- final_maf005.vcf.gz: Imputed genotype SNP data by filtering out loci with minor allele frequency less than 5% from the raw vqsr output SNPs (vqsr_raw_snps.vcf.gz). This dataset was used in Genome Scan and Genotype-environment association analyses.
- lcs_maf005_thin.geno: The geno object (created and used in R) by thining SNP loci in a span of 5kbp region with genotype correlation coefficients higher than 0.5 from the filtered SNP dataset (final_maf005.vcf.gz).
2. Population Structure and demographic history
- admix_prior.out: prior demographic data generated from msprime simulation under admixture model. There are 1 million simulations (lines) and six parameters including: "seed", "Nc" (population size of Conness site), "Ns" (population size of Selden site), "Na" (population size of Army site), "NE" (ancestral population size), "t1" (divergence time of Conness and Army populations) , "cs" (gene contribution rate from Conness to Selden), "as" (gene contribution rate from Army to Selden), "t2 (time of admixture event)"
- indep_prior.out: prior demographic data generated from msprime simulation under independent lineage model. There are 1 million simulations (lines) and six parameters same to those in "admix_prior.out".
- admix_sum.out: summary statistics calculated from the priors in "admix_prior.out". There are 1 million sets of summary statistics calculated from corresponding line in "admix_prior.out". Four columns of summary statistics are: "f3" (f3 statistics), "dcs" (dXY between Conness and Selden), "dca" (dXY between Conness and Army), and "das" (dXY between Army and Selden).
- indep_sum.out: summary statistics calculated from the priors in "indep_prior.out". There are 1 million sets of summary statistics calculated from corresponding line in "indep_prior.out". Four columns of summary statistics are same to "admix_sum.out".
- dxy_obs.txt: the summary statistics of dXY calculated from the observed populations.
- F3_stat: the summary statistics of f3 statistics calculated from the observed populations.
3. Selection
- RAiSD: the outputs of RAiSD analysis, including mu statistics and final list of candidate genes.
- X_all_mu_stat: all the mu-statistics calculated from the population X (21 populations are in the folder)
- X_raisd_gene_list: the intersect genes of the X population based on the outliers (top 0.5%) of mu-statistics (21 populations are in the folder).
- raisd_final_genes: the list of final selected genes from the 21 populations. Information includes 1) gene ID, 2) corresponding NCBI protein entrance, and 3) number of population (among the 21) with the given gene present in the intersect gene list (gene in X_raisd_gene_list).
- OmegaPlus: the outputs of OmegaPlus analysis
- omegaPlus_out_final: all the omega-statistics calculated from OmegaPlus. Three columns are in the file including 1) contig ID, 2) site ID, and 3) omega-statistics
- omegaplus_gene_final_list: The list of intersect genes from the omega-statistics outlier (top 0.5%) in omegaPlus_out_final. Five columns are: 1) contig ID, 2) site ID, 3) omega_statistics, 4) gene ID, and 5) corresponding NCBI protein entrance.
- LFMM: the inputs data including the climatic variable data in ASCII Raster format (.asc) and the extracted values for each sites in "env_data.csv", genotype data stored in "lcs_final_maf005.lfmm*", and output data of LFMM2 analysis including three BH005 files.
- lcs_final_maf005.lfmm: genotype data used in the LFMM analysis, which is converted from "final_maf005.vcf.gz" .
- X_InMo.asc: The five climatic variables that were tested for running LFMM analysis, X includes elev for elevation, ppt for precipitation, tmax for temperature of the hottest month of the year, tmin for temperature of the coldest month of the year, and tmean for mean temperature of the year.
- May.asc: mean of snow water equivalent data in May for the span of 31 years (1985-2015) in the Sierra Nevada.
- Sep.asc: mean of snow water equivalent data in September for the span of 31 years in the Sierra Nevada.
- lfmm_elev_BH005: The output of LFMM analysis for variable of elevation. Data includes 1) contig ID, 2) site ID, 3) p-value of lasso regression, 4) p-value of ridge regression, and 5) mean p-value.
- lfmm_ppt_BH005: The output of LFMM analysis for variable of precipitation. Data type same to "lfmm_elev_BH005".
- lfmm_swe_BH005: The output of LFMM analysis for variable of snow water equivalent data in May. Data type same to "lfmm_elev_BH005".
Scripts
1. Variant Calling
- bbudk.sh: decontaminating the reads in fastq format (The fastq files we uploaded to NCBI are from this step so if you downloaded them from NCBI, no need to rerun this part)
- bwa.sh: map the fastq reads, convert to bam, extract mapped reads, and sort with samtools.
- MarkDuplicates.sh: mark the duplicates for the sorted bam files using MarkDuplicates (Picard)
- haplotypecaller_lcs_run1.sh: a series of steps to run GATK haplotypecaller for generating GVCF files for each sample
- GenomicsDBImport_lcs_run1.sh: a series of steps to run GATK GenomicsDBImport for generating cohort callset for each contig, using GVCF files from
haplotypecaller_lcs_run1.sh
, and merge them into a single raw vcf file call "lcs_run1.vcf.gz" using MergeVcfs (Picard). - hard-filtering_lcs.sh: apply hard-filters to clean up the raw vcf (lcs_run1.vcf.gz). The thresholds were applied based on the visualizations of distribution of each parameters in supplementary file 1: Figure S3.
- BQSR.sh: a series of steps to run BQSR to generate the new bam files for each sample.
- haplotypecaller_BQSR.sh: a series of steps to run GATK haplotypecaller for generating GVCF files for each BQSR sample (from
BQSR.sh
), followed by "GenomicsDBImport" and "MergeVcfs" to create the final BQSR version vcf. the top 50 samples of this dataset will be, together with GBS callset, used to train VQSR. - haplotypecaller_GBS.sh: a series of steps to run GATK haplotypecaller for generating GVCF files for each of GBS sample, followed by "GenomicsDBImport" and "MergeVcfs" to create the raw vcf of GBS reads.
- haplotypecaller_VQSR.sh: a series of steps to run GATK haplotypecaller for generating GVCF files and apply VQSR filters to finalize the callset.
- callerror.sh: a shell script to calculate the genotype mismatch between the deep-coverage and low-coverage samples of the benchmark sample (SDS06-306A).
- LD_decay.sh: calculate and plot LD decay for each contig larger than 10kbp.
2. Population Structure and demographic history
- sNMF.r: sNMF analysis with SNP data
- pca.r: pca analysis for the population structure with SNP data
- fst_dxy_pi.r: calculate population pairwise FST, dXY, and population genetic diversity (pi).
- msp_parallel.sh: the bash shell script to run msprime in parallel.It will process two python scripts, i.e., "model1.py", and "model2.py" to generate simulation data including summary statistics and parameters.
- ingensabc.r: R script to run ABC (Approximate Bayesian Computation) to get the posteriors of the parameters based on rejection algorithm.
3. Selection
- raisd.sh: the bash shell script running RAiSD for each of the selected population.
- omegaplus.sh: the bash shell script running OmegaPlus.
- swe_extract.sh: the bash shell script extracting the September mean SWE (snow water equivalent) from the database.
- raster_mean.R: get the mean of SWE for the localities of the populations.
- ingens_LFMM.r: R script running GEA using LFMM2.
- GO_enrichment.r: R script running GO enrichment analyses from the variable candidate genes.