Data from: Evidence of intraspecific adaptive variation in the American pika (Ochotona princeps) on a continental scale using a target enrichment and mitochondrial genome skimming approach
Data files
Oct 22, 2024 version files (83.08 MB total):
- 118_filter1.recode.vcf.gz (71.71 MB)
- 120_80LDpruned.ped (8.36 MB)
- 120_FinalAnnotate_80LDpruned.vcf.gz (613.12 KB)
- Extra_R_inputs.zip (93.72 KB)
- linux_code.txt (13.94 KB)
- mt_118_dp3.recode.vcf.gz (2.20 MB)
- mtDNA_geneAlignments.zip (50.46 KB)
- R_code.R (29.22 KB)
- README.md (5.44 KB)
Abstract
Montane landscapes present an array of abiotic challenges that drive adaptive evolution among organisms. These adaptations can promote habitat specialization, which may heighten the risk of extirpation from environmental change. For example, higher metabolic rates in an endothermic species may contribute to heightened cold tolerance, while simultaneously limiting heat tolerance. Here, using the climate-sensitive American pika (Ochotona princeps), we test for evidence of intraspecific adaptive variation among environmental gradients across the Intermountain West of North America. We leveraged results from previous studies on pika adaptation to generate a custom nuclear target enrichment design to sequence several hundred candidate genes related to cold, hypoxia, and dietary detoxification. We also applied a ‘genome skimming’ approach to sequence mitochondrial DNA. Using genotype-environment association tests, we identified rare genomic variants associated with elevation and temperature variation among populations. Among mitochondrial genes, we identified intraspecific variation in selective signals and significant changes to the amino acid property equilibrium constant, which may relate to electron transport chain efficiency. These results illustrate a complex dynamic of adaptive variation among O. princeps where lineages and populations have adapted to unique regional conditions. Some of the clearest signals of selection were in a genetic lineage that includes pikas of the Great Basin region, which is also where recent localized extirpations have taken place, highlighting the risk of losing adaptive alleles during environmental change.
https://doi.org/10.5061/dryad.f4qrfj73x
This dataset contains vcf files generated from target enrichment of nuclear candidate genes and genome skimming of mitochondrial genes from six American pika lineages across the species range. The mitochondrial vcf was used to generate consensus fasta files, which are also included. Finally, the Linux and R code used to process and analyze these data is included, along with additional input files for the R code.
Description of the data and file structure
Fasta files in mtDNA_geneAlignments.zip are consensus fasta files of the 13 mitochondrial protein coding genes from 118 O. princeps samples. These were generated from a vcf file and the OchPri4.0 reference genome using bcftools consensus. Missing data are coded as “N”. For additional information on fasta file format, see: https://www.ncbi.nlm.nih.gov/genbank/fastaformat
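
Because missing data in the consensus fasta files are coded as “N”, a quick per-sample check of missingness can be useful before running selection tests. The following is a minimal sketch, not part of the deposited code: it uses only the Python standard library, and the record names (`sample_A`, `sample_B`) and sequences are hypothetical stand-ins, since the file names inside mtDNA_geneAlignments.zip are not listed here.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record name: sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in records.items()}


def missing_fraction(seq: str) -> float:
    """Fraction of positions coded as N (missing data)."""
    return seq.count("N") / len(seq) if seq else 0.0


# Hypothetical in-memory example; for a real file, pass an open file handle,
# e.g. read_fasta(open("some_gene.fasta")) with a file name of your choosing.
example = [">sample_A", "ATGNNC", "TTA", ">sample_B", "ATGACC", "TTA"]
seqs = read_fasta(example)
for sample, seq in seqs.items():
    print(sample, len(seq), round(missing_fraction(seq), 3))
```

Any open text file handle can be passed to `read_fasta` in place of the example list.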
Three vcf files are included. For information about vcf file format, see: https://samtools.github.io/hts-specs/VCFv4.3.pdf
120_FinalAnnotate_80LDpruned.vcf.gz consists of 17407 nuclear bi-allelic SNPs, annotated by gene ID, from 120 O. princeps samples, which were used for population structure and nuclear selection tests.
mt_118_dp3.recode.vcf.gz is the vcf used in generating the mitochondrial fasta files.
118_filter1.recode.vcf.gz is the vcf used in generating a SNP phylogenetic tree, which was used in downstream mitochondrial selection tests with the fasta files.
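
As a sanity check on the vcf files described above, the sample and SNP counts can be tallied directly from the text records. The sketch below is an illustration under stated assumptions rather than the authors' pipeline: it relies only on the standard VCF column layout (nine fixed columns followed by one column per sample) and Python's `gzip` module, and the function names are hypothetical. If the description above holds, 120_FinalAnnotate_80LDpruned.vcf.gz would be expected to report 120 samples and 17407 bi-allelic SNPs.

```python
import gzip
from typing import Iterable, Tuple


def summarize_vcf(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (n_samples, n_records, n_biallelic_snps) for VCF text lines."""
    n_samples = n_records = n_biallelic_snps = 0
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # meta-information or blank line
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed (CHROM ... FORMAT); the rest are samples.
            n_samples = max(len(fields) - 9, 0)
            continue
        n_records += 1
        ref, alt = fields[3], fields[4]
        # Bi-allelic SNP: single-base REF and exactly one single-base ALT.
        if len(ref) == 1 and len(alt) == 1 and alt not in {".", "*"}:
            n_biallelic_snps += 1
    return n_samples, n_records, n_biallelic_snps


def summarize_vcf_gz(path: str) -> Tuple[int, int, int]:
    """Same summary for a gzip-compressed vcf on disk."""
    with gzip.open(path, "rt") as handle:
        return summarize_vcf(handle)


# Hypothetical three-record example with two samples.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                    "FILTER", "INFO", "FORMAT", "s1", "s2"])
demo = [
    "##fileformat=VCFv4.3",
    header,
    "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/1", "1/1"]),
    "\t".join(["chr1", "200", ".", "A", "G,T", ".", "PASS", ".", "GT", "0/1", "1/2"]),
    "\t".join(["chr1", "300", ".", "AT", "A", ".", "PASS", ".", "GT", "0/0", "0/1"]),
]
summary = summarize_vcf(demo)
print(summary)  # (2, 3, 1)
```

For the deposited files, `summarize_vcf_gz("120_FinalAnnotate_80LDpruned.vcf.gz")` would apply the same tally, assuming the file is in the working directory.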
120_80LDpruned.ped is a ped file, which was the input for snmf population structuring and imputation for downstream analyses. For additional information on ped file format, see: https://zzz.bwh.harvard.edu/plink/data.shtml#ped
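
For readers unfamiliar with the ped layout, the sketch below shows how one line can be split into sample metadata and genotypes. It assumes the standard PLINK layout (six leading columns: family ID, individual ID, paternal ID, maternal ID, sex, phenotype, followed by two allele columns per SNP, with “0” as the missing allele code); whether 120_80LDpruned.ped follows this layout exactly is an assumption, and the sample line is hypothetical.

```python
from typing import List, Tuple

Genotype = Tuple[str, str]


def parse_ped_line(line: str) -> Tuple[str, List[Genotype]]:
    """Split one PLINK ped line into (individual ID, list of allele pairs)."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus 2 alleles per SNP")
    individual_id = fields[1]
    alleles = fields[6:]
    # Pair consecutive allele columns: (a1, a2) for SNP 1, then SNP 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return individual_id, genotypes


def missing_rate(genotypes: List[Genotype], missing: str = "0") -> float:
    """Fraction of SNPs with at least one allele coded as missing."""
    if not genotypes:
        return 0.0
    n_missing = sum(1 for a, b in genotypes if a == missing or b == missing)
    return n_missing / len(genotypes)


# Hypothetical line: one sample genotyped at four SNPs, one of them missing.
demo_line = "FAM1 S1 0 0 0 -9 A A A G 0 0 C C"
sample_id, genos = parse_ped_line(demo_line)
print(sample_id, len(genos), missing_rate(genos))  # S1 4 0.25
```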
Two code files are included:
linux_code.txt consists of code for raw data processing and analyses performed in bash.
R_code.R consists of code for analyses performed in R.
Additional R input files are in Extra_R_inputs.zip:
- 120_FinalAnnotate_SNPnames.txt was used to annotate the SNP matrix with SNP names, which are labeled as Chromosome_Position_Consequence.
- 120_names.csv was used to annotate the SNP matrix with sample names.
- lfmm_env.csv contains the environmental data and lineage names used as inputs for nuclear analyses, where ID1 is sample ID, lineage is snmf-derived lineage, elevation is elevation in meters, and MCMT is mean coldest month temperature in degrees C.

Additional folders within Extra_R_inputs.zip are as follows:
- BayeScan:
- bayescan_pop.txt was used for generating the BayeScan input file. ID1 refers to sample ID and lineage refers to snmf-derived lineage.
- positive_FinalAnnotation.csv has significant BayeScan results (see the BayeScan user manual for an explanation of variables: https://cmpg.unibe.ch/software/BayeScan/files/BayeScan2.1_manual.pdf). The final variable, SNP, contains annotated SNP names derived from the SNP genotype matrix in R.
- bayescan100pos_FinalAnnotation_forpca.csv is the input file for generating the pca of significant BayeScan loci. Sample refers to sample ID and lineage refers to snmf-derived lineage. The rest of the fields are genotypes (0,1,2) at each SNP, and SNP names have placeholders (“_”) removed for compatibility with the R package.
- climNA: climNA_input1_test.csv and climNA_input2.csv are input files for climNA; climNA_mt_input1.csv and climNA_mt_input2.csv are input files for climNA for just the samples used in mtDNA analyses. In each file, ID1 refers to sample ID, ID2 refers to subspecies, lat refers to latitude, lon refers to longitude, and el refers to elevation in meters.
- lfmm: LFMMintersect_candlist_FinalAnnotate_forpca.csv, LFMMlasso_candlist_FinalAnnotate_forpca.csv, and LFMMridge_candlist_FinalAnnotate_forpca.csv are input files for running pca on significant loci from LFMM. For each file, sample refers to sample ID and lineage refers to snmf-derived lineage. The rest of the fields are genotypes (0,1,2) at each SNP and SNP names have placeholders (“_”) removed for compatibility with R package.
- TreeSAAP:
- COOH_env.csv is an input file for analyzing TreeSAAP results. Sample refers to sample ID, MAT refers to mean annual temperature (degrees C), MWMT refers to mean warmest month temperature (degrees C), MCMT refers to mean coldest month temperature (degrees C), Elevation (m) is elevation in meters, Latitude and Longitude are provided, Lineage refers to snmf-derived lineage, location refers to US state, and Sum_change refers to the sum change of pK’ for all mitochondrial protein-coding genes, which derives from Table S11 in the Molecular Ecology publication.
- COOH_env_sitename.csv is an input file for analyzing additional aspects of TreeSAAP results. Variables are the same as in COOH_env.csv, except that Location now contains country, state, and location description, ComplexI_Sum is the sum change in pK’ for OXPHOS Complex I genes, and ComplexII_Sum is the sum change in pK’ for OXPHOS Complex II genes.
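
The *_forpca.csv files in the BayeScan and lfmm folders share one layout: a sample column, a lineage column, and one 0/1/2 genotype column per SNP. As an illustration of how that layout can be summarized before running pca, the sketch below computes the frequency of the counted allele per lineage. It is not taken from R_code.R; the exact column header spellings (`sample`, `lineage`), the lineage names, and the treatment of any non-0/1/2 value as missing are assumptions, and the example table is hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def lineage_allele_freqs(handle: TextIO,
                         sample_col: str = "sample",
                         lineage_col: str = "lineage") -> Dict[str, Dict[str, float]]:
    """Per-lineage frequency of the counted allele from 0/1/2 genotypes."""
    allele_sums = defaultdict(lambda: defaultdict(int))
    n_called = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(handle):
        lineage = row[lineage_col]
        for column, value in row.items():
            if column in (sample_col, lineage_col):
                continue
            # Anything other than 0, 1, or 2 (e.g. NA) is treated as missing.
            if value in ("0", "1", "2"):
                allele_sums[lineage][column] += int(value)
                n_called[lineage][column] += 1
    # Each called diploid genotype contributes two allele copies.
    return {
        lineage: {snp: allele_sums[lineage][snp] / (2 * n)
                  for snp, n in snps.items()}
        for lineage, snps in n_called.items()
    }


# Hypothetical table with two lineages and two SNPs.
demo_csv = io.StringIO(
    "sample,lineage,snpA,snpB\n"
    "s1,North,0,2\n"
    "s2,North,1,NA\n"
    "s3,South,2,2\n"
)
freqs = lineage_allele_freqs(demo_csv)
print(freqs["North"]["snpA"], freqs["North"]["snpB"])  # 0.25 1.0
```

If a file uses different header spellings (for example `Sample`), the `sample_col` and `lineage_col` arguments can be set accordingly.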