The genetic architecture of repeated local adaptation to climate in distantly related plants

Published Jul 24, 2024 on Dryad. https://doi.org/10.5061/dryad.15dv41p57

Abstract

Closely related species often use the same genes to adapt to similar environments. However, we know little about why such genes possess increased adaptive potential, and whether this is conserved across deeper evolutionary lineages. Adaptation to climate presents a natural laboratory to test these ideas, as even distantly related species must contend with similar stresses. Here, we re-analyse genomic data from thousands of individuals from 25 plant species as diverged as lodgepole pine and Arabidopsis (~300 My). We test for genetic repeatability based on within-species associations between allele frequencies in genes and variation in 21 climate variables. Our results demonstrate significant statistical evidence for genetic repeatability across deep time that is not expected under randomness, identifying a suite of 108 gene families (orthogroups) and gene functions that repeatedly drive local adaptation to climate. This set includes many orthogroups with well-known functions in abiotic stress response. Using gene co-expression networks to quantify pleiotropy, we find orthogroups with stronger evidence for repeatability exhibit greater network centrality and broader expression across tissues (i.e. higher pleiotropy), contrary to the “cost of complexity” theory. These gene families may be important in helping wild and crop species cope with future climate change, representing important candidates for future study.

Dryad DOI: https://doi.org/10.5061/dryad.15dv41p57

Author Contact Information

Sam Yeaman - sam.yeaman@ucalgary.ca

Dryad contents:

29 VCFs for SNP data across 25 species.
Sampling location and climate change variables for each VCF.
VCF metadata xlsx

VCF formats

All capture and whole-genome sequencing data is in standard VCF format. The sampling location CSV for each VCF denotes where each sample was taken. For the sake of grouping individuals into 'populations' for analyses, these were grouped according to samples with the same lat/lon sampling locations.

Pool-sequencing data is in the format of simplified SNP tables, where the allele frequency for each SNP for each pool is given in a tabular format. The sampling locations of pools are given in the same CSV format.

Full metadata for each VCF is provided in the vcf_metadata_dryad.xlsx file. This table links each VCF to the relevant species, the reference genome used, and the original study DOI, and includes a range of summary statistics number of SNPs, the number of individuals, and the number of unique sampling locations (unique lat/long). This is the same as Table S1 in the published manuscript, with an additional column denoting the filename for each VCF as it appears here.

Sampling location csv

There is a CSV file included for each VCF in the vcf_sampling_location_and_climate_change.zip archive. The Sample_name column corresponds to how samples are listed in VCF/SNP tables.

Altitude is given in metres.

The Pool_Size column is only relevant to pool-sequencing.

Taxon_name	Pop_ID	Sample_name	Country	Lat	Long	Alt	Author	Data_type
Arabidopsis lyrata	Grovudalen	Alyr_Norway_Grovudalen_T3-6	Norway	62.44	8.9	900	Savolainen	WGS
Arabidopsis lyrata	Grovudalen	Alyr_Norway_Grovudalen_T3-5	Norway	62.44	8.9	900	Savolainen	WGS
Arabidopsis lyrata	Grovudalen	Alyr_Norway_Grovudalen_T3-3	Norway	62.44	8.9	900	Savolainen	WGS

Climate change variables

Each VCF has a corresponding TSV txt file that includes two novel variables describing climate change variation among sampling locations. These variables are calculated by comparing precipitation (prec) and max temperature (tmax) variation, taken at a given sampling location, between observations taken across the 1960s and across the 2010s from worldclim (https://www.worldclim.org/data/worldclim21.html). The variable shown is an estimate of the effect size of the difference between the 1960s climate distribution and the 2010 climate distribution. It therefore seeks to represent sites where the current climate is more different as observed across the 2010s relative to the 1960s. The units of these effect sizes are not meaningful, rather their usefulness is in ranking locations from least to most climate change for the sake of performing genotype-environment associations.

The format of each file is a 3-column table. The pop column denotes the sampling location in the format Longitude_Latitude, which corresponds to how individuals in VCFs are grouped into 'populations'. The climate change estimates for precipitation and max temperature are then provided as prec_clim_change and tmax_clim_change respectively.

pop	prec_clim_change	tmax_clim_change
-73.56_41.73	0.301220736826739	0.0764795340578612
-73.98_41.3	0.33101986430034	0.0718555737120204
-73.99_40.44	0.279054883385785	0.0977646538138408
-73.99_41.32	0.33101986430034	0.0718555737120204
-74.02_42.66	0.26041207949836	0.0713417534589646

Dataset selection criteria

We selected 29 datasets covering 25 species from 21 studies (see README) for which sequence data were available with either whole genome shotgun data (WGS), pooled sequencing data (POOL), or sequence capture data (CAPTURE) (Table S1). Such datasets provide broad genomic representation and maximise the resolution of the number of genes for each species. We limited our study to datasets with at least five populations distributed along latitudinal, longitudinal, and/or altitudinal gradients. Here, the minimum number of populations was set in order to allow for a minimum baseline of statistical power for Kendall’s τ correlations between environment and allele frequency (see below). The list of included datasets is not exhaustive but rather was selected to achieve a desired number of approximately 20-30 species for which power to detect repeated adaptation has been demonstrated for the methods used here. Datasets also had to provide locality information, either for individual samples or groups of individuals. Species were selected to include both gymnosperms and angiosperms, setting the MRCA of all species sampled at approximately 300 million years ago.

SNP-calling

We used two SNP-calling pipelines depending on whether sequencing data came from individuals (WGS and CAPTURE) or from pools of individuals (POOL). These pipelines were necessarily different but were based on similar approaches to reduce bioinformatic discrepancies between the data types. For all data types, raw fastq data were retrieved from either the NCBI sequencing read archive (SRA) or the EBI European nucleotide archive (ENA). Accession codes for all data are provided in Table S1. The reference genome used for each species, or closely related species where genomes were not available, are also in Table S1. SNP pipelines were designed to balance computational time and low false-positive rates.

The pipeline for individual-based data was as follows, note that selfing and outcrossing species were processed using the same pipeline. Raw fastq files were cleaned and trimmed for adapter sequences using fastp (v0.20.1) before being aligned against the reference genome using bwa-mem (v0.7.17-r1188). BAM files were generated, sorted, and indexed using samtools (v1.16.1), skipping alignments with MAPQ < 10 (-q 10). We then collected quality metrics with Picard Tools (v2.26.3) based on alignment summary (CollectAlignmentSummaryMetrics), insert size metrics (CollectInsertSizeMetrics), and coverage (CollectWgsMetricsWithNonZeroCoverage). We then marked and removed duplicates using Picard’s MarkDuplicates and used AddOrReplaceReadGroups to amend read groups. In some cases, datasets split sequencing data from individual samples across multiple technical replicates, so we then merged BAM files within samples with Picard’s MergeSamFiles. We ran a realignment of the cleaned, merged BAM files by running the RealignerTargetCreator and IndelRealigner from the Genome Analysis Tool Kit (GATK v3.8) and repeated the aforementioned quality metrics on final BAM files. To identify single-nucleotide polymorphisms (SNPs), we generated genotype likelihoods using BCFtools’ mpileup by specifying a minimum mapping quality of 5 for an alignment to be used and retained additional annotation information such as allelic depth (AD). From there, individual pileups were converted into SNP VCFs by using the BCFtools’ call program. We set sample ploidy (-S) information to match the species’ known ploidy and called for genotype quality (GQ) to be reported while excluding any group samples (-G -) information. Finally, we filtered raw VCF files with VCFtools by removing sites with quality value below 30 (--minQ 30), Genotype Quality below 20 (--minGQ 20), and minimum read depth of 5 (--minDP 5), before finally retaining only biallelic (--max-alleles 2) genotypes present in more than 70% of individuals (--max-missing 0.7). For downstream analyses, we performed additional filtering on the basis of minor allele frequency (maf) and minor allele count (mac), retaining only sites with maf > 0.05 and mac > 5, whichever was most stringent.

Pooled data was processed using a similarly structured workflow. Raw fastq files were cleaned and trimmed with fastp and aligned to references with bwa-mem using the additional flag to mark shorter split hits as secondary (-M). BAM files were generated, sorted, and indexed using samtools, skipping alignments with MAPQ < 20 (-q 20) and bedfiles were generated from indexed BAM files using BEDtools (v2.27.1). Duplicates were then marked and removed with MarkDuplicates before indel realignment was performed with Picard’s RealignerTargetCreator and IndelRealigner. SNP-calling was then performed using mpileup followed by VarScan’s (v.2.4.2) mpileup2cns program. Variants were called on the basis of minimum read depth of 8 (--min-coverage 8), a p-value threshold of 0.05 (--p-value 0.05), a minimum frequency for calling homozygotes of 80% (--min-freq-for-hom 0.8), ignoring variants with >90% support on one strand (--strand-filter 1), minimum base quality at position of 20 (--min-avg-qual 20) and a minimum variant allele frequency of 0 (--min-var-freq = 0). To extract allele frequencies, pooled SNPs were converted to SNP tables using GATK (v4.1.0.0) VariantsToTable, splitting multi-allelic sites across multiple rows (--split-multi-allelic) and extracting the AF field. Final SNPTables for POOL data were filtered for indels, retaining only biallelic sites, and a minor allele frequency cutoff of 0.05.

The genetic architecture of repeated local adaptation to climate in distantly related plants

Data files

Abstract

README: The genetic architecture of repeated local adaptation to climate in distantly related plants

Author Contact Information

Dryad contents:

VCF formats

Sampling location csv

Climate change variables

Methods

Works referencing this dataset