Trans-acting genotypes drive mRNA expression affecting metabolic and thermal tolerance traits
Data files
Jun 28, 2023 version files 160.24 MB
-
2023_genotypic_drivers_associations.zip
-
README.md
Abstract
Evolutionary processes driving physiological trait variation depend on the underlying genomic mechanisms. Evolution of these mechanisms depends on whether traits are genetically complex (involving many genes) and how gene expression that impacts the traits is converted to phenotype. Yet, genomic mechanisms that impact physiological traits are diverse and context-dependent (e.g., vary by environment or among tissues), making them difficult to discern. Here we examine the relationships between genotype, mRNA expression, and physiological traits to discern the genetic complexity and whether the gene expression affecting the physiological traits is primarily cis or trans-acting. We use low-coverage whole genome sequencing and tissue-specific mRNA expression among individuals to identify polymorphisms directly associated with physiological traits and expressed quantitative trait loci (eQTL) driving variation in six temperature-specific physiological traits (standard metabolic rate, thermal tolerance, and four substrate-specific cardiac metabolic rates). Not surprisingly, there were few, only five, SNPs directly associated with physiological traits. Yet, by focusing on a select set of mRNAs belonging to co-expression modules that explain up to 82% of temperature specific (12°C or 28°C) metabolism and thermal tolerance, we identified hundreds of significant eQTL for mRNA whose expression affects physiological traits. Surprisingly, most eQTL (97.4% for heart and 96.7% for brain) of eQTL were trans-acting. This could be due to higher effect size or greater importance of trans versus cis-acting eQTLs for mRNAs that are central to co-expression modules. That is, we may have enhanced the identification of trans-acting factors by looking for SNPs associated with mRNAs in co-expression modules that are known to be correlated with the expression of 10s or 100s of other genes, and thus have identified eQTLs with widespread effects on broad gene expression patterns. Overall, these data indicate that the genomic mechanism driving physiological variation across environments is driven by trans-acting tissue-specific mRNA expression.
Methods
1. Sample collection
Fin clips were taken from adult F. heteroclitus collected along the central coast of New Jersey, USA near the Oyster Creek Nuclear Generating Station (OCNGS), which produces a thermal effluent that locally heats the water. Three populations were sampled: 1) north reference (N.Ref; 39°52’28.000 N, 74°08’19.000 W), 2) south reference (S.Ref; 39°47’04.000 N, 74°11’07.000 W) and 3) a central site located between the southern and northern reference that is within the OCNGS thermal effluent (TE; 39°48’33.000 N, 74°10’51.000 W). The TE population used here differs by 4°C in habitat temperature from the two references populations (average summer high tide temperature 28°C N.Ref and S.Ref, and 32°C for TE) but is otherwise ecologically similar (Drown et al. 2021) (Dayan et al. 2019). Fin clips were collected in fall 2015 (F15), fall 2018 (F18), spring 2019 (S19), fall 2019 (F19), and fall 2020 (F20) and stored in GuHCl buffer. DNA was extracted using carboxyl-coated magnetic beads. The DNA quality was assayed using gel electrophoresis and spectrophotometry to ensure high molecular weight and low contamination.
2. Library preparation
The analysis presented here uses a subset of samples that were part of a larger sequencing run. A total of 1,121 individuals were sequenced (Table S3). All samples were quantified in triplicate using spectrophotometry and normalized to 100ng for sequencing library preparation. The whole genome sequencing library was prepared using a tagmentation approach. Briefly, DNA was digested with an in-house purified Tn5 transposase (as in (Picelli et al. 2014)) loaded with partial adapter sequences. After tagmentation, the fragmented DNA was amplified using barcoded primers such that each individual sample would contain a unique i7 and a plate level (1 per 96 samples) i5 barcode. This allowed for unique dual indexing of up to 768 individuals. After barcoding, samples were combined into two pools (560 samples each) and each pool amplified and then sequenced on a single lane of Illumina HiSeq 3000. These single sequencing lanes were assessed to determine coverage balance among samples, and the same libraries were sequenced across an additional four lanes each. For all sequencing runs, a greater relative amount of library for F18 samples was added to the pool to achieve higher coverage because whole animal, tissue, and molecular (mRNA expression) level phenotypic data was available for these individuals.
3. Raw sequence analysis
We followed best practices for low coverage whole genome sequencing (lc-WGS) data processing as in (Lou et al. 2021). Briefly, adapter sequences and low-quality bases were trimmed using Trimmomatic (v0.39) (Bolger et al. 2014). Flash (v1.2.11)(Magoč and Salzberg 2011) was used to combine overlapping reads and to parse singletons and paired reads. Singletons and paired reads were mapped separately using BWA mem (v0.7.17) and resulting sam files converted to bam files using samtools (v1.3.1) (Danecek et al. 2021). The first and second sequencing run were processed separately until BAM files were produced and found to be of similar quality assessed by comparing total percentage of mapped reads and levels of dually mapped reads before combining for the remaining file processing steps. Picard (v2.26.4) was used to add read group information, which is needed for duplicate marking downstream. After combining all mapped reads for a single individual, BAM files were further filtered for mapping quality using samtools and overlapping reads softclipped using bamutil (v1.0.15). Finally, Picard (v2.26.4) was used to mark duplicate reads.
4. Variant calling
Two variant calling pipelines were used. First, Freebayes (v1.0.2) was used to call variants and the resulting VCF file was filtered using VCFtools (v0.1.16). VCFtools filters were to include only biallelic sites, >5% minor allele frequency, <5% missingness per individual, <10% missingness per site. This resulted in 1,406,282 high-probability variant sites. ANGSD (v0.935) (Korneliussen et al. 2014), which is designed for use with lcWGS data, was then used to obtain a genotype likelihood beagle file containing the previously identified high-probability variant sites from Freebayes and VCFtools. This approach is similar to other studies using lcWGS data where variant calling may be sensitive to specific tool use.
5. Association testing
We interrogated potential associations between 1,406,282 SNPs and eight phenotypes: six physiological traits: SMR (standard metabolic rate), CTmax (critical thermal maximum), four substrate-specific MRcardiac (glucose, fatty acids, LKA, and endogenous), and two measures of mRNA expression: single mRNA and multivariate mRNA expression. Multivariate mRNA expression used weighted gene co-expression network analysis (WGCNA, (Langfelder and Horvath 2008)) to identify co-expression mRNAs and group them into modules (MEs). Single mRNA expression was limited to the top 10 mRNAs from each WGCNA co-expression module. There were 80 MEs: 39 heart modules and 41 brain modules (Table S1), 390 single heart mRNAs, and 410 single brain mRNAs (Drown et al. 2022).
All association results are reported from ANGSD score test (-doAsso 2) and after multiple test correction of p-values (Benjamini and Hochberg 1995). Physiological traits and mRNA expression were measured at two acclimation temperatures (12°C and 28°C). Acclimation temperature was included as a variable in the association tests and yielded similar results when compared to within temperature associations. Sample size varied among association tests based on phenotypic and genotypic data availability and are reported in Table S2.
Usage notes
Associated code can be found: https://github.com/mxd1288/Genotypic_drivers.git