Gene expression in male and female sticklebacks from populations with convergent and divergent throat coloration
McKinnon, Jeffrey; Newsome, W. Burns; Balakrishnan, Christopher (2022), Gene expression in male and female sticklebacks from populations with convergent and divergent throat coloration, Dryad, Dataset, https://doi.org/10.5061/dryad.sn02v6x64
Understanding of genetic mechanisms underlying variation in sexual dichromatism remains limited, especially for carotenoid-based colors. We addressed this knowledge gap in a gene expression study with threespine stickleback. We compared male and female throat tissues across five populations, including two in which female red coloration has evolved convergently. We found that the expression of individual genes, gene ontologies, and coexpression networks associated with red female color within a population differed between California and British Columbia populations, suggesting differences in underlying mechanisms. Comparing females from each of these populations to females from populations dominated by dull females, we again found extensive expression differences. For each population, genes and networks associated with female red color showed the same patterns for males only inconsistently. The functional roles of genes showing correlated expression with female color are unclear within populations, whereas genes highlighted through inter-population comparisons include some previously suggested to function in carotenoid pathways. Among these, the most consistent patterns involved TTC39B (Tetratricopeptide Repeat Domain 39B), which is within a known red coloration QTL in stickleback and implicated in red coloration in other taxa.
Overview of analyses and contrasts
In our main analyses, we used DESeq2 to: identify genes whose expression was correlated with female red coloration within populations; test whether expression of the same genes was correlated with female color in different populations; ask if the genes showing expression correlated with female color also showed differential expression in analyses including both males and females. Because some red-associated genes may differ in expression mainly between populations, we conducted complementary analyses comparing red females with dull females of different populations in which dull females predominate. To address similar questions at a systems level, we used WGCNA to ask if gene networks showing color-correlated differential expression, in each population possessing red females, were present in the other population; we conducted an analogous analysis of the two sexes. In order to identify candidate genes for further study, we asked if any of the genes consistently associated with red coloration had been linked with coloration in other species, and if any were associated with stickleback coloration QTL detected in one of the same populations (Yong et al. 2016). As a first step before the main analyses, we characterized the overall structure of the data and broad patterns of variation, by population and sex, with a principal component analysis of gene expression.
Fish were collected from populations in British Columbia, Canada and California, USA. Populations surveyed are Little Campbell Anadromous (49°00'58 N, 122°46'46 W; henceforth “Anadromous”), Little Campbell Stream-resident (49°00'43 N, 122°37'30 W), Salmon River (49°05'29 N, 122°29'34 W), Bonsall Creek drainage (48°51'05 N, 123°42'40 W), and Matadero Creek (the only California population: 37°23'36 N, 122°9'46 W). Using minnow traps and seine/dip nets, fish were collected from the field in spring 2014 and transported to East Carolina University, North Carolina. British Columbia stickleback were collected in late April and California stickleback were collected in late June. Fish from Little Campbell Stream-resident (henceforth LC Stream) and Matadero populations, both possessing red-females, were sampled so as to ensure high diversity in female throat coloration, with sufficient numbers of both red and dull females (which were categorized as red or dull for some analyses). All fish were held for two weeks to allow acclimation to the lab. They were kept in 102-liter tanks at approximately 15-20 fish per tank under natural-spectrum mimicking fluorescent light (Lumichrome® Full Spectrum Plus, Lumiram Electric Co., Larchmont, NY, USA) and photoperiod at 17–20°C. Fish were fed bloodworms (chironomid larvae) and brine shrimp twice per day.
RNA extraction, library preparation, sequencing, and generation of read counts
Red coloration was assessed following protocols described in Yong et al. (2013) using an Ocean Optics Maya2000 Pro spectrophotometer. Prior to collection of spectrometry data, subject fish were euthanized by MS-222 overdose. Immediately following these measurements, throat and telencephalon tissue (these results to be presented in a different MS) were removed and stored in RNAlater. ‘Throat tissue’ specifically refers to ventral surface tissue between the opercula and immediately ventral to the ceratohyal cartilage, excluding tissue immediately ventral to the sternohyoideus; this includes (potentially) red colored skin tissue on the underside of the head, detached from underlying cartilage. Additionally, standard length was measured and each fish was dissected to confirm sex. Remaining tissue samples were immersed in RNAlater and stored in a -80°C freezer for later use. Animal use protocols were approved by East Carolina University's Institutional Animal Care & Use Committee (AUP #D224b) and experiments were performed in accordance with relevant institutional and national guidelines for the care and use of laboratory animals.
Tissues were homogenized in TRI reagent with a Bio-Gen 2000 Homogenizer (Pro Scientific) and RNA was extracted using the TRI Reagent protocol. Samples were then treated with DNase to remove contaminant DNA, and total RNA isolation was purified using the manufacturer's instructions for the RNeasy Kit (Qiagen). To determine RNA quality, RNA samples were run on a RNA 6000 picoLab Chip using a 2100 Bioanalyzer (Agilent). Average RNA quality for the samples used in this study was high (RNA integrity number, RIN, = 8.1). RNA was extracted from throat and brain tissue and stored in a -80°C freezer until sequencing. cDNA library preparations were conducted by the University of Illinois Roy J. Carver Biotech Center under the guidance of Dr. Alvaro Hernandez using Illumina TruSeq SBS sequencing kit version 4. Ninety-six libraries (52 throat, 44 telencephalon) were multiplexed on a full flow cell (8 lanes) Illumina HiSeq 2500 to yield approximately 18.8 million single-end, 100bp reads per library. Fastq files were generated and demultiplexed with the bcl2fastq v1.8.4 Conversion Software (Illumina).
Following sequencing, standard quality control (adapter removal and trimming of low-quality bases) was conducted using TrimGalore on default settings (Martin 2011; Krueger 2015). Reads were mapped to the stickleback reference genome (Jones et al. 2012) using Tophat version 2.0.13 (Trapnell et al. 2009). Tables of read counts, which were used as input for most downstream analyses, were generated using HT-Seq version 0.6.0 (Anders et al. 2015). They referenced Ensembl 79. Gene annotations were from Ensembl 100 unless otherwise indicated.
Color measurement and analysis
Reflectance spectra were processed with pavo (Maia et al. 2013) in the R environment (R Core Team 2017) using a perceptual model of threespine stickleback vision (Stuckert et al. 2019a) to generate a measure of chroma, r.vec, which can be thought of as color intensity, and two indices of wavelength, h.phi and h.theta. The r.vec variable, which in our analyses is a measure of red intensity and will henceforth be referred to as chroma, was the intended focus but all three variables proved to be strongly correlated (see Results). Some heteroscedasticity was present in chroma data owing largely to limited variance in the consistently low values of the Bonsall population (note that males in this site appear to be melanic based on our field observations; we have observed some red expression in males from this drainage held in the laboratory for an extended period, but not in males in the present study). This was not completely eliminated by transformations and significance values were high and robust, so analyses of raw data, conducted using JMP Pro 14.1.0, are presented for ease of interpretation. Reflectance data were not available for one Little Campbell Stream male, but that male’s RNAseq results were retained in most analyses as other key data were present.
Differential expression analyses
To characterize major patterns of variation in expression data, principal component analyses (PCA) were run using the plotPCA function of DE-Seq2 version 1.24.0 (Anders and Huber 2010). We then tested for effects of population and sex using ANOVA, in JMP Pro 14.1.0. ANOVA results for PC1 were checked and confirmed with non-parametric Wilcoxon tests as some heteroscedasticity was present and not eliminated by standard transformations. ANOVA results are reported for consistency.
Differential expression analyses were conducted using DE-Seq2 version 1.24.0 (unless otherwise indicated; Anders and Huber 2010), which tests for differential gene expression using a negative binomial general linearized model. Effects in DE-Seq2 are presented as log2 fold change, which for an experiment is log2 (treated/untreated); for continuous independent variables, as in some of our analyses, the reported log2 fold change is per unit of change in the continuous variable. Unless otherwise indicated, alpha values for DESeq2 analyses have been corrected for multiple testing using the method of Benjamini and Hochberg (1995) and are denoted as “padj”. We accepted the DESeq2 default independent filtering of lowly expressed genes, which optimizes the number of adjusted p-values, resulting in a padj of “NA” for some genes. Uncorrected significance tests are included in supplemental tables as a complement to corrected tests. Provisional characterizations of genes not named in Ensembl are presented when highly significant or otherwise of interest. Because some populations do not have red females and only one population lacks red males, our sample design is not balanced. We therefore conducted our analyses through focused comparisons rather than through a comprehensive model.
1. Analysis of red females in Matadero and LC Stream populations
We first tested whether expression of the same or different genes was associated with convergent female red throat coloration within LC Stream (BC) and Matadero (California) females. Red chroma was treated as a continuous variable to maximize power. (A) We asked if there were significant interactions between red chroma and population/region in gene expression. Genes that show a significant interaction have distinctive associations with color among populations, as predicted if the genes differ. If genes are the same, only significant main effects are expected. (B) We tested for associations between gene expression and chroma within each of the two populations, after significant interactions between chroma and population proved common.
2. Comparisons of red females and red males
Next, we tested whether genes whose expression correlated with red in females (1B above) were also associated with red across the sexes, as predicted by the genetic correlation hypothesis. Correlations between chroma and gene expression were analyzed for males and females together, first with a univariate analysis and then with sex included as a covariate, and the results compared with those for genes identified in 1B. The analysis was then repeated with the addition of an interaction term for sex*red chroma, to test for different relationships between chroma and gene expression in males versus females. These within-population analyses were conducted using DESeq2 1.32.0. In addition, we assessed whether in inter-population contrasts, males and red females of the same populations differed in expression of the same genes when contrasted with females from populations lacking red in females (as in 3 below).
3. Gene expression comparisons of red females with dull females from other populations
Because genes mediating variation in a trait within and between populations are not necessarily the same, we also conducted an “inter-population” analysis comparing red females (dull females were excluded to ensure a meaningful contrast) from each of LC Stream and Matadero against females from each of two freshwater-resident populations in which red females are less common (Salmon River) or absent (Bonsall Creek). That is, we compared red females from LC Stream with dull females from Salmon and with those from Bonsall, and the same for Matadero females. While these contrasts are expected to reveal genes that differ by population independent of color, genes associated with color will also be encompassed, especially genes that differ in expression between red females and both of the dull female populations.
Functional enrichment testing
Sets of genes differentially expressed with red coloration in females of each red female population were tested for functional over-representation using GOrilla (November 25-30, 2020), which tests rank-ordered gene lists for functional enrichment towards the top of the list (Eden et al. 2007, Eden et al. 2009). To run these analyses, stickleback Ensembl gene stable IDs were converted to those for zebrafish, which for some genes resulted in longer lists owing to zebrafish possessing multiple members of a gene family that all correspond to a single stickleback gene stable ID. To avoid artifacts we retained only the first (ordered by name) in the list of zebrafish genes corresponding to a stickleback gene; these results are presented. We tested the robustness of the results obtained by instead retaining only the last gene in the zebra fish list and rerunning the analysis. Using default GOrilla settings, padj of 0.05, and the “Process” ontology, the results were identical for the number of Gene Ontology terms shared between analyses with Matadero and with LC Stream females.
Candidate Gene Annotation and Analysis
Because some pigmentation pathways may not be thoroughly covered by gene ontology analyses (Baxter et al.2019), we also compared differentially expressed genes with a list of candidate pigmentation genes associated with pigmentation/color phenotypes. First, we compiled stickleback genes orthologous to 650 genes in a manually curated list associated with integumentary pigmentation phenotypes in humans, mice and zebrafish (Baxter et al. 2019). We used Ensembl (98; Sep. 29, 2019) to search for stickleback orthologs, retaining those which Ensemble assigned high orthology confidence (1 rather than 0). Of the 619 zebrafish genes in the list, we retained 526 stickleback orthologs; of 30 mouse genes lacking zebrafish homologs, we retained four; one human gene with neither a zebrafish nor mouse homologue lacked a stickleback homologue.
As a supplement to this extensively curated list of 530 genes, we used stickleback orthologs of genes in a color gene list assembled by Stuckert et al. (2019b), derived from a broader array of taxa but less extensively annotated and vetted. We first used Ensembl to search for stickleback orthologs corresponding to gene names. To be thorough, we then used the list to search for human and zebrafish genes, in turn using those to search for stickleback orthologs. We retained those with a stickleback gene name which Ensembl assigned high orthology confidence (1 rather than 0); we merged the three lists and removed duplicate genes. We augmented this set with genes associated with carotenoid pigmentation based on Toews et al. 2017, again using Ensembl to identify stickleback genes, a total of 10, corresponding to carotenoid related genes BCO1, BCO2 and CYP2J19 (though the latter is not expected to have direct homologues in fish); we used multiple stickleback genes from the same subfamilies where one to one orthology could not readily be established. TTC39B has been found to be pigment related (Hooper et al. 2019; Salis et al. 2019; Ahi et al. 2020a, b) so it was also added (and its second copy). The Stuckert et al. (2019b)/Toews et al. (2017) lists added 147 new genes. With Ensembl 100, some gene identifications changed such that 13 were removed, for a final total of 666 pigment-related genes (final list in Suppl. Table 1). We describe genes from this list as candidate pigmentation genes. In addition, genes were assigned to QTL Bayesian credible intervals from Yong et al. (2016) using the assembly update of Glazer et al. (2015).
We conducted network analyses of RNAseq data using the R package WGCNA 1.68. Weighted gene correlation network analysis (WGCNA) is a systems biology method that can be used to find clusters, called modules, of genes with highly correlated expression (Langfelder and Horvath 2008; Langfelder et al. 2011). We analyzed networks based on expression data transformed using DESeq2’s variance-stabilizing transformation (Anders and Huber 2010), with all genes not expressed in at least 50% of samples removed and a minimum module size of 30. We tested correlation of modules with red intensity in R, using the method of Benjamini and Hochberg (1995) to correct for multiple tests.
As with DE-Seq analyses, we used WGCNA to test for similarities and differences between Matadero and LC Stream populations. To do this, we built separate networks for each population, including males and females. Due to small sample sizes for LC Stream, we included the adjacent (and similar in overall gene expression) Salmon River population for network construction. The resulting modules were tested for associations with color (in both sexes and in females only). All modules, including color-associated modules, were then tested for preservation of network structure among populations. It is important to keep in mind that both males and females were included in these WGCNA analyses, when interpreting relationships between WGCNA modules and traits in a single sex, and that sample sizes were relatively small for this method.
We also tested for network preservation between sexes to identify whether genes associated with color in females were preserved in males. To do this required larger sample sizes, so we built another gene co-expression network for all of the British Columbia fish together. Using these modules, we tested for module preservation of color-associated modules between males and females.
Reflectance data are provided in individual files generated by the spectrometer and saved in a compressed folder, “Files1029,19Analyses.zip”. Population and sex information are provided in file names as described for the column titles in the main data file. An R script, “McKinnon_R_code_Reflectance_Analyses”, is provided for generating chroma measures from these data using a model of stickleback visual perception based on cone data provided in the file, “stickleback.visual.sensitivityJSM1030,19.csv”. Reflectance data are lacking for one Little Campbell stream resident male, as described in the MS and indicated in R code for the man analyses. The chroma and other values generated were initially analyzed in JMP as described in the MS. They are also included in the R scripts for the analyses of chroma by population and sex.
The core data set is “Complete_Throat_Count_Table0328,19”, which provides expression data of 22,456 genes for 51 threespine stickleback from Western North American populations. The R code that accompanies the data set (McKinnon_R_code_DESeq2_Analyses) assigns population, sex, and color measurements to each individual but the sex and population of each fish can also be inferred from the ID code at the top of each column. The first letter indicates the population: A-anadromous (from the lower Little Campbell); L-Little Campbell stream-resident; S-Salmon River; B-Bonsall; M-Matadero. The second letter indicates sex: M-male; F-female.
Principal component analysis scripts are provided in: McKinnon_R_code_PrincipComponent_Analyses
The principal components themselves were analyzed in JMP as described in the MS.
DESeq analysis scripts are provided in McKinnon_R_code_DESeq2_Analyses.
Data used in the WGCNA analyses of population are samples from specific populations that came from the main date set: Complete_Throat_Count_Table0328_19.txt, after variance stabilization in DESeq2 and removal of genes not expressed in at least 50% of samples or expressed in the Matadero population. For the WGCNA analyses by population (R scripts in McKinnon_R_code_WGCNA_PopAnalyses) the resulting data files were:
Those scripts also required (for preliminary analyses in WGCNA) information about traits in those populations, which were in: traits.throatsMats101819.csv, traits.throatsLCSalm120619.csv
Data used in the WGCNA analyses of sex are samples of each sex from British Columbia populations that came from the main date set: Complete_Throat_Count_Table0328_19.txt, after variance stabilization in DESeq2 and removal of genes not expressed in at least 50% of samples. The resulting data files (R scripts in McKinnon_R_code_WGCNA_SexAnalyses_ ) were:
Those scripts also required (for preliminary analyses in WGCNA) information about traits in those populations, which were in: traits.throats0429_19NoMATS_Males120819.csv, traits.throats0429_19NoMATS_Females120819.csv
National Institute of General Medical Sciences, Award: R15GM109291