Data from: Nascent transcription reveals regulatory changes in extremophile fishes inhabiting hydrogen sulfide-rich environments
Data files
Abstract
Regulating transcription allows organisms to respond to their environment, both within a single generation (plasticity) and across generations (adaptation). We examined transcriptional differences in gill tissues of fishes in the Poecilia mexicana species complex (family Poeciliidae), which have colonized toxic springs rich in hydrogen sulfide (H2S) in southern Mexico. There are gene expression differences between sulfidic and non-sulfidic populations, yet regulatory mechanisms mediating this gene expression variation remain poorly studied. We combined capped-small RNA sequencing (csRNA-seq), which captures actively transcribed (i.e., nascent) transcripts, and traditional messenger RNA sequencing (mRNA-seq) to examine how variation in transcription, enhancer activity, and associated transcription factor binding sites may facilitate adaptation to extreme environments. csRNA-seq revealed thousands of differentially initiated transcripts between sulfidic and non-sulfidic populations, many of which are involved in H2S detoxification and response. Analyses of transcription factor binding sites in promoter and putative enhancer csRNA-seq peaks identify a suite of transcription factors likely involved in regulating H2S-specific shifts in gene expression, including several key transcription factors known to respond to hypoxia. Our findings uncover a complex interplay of regulatory processes that reflect the divergence of extremophile populations of P. mexicana from their non-sulfidic ancestors and suggest shared responses among evolutionarily independent lineages.
Methods
Sample collection
Adult female fish were collected via seine net from sulfidic and non-sulfidic sites in the Pichucalco, Puyacatengo, and Tacotalpa drainages of the Río Grijalva basin in southern Mexico (N = 5–6 per site; electronic supplementary material, table S1) as part of a previous study by Kelley et al. (Kelley et al., 2016). In the Pichucalco drainage, sulfidic springs are inhabited by the highly endemic extremophile Poecilia sulphuraria (sulphur molly; figure 1b, top) (Alvarez del Villar, 1948), but non-sulfidic springs are populated by P. mexicana (figure 1b, bottom). In the Puyacatengo and Tacotalpa drainages, P. mexicana inhabits both the sulfidic and non-sulfidic sites (figure 1c–d). Fish were sacrificed, and their gill arches were extracted, preserved in RNAlater (Invitrogen, Inc), and stored at -80 ºC. Gill tissues were selected because they are directly exposed to H2S in the water and were previously shown to have significant gene expression differences between the sulfidic and non-sulfidic ecotypes (Kelley et al., 2016; Passow et al., 2017).
While inter-drainage variation in adaptation to H2S has been demonstrated in Poecilia (Greenway et al., 2020; Pfenninger et al., 2014), we were most interested in the broad patterns of gene regulation across ecotypes, as there is pre-established convergence in the differential expression of H2S-related genes in all sulfidic populations (Kelley et al., 2016; Passow et al., 2017). For this reason, we compared gill tissues sampled from all sulfidic fish to those from all non-sulfidic fish combined across drainages.
csRNA library preparation and sequencing
csRNA-seq was generated from a subset of the fish samples described above (N = 6 fish per population; electronic supplementary material, table S1). Total RNA was extracted from gill tissues using Qiagen’s miRNeasy Mini Kit following all standard protocols. Total RNA concentrations were estimated with the Qubit RNA HS Assay Kit and the Agilent 2100 BioAnalyzer using the RNA 6000 Nano Kit. Short RNAs (~20–60 nucleotides) were isolated using size selection with gel electrophoresis. For each sample, a 10% aliquot of the size-selected RNA suspension was reserved for use as an “input library” (Duttke, Beyhan, et al., 2022). These aliquot are sequenced and used in the HOMER pipeline to control for exonic contamination when identifying transcription initiation sites (Duttke et al., 2019). For the remaining size-selected RNA for each sample, 5’ 7-methylguanosine capped RNA was isolated following protocols from Duttke et al. (Duttke et al., 2019), and this constituted our csRNA libraries. csRNA and the reserved input aliquots were converted into cDNA libraries for each sample, amplified with PCR, size selected to remove primer dimers and purified with gel electrophoresis, pooled, and single-end 75-bp reads were sequenced using an Illumina NextSeq 500.
Analysis of differential transcription initiation using csRNA-seq
We used FastQC (v. 0.11.9) to examine the quality of the raw csRNA and input RNA reads (Andrews, 2010). Subsequent analyses were carried out using the csRNA-seq analysis pipeline in HOMER (v. 4.11) (Duttke et al., 2019). We used HOMER’s trim to remove Illumina TruSeq adapters from the 3’ end of reads, discard reads less than 20 bp, and trim the ends of reads below a Phred score of 20. We indexed the P. mexicana reference genome (GenBank assembly accession: GCF_001443325.1) (Warren et al., 2018) using STAR (v. 2.7.6a) and then aligned the reads with STAR, outputting one primary alignment per read (Dobin et al., 2013). Tag directories were created for the uniquely mapped csRNA libraries and the input libraries for each sample using HOMER’s makeTagDirectory, including the -single flag to account for the large number of scaffolds in the reference genome and -fragLength set to 30 bp to reflect the average length of the nascent RNAs captured by the csRNA-seq protocol (Duttke et al., 2019). A third tag directory was created for each sample using the mapped mRNA libraries (described below) and used to control for false positives and exonic contaminants in the next step. The csRNA, input, and mRNA tag directories were input into HOMER’s findcsRNATSS.pl to identify sites of transcription initiation, hereafter called peaks. This step of the HOMER pipeline also determines the stability of initiated transcripts by assessing whether stable mRNA transcripts map within -100 bp to +500 bp of each csRNA-seq peak. Unstable, nascent transcripts are not processed into stable mRNAs and are thus determined by the lack of stable mRNA transcripts mapping to the adjacent region (see (Duttke et al., 2019) for additional detail). We retained peaks with at least seven csRNA-sequencing reads per 10 million reads for further analysis following the default parameters. Peaks were annotated using the GTF file for P. mexicana [GenBank assembly accession: GCF_001443325.1, converted from GFF using gffread (v. 0.9.9)]. HOMER’s mergePeaks was used to create a non-redundant list of peaks across all samples, condensing peaks that directly overlapped. The merged peaks file was used with homerTools annotatePeaks.pl and the -strand + -fragLength 1 options required for csRNA-seq to quantify the raw read counts for each sample, associate each peak with the nearest annotated gene from the GFF, and identify the region of the genome they mapped to relative to this gene [i.e., in an exon, intergenic, intron, promoter or transcription start site (promoter-TSS), or transcription termination site (TTS)]. The promoter-TSS region is defined as the region 1000 bp upstream through 100 bp downstream of a gene’s annotated TSS in the GTF file. We the identified differentially initiated peaks in R using edgeR (v. 4.0.12) (Robinson et al., 2010) by comparing all sulfidic samples to all non-sulfidic samples while including drainage and species as covariates. Significantly differentially initiated peaks were defined as those with a false discovery rate (FDR) < 0.05. Variation in transcription initiation across samples was plotted using a principal components analysis (PCA), using the 500 peaks with the highest average signal across all samples using the R package limma (v. 3.46.0). Hierarchically clustered heatmaps of csRNA-seq peaks were generated with the R package pheatmap (v.b1.0.12) using the top 5,000 peaks based on overall number of counts across samples and subsequently with only significantly differentially initiated peaks.
We used a curated a list of genes with Gene Ontology (GO) terms related to H2S detoxification or response (GO:0070221, GO:0006790, GO:0044273, GO:0070813, GO:0000096, GO:0006090, GO:0006749, GO:1901687, GO:0000098, GO:0008272, GO:0006534, GO:0044272) to carry out a Fisher’s exact test using fisher.test() in base R (v. 4.0.3), which tested whether peaks in or near genes related to H2S detoxification and response were overrepresented in the list of significantly differentially initiated peaks.
mRNA library preparation and sequencing
Samples were prepared and sequenced as part of a previous study by Kelley et al. (Kelley et al., 2016). In brief, total RNA was extracted from gill tissues (N = 17 sulfidic fish, N = 18 non-sulfidic fish; electronic supplementary material, table S1), poly-A messenger RNA (mRNA) were isolated, and libraries were sequenced using an Illumina HiSeq 2000 using paired-end 101-bp reads.
mRNA differential gene expression
The quality of all raw reads was assessed using FastQC. Trim Galore! (v. 0.4.2) was used to remove the first 11 bp from the 5’ ends of reads (--clip_R1 11 --clip_R2 11) and perform adapter trimming (--stringency 6) but not quality trimming (--quality 0) (Krueger, 2014). A second round of Trim Galore! was used to quality trim the ends of reads with a Phred score below 24 (--quality 24) and remove reads less than 50 bp long (--length 50). HISAT2 hisat2-build (v. 2.1.0) was used to index the P. mexicana reference genome with the mitochondrial genome appended (GenBank accession: KC992995.1) (Pfenninger et al., 2014). Trimmed reads were mapped using the --downstream-transcriptome-assembly flag for compatibility with StringTie (v. 2.0.3), --fr specifying the mate pair orientation, and read group information included for sample, read group identifier, and platform (Kim et al., 2019). SAMtools view (v. 1.9) was used to convert the resulting SAM files to BAM files, sort to sort the BAM files by coordinate, and merge to combine samples with more than one technical replicate (Li et al., 2009). StringTie (v. 2.0.3) with the -eB flag was used to generate read coverage tables in GTF format for each sample (Pertea et al., 2016). An edited version of the prepDE.py script from StringTie (see code repository) was used to generate gene counts matrices, resulting in counts for nuclear genes only. Genes were filtered by a minimum counts-per-million (cpm) of at least five in the smallest library and to have non-zero counts in at least five samples. Using edgeR (v. 4.0.12), libraries were normalized using the trimmed mean of M-values (TMM, the default) and quasi-likelihood F-tests were used to compare differential gene expression in sulfidic samples compared to non-sulfidic samples with species and drainage included as covariates (McCarthy et al., 2012; Robinson et al., 2010).
Nascent transcription near regions with stable gene expression
We investigated if patterns in differential transcript initiation between ecotypes matched patterns in nearby gene expression. Peaks with differential transcript initiation from csRNA-seq (FDR < 0.05) were intersected with differentially expressed genes from mRNA-seq (FDR < 0.05) based on their gene annotation to identify genomic regions that showed both changes in nascent transcription and stable gene expression between ecotypes. We then examined directionality to see if csRNA and mRNA datasets were both upregulated in sulfidic samples compared to non-sulfidic samples, both downregulated, or divergent. Those that were upregulated in both datasets or downregulated in both datasets are hereafter referred to as the “parallel upregulated subset” and “parallel downregulated subset”, respectively.
TF binding site enrichment in differentially initiated promoter regions
We identified TF binding sites in differentially initiated csRNA-seq peaks in the promoter-TSS regions of DE genes, focusing specifically on the upregulated and downregulated subsets defined above. We used homerTools findMotifsGenome.pl to extract FASTA sequences spanning a 200-bp region from 150 bp upstream through 50 bp downstream of the peak center, which should approximate the site of transcription initiation. We chose this narrow window in order to conservatively identify only the primary TF binding sites involved in initiating gene expression. We used CiiiDER (Gearing et al., 2019) to identify significantly enriched TF binding motifs in these regions using the JASPAR CORE 2022 non-redundant vertebrate transcription factors database (Sandelin et al., 2004). We used a maximum deficit of 0.15 to allow for non-exact motif matching, and used a stringent p-value cutoff of 0.01 for classifying significantly enriched binding sites. A set of promoter-TSS regions that were not differentially initiated (FDR > 0.05 and log2 fold-change < 0.5) and were not associated with a differentially expressed gene between ecotypes were used as the background for these enrichment analyses (n = 7,443).
Identification and analysis of putative enhancer regions
csRNA-sequencing can capture enhancer RNAs, small unstable transcripts that result from active enhancer regions (Duttke, Montilla-Perez, et al., 2022; Lim et al., 2021). Enhancer RNA abundance often correlates with the expression of an enhancer’s target gene(s) (Arnold et al., 2020; Azofeifa et al., 2018); we therefore correlated differential initiation of putative enhancer csRNA-seq peaks with the expression of nearby genes to identify putative enhancers exhibiting differential activity between ecotypes and to infer genes that may be targeted by these enhancers. csRNA-seq peaks putatively corresponding to enhancer regions were defined as peaks annotated as distal from nearby genes (>500 bp from the promoter region of the nearest gene) and being comprised of unstable transcripts (i.e., transcripts present in csRNA but not mRNA libraries). This list of peaks was then intersected with the results of differential initiation analysis to identify putative enhancers with significantly increased or decreased activity in the sulfidic ecotype.
We then identified all significantly differentially expressed genes with a TSS located less than 500 kilobase pairs (kbp) from the center of an ecotype-responsive putative enhancer peak by first expanding all ecotype-responsive putative enhancer peaks by 500 kbp in both directions using the slop tool within bedtools (v. 2.30.0) and then intersecting these expanded regions with a bedfile of gene TSS positions using bedtools intersect. The resulting pairs of putative enhancers and potential target genes were filtered to retain those with the same direction of differential initiation/expression in the sulfidic ecotype (i.e., upregulated csRNA-seq peak and upregulated gene expression, and vice versa), and the log2 fold-change values of csRNA-seq and mRNA-seq were correlated using linear regressions in R. To focus downstream analyses only on putative target genes with high correlations to putative enhancer regions, pairs of putative enhancers and target genes were filtered to retain only those falling within one standardized residual of the regression line.
For the resulting putative enhancer peaks that were highly correlated with the expression of their inferred target gene(s), fasta sequences were extracted using the getfasta tool from bedtools. A background set of 1,000 peaks were randomly selected from all putative enhancer peaks that were not differentially initiated between ecotypes (i.e., distal csRNA-seq peaks comprised of unstable transcripts and with an FDR > 0.05 in differential initiation analyses). CiiiDER was again used to identify significantly enriched TF binding motifs in these regions using the JASPAR CORE 2022 non-redundant vertebrate transcription factors database. We used a maximum deficit of 0.15 to allow for non-exact motif matching, and used a stringent p-value cutoff of 0.01 for defining significantly enriched binding sites.
References
Alvarez del Villar, J. (1948). Descripción de una nueva especie de Mollienisiacapturada en Baños del Azufre, Tabasco (Pisces, Poeciliidae).
Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Arnold, P. R., Wells, A. D., & Li, X. C. (2020). Diversity and emerging roles of enhancer RNA in regulation of gene expression and cell fate. Frontiers in Cell and Developmental Biology, 7, 377.
Azofeifa, J. G., Allen, M. A., Hendrix, J. R., Rubin, J. D., & Dowell, R. D. (2018). Enhancer RNA profiling predicts transcription factor activity. Genome Research, 28(3), 334–344.
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21.
Duttke, S. H., Beyhan, S., Singh, R., Neal, S., Viriyakosol, S., Fierer, J., Kirkland, T. N., Stajich, J. E., Benner, C., & Carlin, A. F. (2022). Decoding transcription regulatory mechanisms associated with coccidioides immitis phase transition using total RNA. Msystems, 7(1), e01404-21.
Duttke, S. H., Chang, M. W., Heinz, S., & Benner, C. (2019). Identification and dynamic quantification of regulatory elements using total RNA. Genome Research, 29(11), 1836–1846.
Duttke, S. H., Montilla-Perez, P., Chang, M. W., Li, H., Chen, H., Carrette, L. L. G., Guglielmo, G. de, George, O., Palmer, A. A., & Benner, C. (2022). Glucocorticoid receptor-regulated enhancers play a central role in the gene regulatory networks underlying drug addiction. Frontiers in Neuroscience, 16, 858427.
Gearing, L. J., Cumming, H. E., Chapman, R., Finkel, A. M., Woodhouse, I. B., Luu, K., Gould, J. A., Forster, S. C., & Hertzog, P. J. (2019). CiiiDER: A tool for predicting and analysing transcription factor binding sites. PloS One, 14(9), e0215495.
Greenway, R., Barts, N., Henpita, C., Brown, A. P., Arias Rodriguez, L., Rodriguez Pena, C. M., Arndt, S., Lau, G. Y., Murphy, M. P., Wu, L., Lin, D., Tobler, M., Kelley, J. L., & Shaw, J. H. (2020). Convergent evolution of conserved mitochondrial pathways underlies repeated adaptation to extreme environments. Proc Natl Acad Sci U S A. https://doi.org/10.1073/pnas.2004223117
Kelley, J. L., Arias-Rodriguez, L., Patacsil Martin, D., Yee, M. C., Bustamante, C. D., & Tobler, M. (2016). Mechanisms Underlying Adaptation to Life in Hydrogen Sulfide-Rich Environments. Mol Biol Evol, 33(6), 1419–1434. https://doi.org/10.1093/molbev/msw020
Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol, 37(8), 907–915. https://doi.org/10.1038/s41587-019-0201-4
Krueger, F. (2014). Trim Galore!: A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files (0.3.7).
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078–2079.
Lim, J.-Y., Duttke, S. H., Baker, T. S., Lee, J., Gambino, K. J., Venturini, N. J., Ho, J. S. Y., Zheng, S., Fstkchyan, Y. S., & Pillai, V. (2021). DNMT3A haploinsufficiency causes dichotomous DNA methylation defects at enhancers in mature human immune cells. Journal of Experimental Medicine, 218(7), e20202733.
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res, 40(10), 4288–4297. https://doi.org/10.1093/nar/gks042
Passow, C. N., Henpita, C., Shaw, J. H., Quackenbush, C. R., Warren, W. C., Schartl, M., Arias‐Rodriguez, L., Kelley, J. L., & Tobler, M. (2017). The roles of plasticity and evolutionary change in shaping gene expression variation in natural populations of extremophile fish. Molecular Ecology, 26(22), 6384–6399.
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols, 11(9), 1650–1667.
Pfenninger, M., Lerp, H., Tobler, M., Passow, C., Kelley, J. L., Funke, E., Greshake, B., Erkoc, U. K., Berberich, T., & Plath, M. (2014). Parallel evolution of cox genes in H2S-tolerant fish as key adaptation to a toxic environment. Nat Commun, 5, 3873. https://doi.org/10.1038/ncomms4873
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.
Sandelin, A., Alkema, W., Engström, P., Wasserman, W. W., & Lenhard, B. (2004). JASPAR: an open‐access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research, 32(suppl_1), D91–D94.
Warren, W. C., Garcia-Perez, R., Xu, S., Lampert, K. P., Chalopin, D., Stock, M., Loewe, L., Lu, Y., Kuderna, L., Minx, P., Montague, M. J., Tomlinson, C., Hillier, L. W., Murphy, D. N., Wang, J., Wang, Z., Garcia, C. M., Thomas, G. C. W., Volff, J. N., … Schartl, M. (2018). Clonal polymorphism and high heterozygosity in the celibate genome of the Amazon molly. Nat Ecol Evol, 2(4), 669–679. https://doi.org/10.1038/s41559-018-0473-y