Transcriptome data: salinity adaptation in Rhithropanopeus harrisii across an estuarine gradient
Tepolt, Carolyn; Blakeslee, April; Fowler, Amy (2021), Transcriptome data: salinity adaptation in Rhithropanopeus harrisii across an estuarine gradient, Dryad, Dataset, https://doi.org/10.5061/dryad.573n5tb7r
Rhithropanopeus harrisii is a common estuarine crab native to the East and Gulf Coasts of North America. Here, it is found along a broad range of salinities, spanning from ~1 PSU to ~25 PSU along estuarine gradients. As part of a larger study on the species' potential use of low-salinity refuges from parasitism, we tested for differences in gene expression with salinity using crabs from three distinct estuarine reaches along the Pamlico River in North Carolina. We acclimated crabs collected at natural sites with 1, 5, and 11 PSU mean salinity to 0.8, 3, and 15 PSU in the laboratory for one week and sequenced gill tissue (second posterior gill) from each group using mRNA-Seq.
Data in this repository includes information on sequenced samples (*csv), the de novo transcriptome assembly (two versions: with and without expression filtering; *.fasta), transcriptome annotation from EnTAP (two files: using the runN and runP functions; *.tsv), a spreadsheet of normalized read count data from salmon pseudo-alignment (*.tsv), high-quality SNPs identified from the transcriptome sequencing with GATK (*.vcf), and three custom scripts used in processing the SNP data (*.py and *.R).
Raw sequence data is archived in GenBank's SRA: BioProject PRJNA730785, BioSamples SAMN19241288-SAMN19241333.
Methods reproduced from the associated paper, "Invasion of the body snatchers: the role of parasite introduction in host distribution and response to salinity in invaded estuaries".
Salinity Exposure Experiment
Adult R. harrisii were collected in January 2019 from three populations in the Pamlico River: the Estuarium (~1 PSU mean salinity), Mallard Creek (~5 PSU mean salinity), and Wright's Creek (~11 PSU mean salinity). Crabs were acclimated for three days at their collection salinities and held at 20°C. 20 crabs from Mallard Creek and Wright's Creek were randomly divided into two groups and acclimated to 3 or 15 PSU. Crabs were ramped ±2.5 PSU every 24h until reaching final experimental salinities. For the Estuarium, 10 crabs each were ramped to 3 and 15 PSU, with a third group initiated at 0.0 PSU and transitioned to 0.8 PSU after 5d. After reaching experimental salinities, crabs were maintained for 7d in individual chambers of an 18-well polycarbonate parts-box. Crab position was randomized across treatments, and boxes were rotated daily. Crabs were fed every 3d (carnivore pellets), and water was changed 24h post-feeding. Following the acclimation period, crabs were dissected and the second posterior (primarily osmoregulatory) gill from the crab’s left and right sides were placed into cryovials containing 1ml RNALater. Preserved gills were stored at -80°C until processing.
Sample sequencing, transcriptome assembly, and mapping for gene expression
RNA was extracted from stabilized gills using a modified version of the TRI reagent protocol (Simms et al. 1991), with the addition of 10 mg of glycogen to 500 uL of the aqueous phase before precipitation followed by 75% ethanol wash steps. After extraction, 47 samples yielded sufficient DNA to proceed with cDNA library construction using an Illumina TruSeq stranded mRNA sequencing kit. Forty samples were sequenced across two lanes of 50 bp single-read sequencing on an Illumina HiSeq4000 at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley; seven samples (randomized across groups) were sequenced with samples for another project on a single lane of 150 bp paired-end sequencing at Genewiz (South Plainfield, NJ, USA).
In total, 46 samples yielded good sequence data and were used in the genomic analyses, with 4-9 individual replicates per treatment group. Raw reads were trimmed with Trim galore! V0.6.4_dev, using cutadapt v2.6, to remove bases with Phred <20 and ≥3 bp adaptor overlap with a maximum trimming error rate of 0.1 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/; Martin 2011). Reads were discarded if, after trimming, they were <20 bp (50 bp single-end reads) or <50 bp (150 bp paired-end reads). We assembled a de novo transcriptome with Trinity v2.11.0 (Haas et al. 2013), using the single-end reads from all 46 samples including the forward reads from the samples sequenced with 150 bp paired-end sequencing. For the Trinity assembly, we set the minimum contig length to 200, normalized coverage to a maximum of 50 reads, and set the minimum kmer coverage to 2. The resulting de novo transcriptome had 140,196 contigs and 214,737 isoforms; we retained only the longest isoform per gene for downstream analysis.
The de novo transcriptome was annotated using EnTAP v0.9.1 with Uniprot’s swissprot and TrEMBL protein databases and NCBI’s non-redundant nucleotide database as references (Hart et al. 2020). EnTAP was run twice, with both blastp (e-value ≤10-5) and blastx (e-value ≤10-3) alignments. Between the two approaches, we were able to annotate 16,894 contigs with a sequence match, and an additional 5,613 contigs with a gene family match. Of the contigs we could annotate with EnTAP, 1,240 were identified as non-crab contamination (e.g. matches to fungi, plants, bacteria, etc). To additionally screen for potential contamination by Loxothylacus panopaei, a parasitic barnacle which grows tissue throughout the host’s body, we ran a local blastn search with BLAST+ against a de novo transcriptome for L. panopaei (e-value ≤10-3; Camacho et al. 2009; Tobias et al., in review). To identify non-messenger RNA, we ran another local blastn search against the SILVA rRNA representative database (e-value ≤10-3; Quast et al. 2012). We identified 134 and 206 contigs as likely L. panopaei and rRNA contamination, respectively. We removed all 1,467 contigs identified as contamination with any approach from the transcriptome, leaving a de novo R. harrisii gill transcriptome of 138,729 contigs.
For all downstream analyses, paired-end reads were converted to 50 bp single-end reads by taking only the forward reads and cutting them to 50 bp before trimming as above. This was done to ensure that mapping and subsequent analysis would not be biased by sequence length or format. For differential gene expression analysis, we mapped individual reads back against the full cleaned transcriptome with Salmon v1.3.0 (Patro et al. 2017).
SNP Identification and Genoptyping
For sequence-based analyses, we further filtered the transcriptome to remove contigs with low sequence support (<1 TPM), leaving 43,908 contigs. We mapped individual sample reads back to the transcriptome with Bowtie 2 v2.4.1 (Langmead & Salzberg 2013), using the default settings. Reads were mapped against a transcriptome that included contigs that were putative outside or rRNA contamination to reduce bias; these contigs were removed post-mapping but before any subsequent analysis. We also ran blastn searches against the full mitochondrial genomes of six Portunidae species (e-value ≤10-3), and removed 49 contigs as likely mitochondrial after mapping. SNPs were identified and called using the Genome Analysis Toolkit v188.8.131.52 (GATK; McKenna et al. 2010; DePristo et al. 2011). With GATK we initially selected only biallelic SNPs with a quality ≥20 Phred; this yielded 417,963 raw SNPs. We further filtered these markers using a custom python script which retained only SNPs which had high-quality (≥20 Phred), high-coverage (≥5 reads) individual genotypes in at least 9 individuals in each of the three sources. We removed one sample from Wright’s Creek from downstream sequence analysis because of high genotype missingness (>20%). After this process, we had a set of 23,303 high-quality SNPs in 45 individuals, with each individual genotyped at ≥84% of these SNPs. We further screened this SNP panel to remove SNPs in strong linkage disequilibrium (R2 ≥ 0.8). To do this, we first calculated within and between-contig LD values using vcftools v0.1.16 (Danecek et al. 2011), converted these pairwise R2 values to a matrix using a custom R script, and used the R package ‘igraph’ v1.2.5 to identify clusters of SNPs in strong LD (Csárdi & Nepusz 2006). All but one SNP in each such cluster was removed, leaving a set of 16,592 high-quality, high-coverage, independent SNPs for downstream sequence analysis.
All necessary information is available in the uploaded files.
Thomas Harriot College of Arts and Sciences, East Carolina University, Award: Research Initiation Grant