Data from: Pesticide and pathogen exposure causes idiosyncratic gene expression responses across four diverse North American bumble bee species
Data files
Aug 04, 2025 version files 2.42 MB
-
adapters.fa
561 B
-
data_module_eigenvalues.csv
8.07 KB
-
data_qpcr_rqs.csv
59.16 KB
-
data_raw_read_counts.csv
2.34 MB
-
README.md
9.17 KB
Abstract
Bumble bee populations of certain species have been declining precipitously in North America over several decades. Hypotheses for declines include exposure to the fungal pathogen Nosema bombi and neonicotinoid pesticides. Importantly, populations of some bumble bee species remain stable despite their presumed exposure to these same stressors. Here we hypothesize that declining and stable species, respectively, exhibit distinct responses to N. bombi and neonicotinoid pesticides, detectable as differential gene expression profiles. To test this, we exposed developing larvae from colonies of Bombus occidentalis (declining) and B. impatiens (stable) to N. bombi and to the neonicotinoid imidacloprid, plus a combination of both. RNA-seq analysis revealed almost no overlap between these species in gene expression responses to the individual stressors. Combined-stressor effects were more similar, but nevertheless differed significantly in differentially expressed genes and in gene coexpression network modules. To test whether the molecular responses correlated within declining and stable species, we carried out quantitative PCR on twenty target genes and included a second pair of species, B. terricola (declining) and B. griseocollis (stable). Results showed no correlation with decline or stable status, with each of the four species exhibiting species-specific responses. Overall, these results highlight that generalizing causes of decline across different species may be misleading, as diverse species respond in a species-specific manner to particular environmental stressors.
https://doi.org/10.5061/dryad.rxwdbrvjv
Description of the data and file structure
There are a total of four datasets and eleven scripts available from this resource.
Datasets "data_raw_read_counts.csv" and "data_module_eigenvalues.csv" can be generated following the RNA-seq pipeline; "adapters.fa" can be provided by sequencing services; "data_qpcr_rqs.csv" was processed from raw data generated from the thermocycler used for the analysis.
Bash scripts (script numbers 00 to 06) are automated and can be run from a Linux OS prompt, but some modifications are to be made before running the scripts, besides installing program and library dependencies. R scripts (script numbers 07 to 10) are to be run in an R environment line by line in a prompt with active R environment or in a R GUI (such as Rstudio), running Rscript won't work. The required modifications and other details are described in the Code/software section of this README.
Files and variables
File: adapters.fa
Description: FASTA file with Illumina Truseq sequences in FASTA format. Used by Trimmomatic to perform adapter removal.
File: data_raw_read_counts.csv
Description: CSV table with the counts assigned to gene features of the B. impatiens genome v2.2 from the RNA-seq libraries. This is the input for the scripts "07_DESeq2_script.R" and "08_WGCNA_from_rawCounts.R".
Variables
- Transcripts: Gene ID from the B. impatiens genome v2.2, in the format of LOCXXXXXXXX.
- Bimp_057_00 to Bocc_263D_00: raw read counts mapped to transcripts (i. e. raw gene expression values) for all samples. The labelling pattern is as follows: "SSSS_CCC_TT"; where S = species (Bimp and Bocc), C = Colony (in numerals), and T = Treatment (00 as control, 0N as Nosema, I0 as Imidacloprid, and IN as Imidacloprid + Nosema).
File: data_module_eigenvalues.csv
Description: CSV table with the module eigenvalues (MEs) per sample generated by the WGCNA R package. This dataset is supplied to avoid building the co-expression network from scratch in case analyses with just the MEs are required, given the amount of time and computationoal resources WGCNA can require in personal computers and in laptops. This is the input file for some steps within the script "08_WGCNA_from_rawCounts.R".
Variables
- SampleID: sample name, each sample is labeled as the following labelling: "SSSS_CCC_TT"; where S = species (Bimp and Bocc), C = Colony (in numerals), and T = Treatment (00 as control, 0N as Nosema, I0 as Imidacloprid, and IN as Imidacloprid + Nosema).
- species: Species name. Bimp = B. impatiens, Bocc = B. occidentalis
- treatment: Experimental treatment. 0 = control, 0N = Nosema, I0 = Imidacloprid, IN = Imidacloprid + Nosema
- colony: Source colony identification number
- M1 to M13: ME values per sample and per module. Each module is a cluster of various co-expressed genes, the MEs represent a single expression vector per sample based on the expression of all genes included in an specific module.
- NoModule: ME value of the genes that were not included in any module. Disregard this column.
File: data_qpcr_rqs.csv
Description: CSV table with the relative quantity values (RQs) for the genes analyzed in the qPCR experiment. This is the input file for some steps within the script "10_qPCR_analyzer_withLoops.R"
Variables
- SampleName: Label of the sample, following the labelling pattern "source colony number", in numerals, followed by the "microcolony label", as characters, plus "larvae identifier", in numerals.
- Species: Bimp = B. impatiens, Bocc = B. occidentalis, Bgris = B. griseocollis, Btrc = B. terricola.
- Colony: source colony identification number
- Treatment: Experimental treatment. Explicitly labeled.
- Abd to Trypsin: RQs for the genes shown in the table head row. Note that some rows show missing values marked as "NA" (as in not available), and in order to run the scripts to generate the qPCR analyses and figures, the file must retain the "NAs".
Code/software
The scripts are numbered: scripts 00 to 06 performs step of the RNA-seq analysis that are automated for a bash shell (i.e. can be run in a prompt given the file and directory structure as well as the path are correctly set), while scripts 07 to 10 are to be run in an R environment entering each line or chunk of code separately in a prompt (be it from a Linux OS with R environment loaded or from an R oriented GUI such as Rstudio). The analysis results will be generated in the path provided in the variable $projectDir (change it manually in the script before running it). Note that all the dependency (programs or libraries used by the scripts) should be installed previously before running the scripts:
- 00_setDirs.sh: bash script that will read all needed folders, setting the directory structure. The directory structure should look like this:
/home/user/analysis/
|_src
adapter.fa # move the file from this dataset to this folder
|_data
|_raw-seq # Download BioProject PRJNA1071869 reads here
|_genome
|_results
|_fastqc-rawseq
|_fastqc-trimseq
|_trimmomatic
|_star
|_htseq-count
|_deseq2
|_wgcna
|_topGO
|_qpcr
- 01_RawReadsQC.sh: bash script that runs FastQC and MultiQC on the raw reads to produce an initial quality check report.
- 02_TrimReads.sh: bash script that runs Trimmomatic on the raw reads, to remove low quality positions and adaptor contamination. I also produces a second batch of post-trimming FastQC and MultiQC reports.
- 03_GenomePrep.sh: bash script that downloads the FASTA and GTF files of the B. impatiens genome v2.2 from the NCBI genome repository.
- 04_IndexSTAR.sh: bash script that generates a genome index by running STAR.
- 05_AlignSTAR.sh: bash script that will produce alignment files per sample, by aligning the trimmed reads against the reference genome, using STAR. MultiQC is run to generate a report with alignment statistics.
- 06_HTSeqCount.sh: bash script that runs htseq-count to summarize the read counts per gene feature and per sample. It will generate several files that can be joined with unix commands into a single file, however I provided this file in the downloadable datasets, "data_raw_read_counts.csv".
- 07_DESeq2_script.R: R script that generates all the differential gene expression analysis data, figures and tables, by running DESeq2 and other auxiliary programs. The input file should be "data_raw_read_counts.csv".
- 08_WGCNA_from_rawCounts.R: R script that generates all the weighted gene co-expression network analysis data, figures and tables, by running WGCNA and other auxiliary programs. The main input file is "data_raw_read_counts.csv", but the data file "data_module_eigenvalues.csv" can also be loaded if only module eigenvalue statistics are to be calculated (producing the whole network from scratch can be time and resource consuming for personal computers and laptops).
- 09_run_topGO.R: R script that generates all the gene ontology (GO) term enrichment analysis data, figures and tables, by running topGO. Input files have to be generated from DESeq2 output files, by selecting differentially expressed genes or genes included in a co-expression module, together with their corresponding q-values (corrected p-values with false discovery rate). The commands work for separated files with specific list of genes and q-values, the user should change manually the input file names before running the commands. The GO term annotation reference file (GAF) can be downloaded from the Hymenoptera Genome Database (https://hymenoptera.elsiklab.missouri.edu/hgd-go-annotation), the GAF has to be adapted to topGO input requirements (see topGO vignette for such requirements: https://bioconductor.org/packages/release/bioc/html/topGO.html).
- 10_qPCR_analyzer_withLoops.R: R script that generates all qPCR analysis data, figures and tables. The input file is "data_qpcr_rqs.csv".
Access information
Data was derived from the following sources:
- Sequence data are available from NCBI sequence read archive (SRA) Bio Project PRJNA1071869 (https://dataview.ncbi.nlm.nih.gov/object/PRJNA1071869), under accession numbers SRR27842857 to SRR27842900.
- Reference genome can be downloaded from NCBI genome repository (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000188095.3/), both FASTA and GFF files.
- Gene Ontology term annotation file (GAF) can be downloaded from the Hymenoptera Genome Database (https://data.elsiklab.missouri.edu/downloads/hgd/HGD-GO-Annotation/gaf/Bombus_impatiens_HGD_go_annotation.gaf.gz)
Experimental design
Colonies from four North American bumble bee species were set in the Illinois State University facilities: Bombus impatiens (n = 7), Bombus occidentalis (n = 3), Bombus griseocollis (n = 4) and Bombus terricola (n = 2). Four microcolonies consisting of five adult workers and a larval brood clump were set from each parental colony. Each set of microcolonies from a source colony was exposed to one of four possible treatments:
- unexposed to imidacloprid and N. bombi (control)
- exposed to N. bombi, unexposed to imidacloprid
- exposed to imidacloprid, unexposed to N. bombi
- exposed to both imidacloprid and N. bombi
After 24 h of provisioning the microcolonies with untreated sugar water and pollen ad libitum, both were replaced with either control or imidacloprid (7 ppb) treated provisions for 72 h. After the first exposure event, provisions were again switched to untreated, and larvae were individually fed a 2 μL inoculum of either control or Nosema bombi (20,000 spores). After 24 h of the second exposure event, four larvae from each microcolony was flash frozen in liquid nitrogen and stored at -80 ºC.
RNA extraction
Total RNA was extracted from three larvae (leaving the fourth as backup) from each set of microcolonies with the E.Z.N.A. Total RNA Kit I (Omega Bio-tek), followoning manufacturers indications and adding a DNAse I treatment. Total RNA integrity was checked by running 1 μL in a 1% agarose gel, and quantity was assessed by analyzing 1 μL with a Qubit fluorometer.
RNA sequencing and bioinformatic analysis
For B. impatiens and B. occidentalis samples, RNA from the three larvae per microcolony was pooled in a single tube to obtain 1 μg of total RNA. Library contruction involved poly-A tail selection, followed by sequencing by a Illumina NovaSeq6000 sequencer. Single end raw reads are available in the SRA repository (NCBI) under the BioProject PRJNA1071869.
Raw reads were trimmed and cleaned from adapter sequences with Trimmomatic v0.38. FastQC was used to check read quality before and after the trimming. Trimmed reads were mapped to the B. impatiens genome v2.2 (GenBank accession number GCF_000188095.3) with STAR v2.7. Read counts were summarized per gene feature with htseq-count. Differential gene expression analysis was performed with DESeq2. Co-expression networks were built with WGCNA. Gene ontology term enrichment analysis was performed with topGO. The RNA-seq analysis pipeline and instructions on how to run it are described in the README file.
Quantitative PCR analysis
Primers for 20 differentially expressed genes were designed with NCBI's PrimerBLAST, using the available bumble bee genomes as templates, to test in all four bumble bee species. From 1 µg of RNA per sample, cDNA was synthesized with the iScript cDNA kit (BioRad). Quantitative qPCR was performed with the Luna qPCR reaction kit (NEBL) using 25 ng of cDNA per reaction as template. Reactions were run in a StepOne Thermocycler (Thermo Fisher Scientific), with the following conditions: hot start at 95 °C for 1 minute, then 40 cycles at 95 °C for 15 seconds and 60 °C for 30 seconds each, and finally a melt curve ranging from 60 °C to 95 °C with reads occurring every 0.3 °C. Stability of housekeeping genes was analyzed with GeNorm v3. Relative quantities were calculated following the Pfaffl method. Details on running the qPCR analysis are described in the README file.
