Skip to main content

Transcriptome analysis of invasive Gypsophila paniculata (baby's breath) populations from Michigan and Washington, USA.

Cite this dataset

Partridge, Charlyn; Lamar, Sarah; Beddows, Ian (2020). Transcriptome analysis of invasive Gypsophila paniculata (baby's breath) populations from Michigan and Washington, USA. [Dataset]. Dryad.


Invasive species provide an opportune system to investigate how populations respond to new or changing environments. While the impacts of invasive species increase annually, many gaps in our understanding of how these species invade, adapt, and thrive in the areas they are introduced to remain. Using the perennial forb Gypsophila paniculata as a study system, we aimed to investigate how invasive species respond to different environments. Baby’s breath (Gypsophila paniculata) was introduced to North America in the late 1800’s and has since spread throughout the northwestern United States and western Canada. We used an RNA-seq approach to explore how molecular processes may be contributing to the success of invasive G. paniculata populations that are thought to share similar genetic backgrounds across distinct habitats.  Transcription profiles were constructed for root, stem, and leaf tissue from seedlings collected from a sand dune ecosystem in Petoskey, MI (PSMI) and a sagebrush ecosystem in Chelan, WA (CHWA). Using these data we assessed differential gene expression between the two populations and identified SNPs within differentially expressed genes. We identified 1,146 transcripts that were differentially expressed across all tissues between the two populations. GO processes enriched by genes displaying higher expression in PSMI were associated with increased nutrient starvation, while enriched processes in CHWA were associated with abiotic stress. Only 7.4% of the differentially expressed genes across all three tissues contained SNPs differing in allele frequencies of at least 0.5 between the populations. In addition, common garden studies found the two populations differed in germination rate and seedling emergence success, but not in above- and below-ground tissue allocation. Our results suggest that the success of invasive G. paniculata across these two environments is likely the result of plasticity in molecular processes responding to different environmental conditions, although some genetic divergence may also be contributing to these differences.


RNA Extraction. We collected 16 G. paniculata seedlings from CHWA (June 8, 2018) and 15 seedlings from PSMI (June 1, 2018). We then dissected seedlings into three tissue types (root, stem, and leaf), placed tissue in RNAlater™ (Thermo Fisher Scientific, Waltham, MA), and flash-froze them in an ethanol and dry ice bath. We extracted total RNA from frozen tissue using a standard TRIzol® (Thermo Fisher Scientific) extraction protocol ( We resuspended the extracted RNA pellet in DNase/RNase free water. The samples were then treated with DNase to remove any residual DNA using a DNA-Free Kit (Invitrogen, Carlsbad, CA). We assessed RNA quality with a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA) and NanoDrop™ 2000 (Thermo Fisher Scientific). RNA Integrity Number (RIN) values for individuals used in this study ranged from 6.1-8.3. However, because both chloroplast and mitochondrial rRNA can artificially deflate RIN values in plant leaf tissuewe deemed these values to be sufficient for further analysis based upon visualization of the 18S and 28S fragment peaks (see Babu & Gassmann, 2016). This resulted in high quality total RNA from 10 PSMI leaf, 10 PSMI stem, 10 PSMI root, 10 CHWA leaf, 9 CHWA stem, and 10 CHWA root samples.

cDNA Library Construction and Sequencing. Prior to sequencing, all samples were treated with a Ribo-Zero rRNA Removal Kit (Illumina, San Diego, CA). cDNA libraries were constructed using the Collibri Stranded Library Prep Kit (Thermo Fisher Scientific) before being sequenced on a NovaSeq 6000 (Illumina) using S1 and S2 flow cells. Sequencing was performed using a 2 x 100 bp paired-end read format and produced approximately 60 million reads per sample, with 94% of reads having a Q-score >30.

Transcriptome Assembly. Prior to transcriptome assembly, read quality was assessed using FastQC ( Adapters and bases with a quality score less than 20 were first removed from the raw reads using Trim Galore ( Next, rRNAs were identified using SortMeRNA (mean rRNA percent content of 5.31%) (Kopylova, Noé, & Touzet, 2012). A reference transcriptome was then assembled de novo using non-rRNA reads from all samples and Trinity v2.8.2 (Grabherr et al. 2011; Haas et al. 2013), with a normalized max read coverage of 100, a minimum k-mer coverage of 10, and k-mer size set to 32. The assembled transcriptome was annotated using Trinotate v3.1.1. Trinotate was given open reading frames (ORFs) predicted from TransDecoder and transcript homology (blastx and blastp) to the manually curated UniProt database (Bryant et al., 2017). The final assembly consisted of 223,810 putative genes and 474,313 putative transcripts (N50 = 3,121) from the 59 samples.

Differential Expression. To quantify transcript expression, reads were mapped back to the assembly using bowtie and quantified using the RSEM method as implemented in Trinity. Counts were generated for genes and transcripts. We then tested for differential gene expression using edgeR v3.22.5 in R v3.5.2 (Robinson, McCarthy, & Smyth, 2010; R Development Core Team, 2017). First, however, the count data was filtered and only transcripts with greater than 10 counts in at least 10 samples were included. Following filtering, 111,042 genes (49.61%) and 188,108 transcripts (39.66%) remained. Considering tissue type, 127,591 transcripts remained in the data from 20 root samples (26.90%), 125,261 transcripts remained in the 19 stem tissue samples (26.41%), and 112,499 transcripts remained in the 20 leaf tissue samples (23.72%). For differential expression testing, the data were stratified by tissue and filtered transcripts were then fit to the negative binomial (NB) model and tested using the quasi-likelihood F test with TMM (trimmed mean of M values) normalization. To be considered significantly differentially expressed, transcripts needed to have an adjusted p-value (BH method) below 0.05 and a log2 fold change greater than 2. For transcripts that were differentially expressed, we identified Gene Ontology (GO) biological processes that were either over- or under-represented using the PANTHER classification system v14.1, where transcripts were assessed against the Arabidopsis thaliana database ( In addition, for those transcripts that were differentially expressed across all three tissues, we converted the UniProt IDs of the transcripts to GO biological process IDs using the online database bioDBnet (, and used the metacoder package v0.3.3 (Foster, Sharpton, Grünwald, 2017) in R v3.6.0 to construct heat trees to visualize the relationship of our differentially expressed transcripts across GO biological process hierarchies.

Single Nucleotide Polymorphism (SNP) Variant Calling. We used the HaplotypeCaller tool from GATK4 to identify potential SNPs that were present in transcripts that were differentially expressed between populations (McKenna et al., 2010; DePristo et al. 2011). The bowtie mapped files were used to jointly genotype all 59 samples simultaneously with a minimum base quality and mapping quality of 30. Variant data was visualized using the vcfR package v1.8.0 (Knaus & Grünwald 2017). We identified variants associated with non-synonymous SNPs, synonymous SNPs, 5’ and 3’ UTR SNPs, 5’ and 3’ UTR indels, frame-shift and in-frame indels, premature or changes in stop codons and changes in start codons, and calculated population diversity estimates for all SNP types. The effect prediction was done using custom scripts (which can be found in the Dryad repository) and the Transdecoder predicted annotation in conjunction with the base change. We set a hard filter for the SNPs so that only those with QD scores > 2, MQ scores > 50, SOR scores < 3, and Read Post Rank Sums between -5 and 3 passed. We then calculated the allele frequencies for each SNP within PSMI and CHWA. For the subsequent evaluation, we focused on SNPs that had potential functional effects (i.e., they were not listed as ‘synonymous’ or ‘unclassified’), were in transcripts differentially expressed between PSMI and CHWA across all three tissues, and that exhibited differences in SNP allele frequencies between the populations by at least 0.5. We used the R package metacoder v0.3.3 to visualize the GO biological process hierarchies associated with transcripts containing these SNPs.

Germination Trial. On August 11, 2018 we returned to our sample sites in CHWA and PSMI and collected seeds from 20 plants per location. This date was chosen because Rice, Martínez-Oquendo, & McNair (2019) previously determined that this collection time can yield over 90% seed germination for G. paniculata collected from Empire, MI. To collect seeds, we manually broke seed pods off and placed them inside paper envelopes in bags half-filled with silica beads. We stored bags in the dark at 20 to 23˚C until the germination trial began one month later, We counted one hundred seeds from twenty plants per population and placed them in a petri dish lined with filter paper (n = 2,000 seeds per population). We established a control dish using 100 seeds from the ‘Early Snowball’ commercial cultivar (G. paniculata) sold by W. Atlee Burpee & Co in 2018, known to have germination percentages in excess of 90%. Incubators had a 12:12h dark:light photoperiod and growth chamber conditions were set at 20˚C with 114 μmol m-2 s-1 photosynthetically active radiation from fluorescent light bulbs. Each day we randomized petri dish locations within the incubator to avoid bias in temperature or light regimes. We conducted this study for fourteen days, at which point there had been no germination in any dish for two days. The same individual checked all seeds (n=4,100) daily within the same three-hour time window to minimize bias for germination. Once a seed had germinated, we removed it from the dish (method adapted from Rice et al., 2019). Using the statistical program R v3.6.0, we fit the data to a nonparametric Kaplan-Meier time-to-event curve (McNair, Sunkara, & Frobish, 2012; R Development Core Team, 2017). We then compared germination patterns between CHWA and PSMI using a pairwise log-rank test (McNair et al., 2012). To test for homogeneity within localities, we again conducted a log-rank test. Finally, to investigate the presence of family effects (i.e., differences among seeds from different parental plants), we ran a series of pairwise log-rank tests with a Holm correction for multiple comparisons (McNair et al., 2012). For all analyses in this study, we set the alpha level to 0.05.

Growth Trials. To examine population differences on seedling emergence success and above- and below-ground tissue allocation, we planted 6 seeds collected from 20 individual plants per population (n = 120 per population, n = 240 total). All seeds were planted on the same day to a standardized depth of 5 mm in a sand/potting soil mixture. Greenhouse conditions were set at 7:17 h dark:light photoperiod. Relative humidity and temperature settings during the day were 55% and 21˚C while nighttime conditions were 60% and 15.5˚C. Each day we watered plants until the soil appeared fully wet and we randomized plant position to prevent bias in temperature, light, or water regime. At the end of the seven-week trial period, we carefully removed plants from the soil and measured the length of tissue above and below the caudex using a caliper. To compare the proportion of seedlings that successfully emerged between the populations, we ran a two-sided proportion test in the R statistical program v3.6.0. We analyzed differences in the ratio of above- and below-ground tissue between populations for seedlings that successfully emerged and examined the presence or absence of family effects using a completely randomized design with subsampling ANOVA in SAS v9.4 (SAS Institute Inc., 2013).





Usage notes

File: Metadata-699228-processed-ok.tsv. Contains: metadata for the associated BioProject (Accession #PRJNA606240).

File: RSEM.isoform.counts.matrix.wAnnot. Contains: Annotated count file for each isoform 

File: import_VCF_and_get_allele_frequencies_by_population.R Contains: R code to inport the VCF file and cacluate allele frequencies

File: intersects.00_09.vcf.gz Contains: File containing SNPs of genes that were found to be differentially expressed across all three tissues

File: PARC_20180710_RNA_Normalization_DE_Analysis.Rmd Contains: R markdown file for rRNA identification, data normalization, and DE analysis

File: Metadata.txt Contains: metadata file containing sample name, population, and tissue type. 

File: Gerimation_Obs_Table.csv Contains: Raw data from the germination success portion of the experiment. It includes number of seeds germinated per day for each parent plant over the 14 day experimental period. 

File: Above_Below_RawData_Formated ContainsRaw data from the above- and below ground tissue allocation and emergence success portion of the experiment. 

File: Contains: Custom code to identify indel effects in differentially expressed trancripts

File: Contains: Custom code to identify SNP effects in differentially expressed trancripts

File: Contains Custom code to identify SNP positions in differentially expressed trancripts


Thermo Fisher Scientific (United States)

Michigan Botanical Society

Grand Valley State University

Environmental Protection Agency, Award: 00E01934

Michigan Botanical Society