Skip to main content

Genomic analyses of phenotypic differences between native and invasive populations of diffuse knapweed (Centaurea diffusa)

Cite this dataset

Turner, Kathryn; Ostevik, Kate; Grassa, Christopher; Rieseberg, Loren (2020). Genomic analyses of phenotypic differences between native and invasive populations of diffuse knapweed (Centaurea diffusa) [Dataset]. Dryad.


Invasive species represent excellent opportunities to study the evolutionary potential of traits important to success in novel environments. Although some ecologically-important traits have been identified in invasive species, little is typically known about the genetic mechanisms that underlie invasion success in non-model species. Here, we use a genome-wide association (GWAS) approach to identify the genetic basis of trait variation in the non-model, invasive, diffuse knapweed (Centaurea diffusa Lam. [Asteraceae]). To assist with this analysis, we have assembled the first draft genome reference and fully annotated plastome assembly for this species, and the one of the first from this large, weedy, genus, which is of major ecological and economic importance. We collected phenotype data from 372 individuals from four native and four invasive populations of C. diffusa grown in a common environment. Using these individuals, we produced reduced-representation genotype-by-sequencing (GBS) libraries and identified 7058 SNPs. We identify two SNPs associated with leaf width in these populations, a trait which significantly varies between native and invasive populations. In this rosette forming species, increased leaf width is a major component of increased biomass, a common trait in invasive plants correlated with increased fitness. Finally, we use annotations from Arabidopsis thaliana to identify 98 candidate genes that are near the associated SNPs and highlight several good candidates for leaf width variation.


Published in Turner, K. G., Ostevik, K. L., Grassa, C. J., & Rieseberg, L. H. 2020. Genomic analyses of phenotypic differences between native and invasive populations of diffuse knapweed (Centaurea diffusa). Frontiers in Ecology and Evolution (in press), 8. doi: 10.3389/fevo.2020.577635

Draft reference genome assembly

Seed was collected from an individual in the native range of C. diffusa (TR001-1; Table S1). Seed from this collection was grown in a greenhouse at the University of British Columbia in 2009. Young leaf tissue was sampled from one progeny (TR001-1L) and stored at -80° C to be used for draft reference genome assembly. DNA was extracted from frozen tissue using a modified Qiagen DNeasy column-less protocol (Qiagen, Valencia, CA, USA). Concentration and quality were verified by Nanodrop, Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA), and gel electrophoresis. A whole genome shotgun library for this individual was produced and sequenced using Illumina HiSeq 2000 100 bp paired-end sequencing (Genome Quebec). This produced ~69 million reads and ~14 Gbp, for a sequencing depth of ~16X. Raw reads were quality trimmed and screened for sequencing artifacts using Trimmomatic (Bolger et al., 2014).

To produce the plastome assembly, clean reads were aligned to the Lactuca sativa plastome (Timme et al., 2007) using BWA (Li & Durbin, 2009). Pairs in which both reads aligned to the L. sativa plastome were extracted from the SAM files with Picard Tools SamToFastq.jar (Picard 2009). ALLPATHS-LG (Gnerre et al., 2011) was used to merge overlapping pairs and error-correct the data, which was then assembled with Ray (Boisvert et al., 2010). Ray contigs were aligned to the L. sativa plastome with BLAST+ (Altschul et al., 1990) and scaffolded based on synteny using OSLay (Richter et al., 2007). Gaps were filled with GapFiller (Boetzer & Pirovano, 2012) resulting in a sequence containing a single ‘N’. Visual inspection indicated that the ‘N’ separated an erroneous tandem duplication, which was corrected by hand with Vim. The boundaries of the inverted repeat (IR) region were confirmed with aTRAM (Allen et al., 2015) assemblies of flanking regions. Reads were aligned to the assembly with BWA and sorted with SAMTOOLS (Li et al., 2009). Visual inspection of the alignment revealed a few small indel and substitution errors, which were hand-corrected with Vim. A final alignment and inspection revealed no errors. The plastome was annotated using DOGMA (Wyman et al., 2004) and validated for NCBI GenBank submission using Sequin 13.05. We used the blastn program (Altschul et al., 1997) to compare the gene content and sequence identity of the C. diffusa and Cynara cardunculus (GenBank: KP842711.1; Curci et al., 2016) plastid genomes. In order to calculate dN, dS, and dN/dS in protein coding regions, we made multiple alignments of the amino acid sequences using Muscle v3.8.31 (Edgar, 2004) and back translated them to codons (Grassa & Kulathinal, 2011). Lactuca sativa plastid genes (GenBank: NC_007578.1; Kanamoto et al., 2004) were also included in the analysis of protein-coding regions. We used the yn00 method (Yang & Nielsen, 2000) in PAML v4.8a (Yang, 2007) to calculate dN, dS, and dN/dS values on the aligned codons.

An additional assembly was made to target the nuclear genome albeit at low depth. Unfiltered trimmed reads were assembled with Megahit (Li et al., 2015). Megahit contigs were repeat masked using Red (Girgis, 2015) and aligned to the Globe Artichoke reference genome (Scaglione et al., 2015) using Blastn (Altschul et al., 1990). These alignments were used as input to Chromosomer (Tamazian et al., 2016), which scaffolds contigs based on the order of homologous regions in a more completely assembled reference genome. Unplaced Megahit contigs were added to the output of Chromosomer to make a reference nuclear sequence. Finally, we polished the pseudomolecules with six iterations of Pilon (Walker et al., 2014) to fix local errors.

The reference nuclear sequence was annotated for putative protein-coding genes using three types of evidence: ab initio Hidden Markov Model-based predictions, protein homology, and expressed sequence tags. Ab initio predictions were made using Augustus (Stanke et al., 2006) with the gene model trained for tomato (the phylogenetically closest model available at the time of analysis). Viridiplantae proteins were downloaded from RefSeq (O'Leary et al., 2016) and clustered at 90% identity with CD-HIT (Fu et al., 2012). Representative peptide sequences were first aligned to the nuclear reference with the Diamond (Buchfink et al., 2015) in sensitive mode to quickly identify candidate regions with sequence homology to known proteins. Peptides were realigned to the nuclear sequence at candidate loci using the more sensitive, but longer running, AAT (Huang et al., 1997). Assembled ESTs from a native (TR001-1L; the same individual used for this reference assembly) and invasive (US022-31E) genotype (Lai et al., 2012) were aligned with GMAP (Wu et al., 2016). The three lines of evidence were combined using Evidence Modeler (Haas et al., 2008) to generate the gene coordinates.

Genome-wide association study

Seeds were collected as part of a broad collaborative effort from four native range and four invaded range C. diffusa populations (Fig 1; Table S1). Field collection took place between 2006 – 2008. Because maternal environmental effects can have strong, even adaptive, impacts on offspring phenotype in some systems (Galloway, 2005), we used individuals grown for two generations in a common environment to reduce maternal effects in this study. Specifically, field collected seed from these populations were raised in a greenhouse and crossed within populations (as previously reported in Turner et al., 2014). The seeds produced by those crosses were raised in a benign control environment, provided with ample resources, and phenotyped (as reported in Turner et al., 2014). Briefly, the 381 C. diffusa individuals included in this study were phenotyped as follows: Basal leaf number and length and width of the longest basal leaf for each individual was assessed three times during the experiment: beginning 7 wks after germination (at the mean 15 leaf stage), 10 wks after germination, and at harvest, and the maximum value was noted. At the end of the experiment, 19-20 wks after germination, the diameter at the interface between root and stem was assessed. In addition, two or three young leaves were sampled for DNA extraction and stored at -80 C, and one mature leaf was harvested to measure specific leaf area (SLA). Leaves for SLA were scanned and area measured using ImageJ 1.45s (Rasband, 2011) then dried at 29° C and weighed. The rest of the above ground biomass was harvested, dried, and weighed. These phenotypes were summarized into the following values for use in the GWAS analysis: (1) maximum leaf count, (2) maximum leaf width, (3) maximum leaf length, (4) root crown diameter, (5) shoot mass, and (6) SLA.

To genotype individuals, DNA was extracted from young leaves using a Qiagen DNeasy 96 Plant Kit (Qiagen, Valencia, CA, USA) and assessed for quantity using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). High quality DNA from 381 individuals was used to produce PstI-MspI genotype-by-sequencing libraries using a modified version of the protocol described in Poland et al. (2012). Briefly, we treated genomic DNA with HF-PstI and MspI at 37 °C for 5 hours. We ligated barcoded adaptors and barcoded common Y-shaped adaptors (Poland et al., 2012) to digested DNA at 22 °C for 3 hours. We pooled groups of 96 ligated samples, which were then purified using the Agencourt AMPure XP system and amplified by PCR using KAPPA HiFi Hotstart master mix. Finally, we selected DNA fragments between 300-500 bp using the Agencourt AMPure XP system for paired-end 100bp sequencing on an Illumina HiSeq 2000 platform. Sequencing runs resulted in ~402 million reads passing initial quality control at the sequencing facility, with an average quality score of 36.

Raw sequencing reads were de-multiplexed and data for each individual from all lanes were concatenated. Single nucleotide polymorphisms were called against the assembled reference sequence using dDocent v2.6.0 (Puritz et al., 2014). dDocent combines several existing software packages into a single pipeline designed specifically for paired-end GBS/RAD data and takes advantage of both forward and reverse reads for SNP discovery. SNPs were called based on a larger set of Centaurea individuals, including the 381 GWAS set and 54 individuals from additional populations. Then, using the individuals in this GWAS study only, SNPs were filtered for quality and missing data. The resulting variant call file (VCF) was filtered using vcftools v0.1.15 (Danecek et al., 2011) to a minimum quality score of 30, a minor allele frequency (MAF) of 5%, retaining individuals with no more than 40% missing data, and sites missing in no more than 10% of individuals with a minimum mean depth value greater than 20. Complex variants (multinucleotide polymorphisms and composite insertions and substitutions) were decomposed into SNP and indel representation following Puritz et al. (2014), retaining only one biallelic SNP per locus. Further filtering steps were performed to remove SNPs likely to be the result of sequencing errors, paralogs, multicopy loci or artifacts of library preparation (Methods S1). We concatenated SNPs found on unplaced scaffolds onto a single contig for analysis, and, because most GWAS analyses do not tolerate missing data, we replaced unknown genotypes with heterozygous calls. Using only SNPs without any missing data substantially decreases the number of SNPs that we could test but did not affect our results. The final data set used in downstream analyses consisted of 7058 SNPs found in 372 individuals from eight populations represented by 6-169 individuals (Fig. 1; Table S1).

We used compressed mixed linear models (Zhang et al., 2010) implemented in GAPIT (Lipka et al., 2012) to test for associations between SNPs and our phenotypes. To avoid spurious associations and account for relatedness between our individuals, our models included a kinship matrix, which was calculated using the default settings for the VanRaden (2008) algorithm. This kinship matrix was validated against the known pedigree for these individuals from the greenhouse crossing design, and our results were not affected by using alternative ways to account for relatedness. To further account for population structure in our population, we used the Bayesian information criterion (BIC)-based model selection procedure implemented in GAPIT to determine the optimal number of principal components to include in our models (up to a maximum of 30) in addition to the calculated kinship matrix.

We searched for genes near significant GWAS hits to gain insight as to what the genetic basis of these traits might be. For each significant GWAS hit we focused on genes located within 500kb in both the upstream and downstream direction; this marked the boundaries for the gene search. Genes identified in this way were functionally annotated based on sequence similarity to Arabidopsis thaliana using all-against-all BLASTP reciprocal best hits to ESTs in the TAIR 10 database (TAIR, 2020; Berardini et al. 2015). These TAIR 10 loci and associated Gene Ontology terms are reported.

Finally, we put these results in the context of previous work on invasive candidate traits in this system. We determined whether significant GWAS hits were associated with traits that significantly varied between individuals from the native and invasive ranges in previous studies (Turner et al., 2014) and may be candidate invasiveness traits. We used linear mixed effect models in the R package lme4 (R version 3.5; R Core Team, 2018; Bates et al., 2014) and Fisher’s exact tests to interrogate the relationships between range (native or invasive), genotypes, and traits. We further assessed overlap between SNPs and genes identified here and previous transcriptomic and gene expression work in this system (Lai et al., 2012; Hodgins et al., 2015; Turner et al., 2017).

Usage notes

centaurea_final_assembly.tgz - Draft genome assembly and annotation of Centaurea diffusa individual, TR001_1L, from seed collected in Turkey (native range) in 2008. - Master script describing SNP calling and filtering.

GWAS_KO.txt - List of individuals used in SNP calling and GWAS.

CdiffSNPs_GWAS_10_filt.recode.vcf.gz - vcf file used for GWAS.

GWAS_code.R - Master script describing GWAS analyses.

Cdiff_pheno.txt - individual phenotype data used in GWAS analyses.

CdiffGBS_GWAS_supmat_final.pdf - Supplementary figures and tables supporting Turner, K. G., Ostevik, K. L., Grassa, C. J., & Rieseberg, L. H. 2020. Genomic analyses of phenotypic differences between native and invasive populations of diffuse knapweed (Centaurea diffusa). Frontiers in Ecology and Evolution (in press), 8. doi: 10.3389/fevo.2020.577635

TableS3_PlastomeComparisons.xlsx - Plastome comparisons. Substitution rates for protein coding genes in pairwise comparisons of Centaurea, Cynara, and Lactuca plastid genomes using the method of Yang & Nielsen, 2000. The following rates are reported for each sequence pair:
S -- Number of synonymous sites
N -- Number of nonsynonymous sites
t -- Number of nucleotide substitutions per codon
κ -- transition/transversion rate ratio
ω (=dN/dS) -- Nonsynonymous/synonymous rate ratio
dN (+-SE) -- Number of nonsynonymous substitutions per nonsynonymous site
dS (+-SE) -- Number of synonymous substitutions per synonymous site

TableS4_candidateGeneWindow.xlsx - Annotated region of Centarea diffusa Chromosome TR001.Ccrd12 within 500kb of significant GWAS hits (gene containing two significant SNPs, TR001.Ccrd12.1184, in bold). Some genes included in this table have been investigated in previous studies of this species. For the ‘Previous_Findings’, 1 gene had a marginal effect of range on expression (‘Marginal_DE_range’), 19 genes were differentially expressed between control and drought treatments (‘DE_drought’), and 20 were not differentially expressed (‘non_DE’; Turner et al. 2017). Two genes were identified by Hodgins et al. (2015) as rapidly evolving among weedy species of the Carduoideae subfamily using stochastic branch-site models for positive selection (‘positive_selection’).


National Science Foundation, Award: 1523842

National Science Foundation, Award: Idaho EPSCoR OIA-1757324

Natural Sciences and Engineering Research Council, Award: 516658

Natural Sciences and Engineering Research Council, Award: 327475

Natural Sciences and Engineering Research Council, Award: 353026