Spatial, climate and ploidy factors drive genomic diversity and resilience in the widespread grass Themeda triandra
Data files
Nov 30, 2020 version files 11.43 MB
Abstract
This data set was used to assess the climate resilience of Themeda triandra, a foundational species and the most widespread plant in Australia, by quantifying the relative contributions of spatial, environmental, and ploidy factors to contemporary genomic variation. Reduced-representation genome sequencing of 472 samples from 52 locations was used to test how the distribution of genomic variation, including ploidy polymorphism, supports adaptation to hotter and drier climates.
Methods
For reduced-representation library preparation and sequencing, genomic DNA from each individual was isolated from approximately 25 mg of silica-dried leaf tissue using the Stratec Invisorb DNA Plant HTS 96 kit (Invitek, Berlin, Germany). Libraries for each individual were created similarly to Ahrens et al. (2017). Briefly, extracted DNA was digested with PstI for genome complexity reduction, and ligated with a uniquely barcoded sequencing adapter pair. We then amplified each sample individually by PCR. Amplicons between 350 and 600 bp were selected from an agarose gel. The final library pool was sequenced on three Illumina NextSeq400 lanes using a 75 bp paired-end protocol on a high output flowcell at the Biomolecular Resources Facility at the Australian National University, generating approximately 864 million read pairs.
We checked the quality of the raw short reads with FastQC v0.10.1 (Andrews, 2010), then demultiplexed the reads by each sample’s unique combinatorial barcode using AXE v0.2.6 (Murray & Borevitz, 2018). During this step, 19% of the reads could not be assigned to a sample. Each read was trimmed to 64 bp while removing the barcodes. Reads were then quality-trimmed with Trimmomatic v0.38 (Bolger, Lohse, & Usadel, 2014) using a sliding window of 4 bp (the number of bases over which quality is averaged) and a required quality score of 15 (the average quality required within the window); reads were cut where the average quality dropped below 15. The long reads were indexed with the BWA index command (see Figure S2 for the distribution of length and number of reads sequenced). Short reads were aligned to the long reads, which allows more accurate SNP calling than a de novo pipeline. Alignment was done with BWA-MEM v0.7.17-r1198 (Li, 2013) in paired-end mode, and 82.5% of reads mapped successfully. Samtools v1.9 (Li et al., 2009) was used to convert the SAM files to BAM files for use in STACKS v2.41 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). The gstacks and populations programs were run, in that order, on the BAM files to create a VCF file. Only minimal thresholds were set at this stage (minor allele frequency = 0.01; one random SNP retained per read), leaving further filtering to R (R Core Development Team, 2019).
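The sliding-window rule described above (window of 4 bases, required average quality of 15) can be illustrated with a short Python sketch. This is a simplified illustration only, not Trimmomatic's own code: the function names are hypothetical, Phred+33 quality encoding is assumed, and the read is simply cut at the start of the first window whose mean quality falls below the threshold.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string into integer scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in qual_string]


def sliding_window_trim(seq, quals, window=4, min_avg=15):
    """Cut a read at the first window whose mean quality is below min_avg.

    Simplified sketch: the read is truncated at the start of the first
    failing window. Reads shorter than the window are judged as one window.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not quals:
        return seq, quals
    width = min(window, len(quals))
    for start in range(len(quals) - width + 1):
        chunk = quals[start:start + width]
        if sum(chunk) / width < min_avg:
            # Keep only the bases before the failing window.
            return seq[:start], quals[:start]
    return seq, quals


if __name__ == "__main__":
    read = "ACGTACGTAC"
    scores = phred33("IIIIII++++")  # six high-quality bases, then four low
    trimmed_seq, trimmed_quals = sliding_window_trim(read, scores)
    print(trimmed_seq, trimmed_quals)
```

In this toy read the window only fails once it consists entirely of low-quality bases, so the cut falls where the quality drop begins.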
The missing-data threshold was set to 50% per locus and per individual, which resulted in an average of 30% missing data across the whole SNP data set. The minor allele frequency threshold was set to 0.05 to avoid identifying patterns of population structure that may be due to locally shared alleles.
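As a rough illustration of these filters, the Python sketch below applies a 50% missing-data cutoff and a 0.05 minor allele frequency cutoff to a small 012 genotype matrix. The original filtering was done in R; the function names, the order of the steps (loci, then individuals, then allele frequency), the inclusive cutoffs, and the diploid allele-count assumption are all assumptions made for this example. Missing genotypes are coded 9, as in the 012 file.

```python
MISSING = 9  # missing-genotype code used in the 012 coding


def missing_fraction(values):
    """Fraction of entries equal to the missing code (1.0 for an empty list)."""
    if not values:
        return 1.0
    return sum(v == MISSING for v in values) / len(values)


def minor_allele_frequency(column):
    """Minor allele frequency from 0/1/2 counts, assuming diploid calls."""
    called = [g for g in column if g != MISSING]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def filter_genotypes(matrix, max_missing=0.5, min_maf=0.05):
    """Return (kept individual indices, kept locus indices).

    matrix is a list of rows (individuals), each a list of genotypes (loci).
    Assumed order: drop loci, then individuals, then low-frequency loci.
    """
    n_loci = len(matrix[0]) if matrix else 0
    loci = [j for j in range(n_loci)
            if missing_fraction([row[j] for row in matrix]) <= max_missing]
    inds = [i for i, row in enumerate(matrix)
            if missing_fraction([row[j] for j in loci]) <= max_missing]
    loci = [j for j in loci
            if minor_allele_frequency([matrix[i][j] for i in inds]) >= min_maf]
    return inds, loci


if __name__ == "__main__":
    toy = [
        [0, 1, 9, 0],
        [1, 2, 9, 0],
        [2, 0, 9, 0],
        [9, 9, 9, 9],
    ]
    print(filter_genotypes(toy))
```

In the toy matrix, the third locus and the fourth individual are entirely missing, and the fourth locus is monomorphic, so only the first three individuals and the first two loci remain.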
Usage notes
.lfmm file for snmf analysis: the 012 file represents the count of the minor allele, with missing values = 9. The individual key file is: themea_52_012v2_indkey.key
genepop file provided for import as a genind object; missing values = 0000.
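For users who want to inspect the .lfmm file outside R, the following Python sketch parses 012-coded text and treats 9 as missing. It assumes the usual LFMM layout of one row per individual with whitespace-separated genotypes and no header; the function names are hypothetical, and the layout of the key file is not assumed here.

```python
def parse_lfmm(text, missing_code=9):
    """Parse 012-coded genotype text into rows of 0/1/2, with None for missing.

    Assumes one individual per line and whitespace-separated integer genotypes.
    """
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = []
        for field in fields:
            value = int(field)
            if value == missing_code:
                row.append(None)
            elif value in (0, 1, 2):
                row.append(value)
            else:
                raise ValueError(f"unexpected genotype {value} on line {line_no}")
        if rows and len(row) != len(rows[0]):
            raise ValueError(f"line {line_no} has a different number of loci")
        rows.append(row)
    return rows


def missing_rate(rows):
    """Overall fraction of missing genotypes in the parsed matrix."""
    total = sum(len(row) for row in rows)
    missing = sum(value is None for row in rows for value in row)
    return missing / total if total else 0.0


if __name__ == "__main__":
    example = "0 1 2 9\n9 9 0 1\n"
    genotypes = parse_lfmm(example)
    print(genotypes, missing_rate(genotypes))
```

To read the real file, pass its contents to parse_lfmm, for example with open(path).read(), where path points to the downloaded .lfmm file.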