Data and code for: Species-specific effects of production practices on genetic diversity in plant reintroduction programs
Data files
Dec 06, 2023 version files 29.71 MB
- Code.R
- lan.annotated.gen
- LAN.annotated.PerChang.csv
- lan.annotated.vcf
- lan.blast.alignment.txt
- lan.filtered.fa
- LAN.neutral.PerChang.csv
- lan.neutral.vcf
- lan.neutralsnps.gen
- ped.annotated.gen
- PED.annotated.PerChang.csv
- ped.annotated.vcf
- ped.blast.alignment.txt
- ped.filtered.fa
- PED.neutral.PerChang.csv
- ped.neutral.vcf
- ped.neutralsnps.gen
- README.md
- sag.annotated.gen
- SAG.annotated.PerChang.csv
- sag.annotated.vcf
- sag.blast.alignment.txt
- sag.filtered.fa
- SAG.neutral.PerChang.csv
- sag.neutral.vcf
- sag.neutralsnps.gen
Abstract
Plant production practices can influence the genetic diversity of cultivated plant materials and, ultimately, their potential to adapt to a reintroduction site. A common step in the plant production process is the application of seed pre-treatment to alleviate physiological seed dormancy and successfully germinate seeds. In production settings, the seeds that germinate more rapidly may be favored in order to fill plant quotas. In this study, we investigated how the application of cold-moist stratification treatments with different durations can lead to differences in the genetic diversity of the propagated plant materials. Specifically, we exposed seeds of three Viola species to two different cold stratification durations, and then we analyzed the genetic diversity of the resulting subpopulations through double-digestion restriction site-associated sequencing (ddRADseq). Our results show that, in two out of three species, utilizing a short stratification period will decrease the genetic diversity of neutral and expressed loci, likely due to the imposition of a genetic bottleneck and artificial selection. We conclude that, in some species, the use of minimal stratification practices in production may jeopardize the adaptive potential and long-term persistence of reintroduced populations and suggest that practitioners carefully consider the evolutionary implications of their production protocols. We highlight the need to consider the germination ecology of target species when selecting the length of dormancy-breaking pre-treatments.
README
This README file was generated on 2023-09-06 by Zoe Diaz-Martin.
GENERAL INFORMATION
- Title of Dataset: Data and code for: Species-specific effects of production practices on genetic diversity in plant reintroduction programs
- Author Information
  A. Corresponding Author Contact Information
     Name: Zoe Diaz-Martin
     Institution: Spelman College
     Address: Atlanta, GA USA
     Email: zoediazmartin@spelman.edu
SHARING/ACCESS INFORMATION
- Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
- Links to publications that cite or use the data:
Diaz-Martin, Zoe, De Vitis, Marcello, et al. Species-specific effects of production practices on genetic diversity in plant reintroduction programs. Evolutionary Applications.
- Links to other publicly accessible locations of the data: None
- Links/relationships to ancillary data sets: None
- Was data derived from another source? No
  A. If yes, list source(s): NA
- Recommended citation for this dataset:
Diaz-Martin, Zoe, De Vitis, Marcello, et al. (Forthcoming 2023). Data and Code for: Species-specific effects of production practices on genetic diversity in plant reintroduction programs [Dataset]. Dryad.
DATA & FILE OVERVIEW
- File List:
A) *.annotated.gen
B) *.annotated.PerChang.csv
C) *.annotated.vcf
D) *.blast.alignment.txt
E) *.filtered.fa
F) *.neutral.PerChang.csv
G) *.neutral.vcf
H) *.neutralsnps.gen
- Relationship between files, if important: * is a placeholder for the species code, so each species has one file of each type listed above. The species codes are: PED = Viola pedatifida; SAG = Viola sagittata; LAN = Viola lanceolata. The .vcf files contain the quality-filtered data and were used to generate the data in the other files.
- Additional related data collected that was not included in the current data package: We have not included the raw, unprocessed data files.
- Are there multiple versions of the dataset? No
  A. If yes, name of file(s) that was updated: NA
     i. Why was the file updated? NA
     ii. When was the file updated? NA
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.annotated.gen
This is a GenePop formatted file for the quality filtered single nucleotide polymorphisms (SNPs) that had a match in the NCBI database.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.annotated.PerChang.csv
- Number of variables: 3
- Number of cases/rows: PED = 475; SAG = 312; LAN = 197
- Variable List:
- Perchang: A categorical assignment of the direction of the percent change in gene diversity (Hs) between paired annotated loci, from short to long. Categorical codes: Pos = Positive; Sig = Significantly positive; DEC = Decrease; NegZ = No change
- Short: Gene diversity (Hs) for locus in short subpopulation
- Long: Gene diversity (Hs) for locus in long subpopulation
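As a minimal sketch of how a *.PerChang.csv table could be summarized outside of R, the Python snippet below tallies the Perchang codes and averages the Short and Long gene diversity values. The column names Perchang, Short, and Long come from the variable list above, but the exact header spelling in the files is an assumption, and the helper name `summarize_perchang` and the toy rows are hypothetical.

```python
import csv
import io
from collections import Counter
from statistics import mean


def summarize_perchang(handle):
    """Tally Perchang codes and average Hs per subpopulation.

    Assumes a header row with columns named Perchang, Short, and Long
    (names taken from the README; the real header may differ).
    """
    rows = list(csv.DictReader(handle))
    counts = Counter(row["Perchang"] for row in rows)
    mean_short = mean(float(row["Short"]) for row in rows)
    mean_long = mean(float(row["Long"]) for row in rows)
    return counts, mean_short, mean_long


# Toy rows standing in for a file such as PED.annotated.PerChang.csv.
toy = io.StringIO(
    "Perchang,Short,Long\n"
    "Pos,0.10,0.20\n"
    "DEC,0.30,0.10\n"
    "Sig,0.00,0.30\n"
    "NegZ,0.20,0.20\n"
)
counts, mean_short, mean_long = summarize_perchang(toy)
print(dict(counts), round(mean_short, 3), round(mean_long, 3))
```

To use it on a real file, pass an open handle, for example `open("PED.annotated.PerChang.csv", newline="")`, in place of the toy buffer.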
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.annotated.vcf
This is a .vcf formatted file for the quality filtered single nucleotide polymorphisms (SNPs) that had a match in the NCBI database.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.blast.alignment.txt
Output from MegaBlast alignment for all loci.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.filtered.fa
Fasta formatted file for all quality filtered loci. This is the input file for MegaBlast.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.neutral.PerChang.csv
- Number of variables: 3
- Number of cases/rows: PED = 2821; SAG = 1904; LAN = 1355
- Variable List:
- Perchang: A categorical assignment of the direction of the percent change in gene diversity (Hs) between paired neutral loci, from short to long. Categorical codes: Pos = Positive; Sig = Significantly positive; DEC = Decrease; NegZ = No change
- Short: Gene diversity (Hs) for locus in short subpopulation
- Long: Gene diversity (Hs) for locus in long subpopulation
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.neutral.vcf
This is a .vcf formatted file for the quality filtered single nucleotide polymorphisms (SNPs) that did not have a match in the NCBI database.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: *.neutralsnps.gen
This is a GenePop formatted file for the quality filtered single nucleotide polymorphisms (SNPs) that did not have a match in the NCBI database.
CODE / SOFTWARE
Code.R: This is an R script. It runs the analyses used to test for differences in measures of genetic diversity within species between the short and long subpopulations.
The script also generates the figures in the manuscript.
USAGE NOTES
The files containing genetic data in .vcf format can be viewed using the Integrative Genomics Viewer (IGV; https://software.broadinstitute.org/software/igv/). All data files in this submission can also be viewed in a standard text editor.
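Because the .vcf files are plain text, they can also be inspected with a few lines of code. The following Python sketch is illustrative only and is not part of the authors' Code.R. It relies on the standard VCF layout, in which `##` lines hold metadata, the `#CHROM` line holds the column names with samples from the tenth column on, and each remaining line is one variant. The function name `summarize_vcf` and the toy records are hypothetical.

```python
import io


def summarize_vcf(handle):
    """Return (sample_names, n_variants) for a plain-text VCF."""
    samples = []
    n_variants = 0
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#CHROM"):
            # The first nine columns are fixed fields; sample names follow.
            samples = line.split("\t")[9:]
            continue
        if line:
            n_variants += 1  # every remaining non-empty line is one variant
    return samples, n_variants


# Toy records standing in for a file such as ped.neutral.vcf.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n"
    "1\t10\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n"
    "2\t25\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t./.\n"
)
samples, n_variants = summarize_vcf(toy_vcf)
print(samples, n_variants)
```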
Methods
Seed sourcing
In 2019, seeds of V. lanceolata and V. sagittata were provided by The Nature Conservancy Kankakee Sands Preserve Nursery (KS; Indiana, US). At this nursery, beds for both species were established around 2015, using seeds collected from four local native populations. Seeds from these beds were harvested and used to grow a subsequent generation of plants in a greenhouse; these potted plants were then transferred outdoors. Seeds harvested from fruits of both chasmogamous and cleistogamous flowers were mixed and then exposed to cold-moist stratification for 60 days at 4 °C. This seed mix was used either as the next year's propagation material or for direct seeding in restorations. In addition, each year the nursery staff perform a small amount of wild collecting, depending on availability, to augment the levels of genetic diversity within the seed mix. Given the propagation practices carried out at this nursery, the number of maternal lines in the sourced seed sample was not known. The seed of Viola pedatifida was provided by the Lake County Forest Preserve Nursery (LCFP; Illinois, US) and harvested in 2019 from plants grown in an in-ground nursery production bed for one generation from seeds collected from a nearby wild population. All seeds were stored at 5 °C until July 2019, when 100 randomly chosen seeds of each species were surface-sterilized and sown into two Petri dishes per cold stratification treatment (25 seeds per replicate).
Cold-moist stratification of subpopulations
We surface-sterilized all seeds by immersing and stirring them in a 1% bleach solution for two minutes, followed by two consecutive one-minute rinses in sterile water. Seeds were then sown in Petri dishes on 1.5% agar medium, sealed with parafilm, and exposed to cold stratification treatments at 0-3 °C for different durations: for the short stratification, all species were stratified for 42 days, while for the long stratification, seeds were stratified for 112 days for V. sagittata and 84 days for V. lanceolata and V. pedatifida. These stratification lengths were chosen, based on conversations with practitioners, to ensure a low but sufficient level of germination with short cold stratification and to maximize germination with long stratification.
At the end of the cold stratification periods, Petri dishes were transferred into an incubator maintained at 25/15 °C (alternating day/night temperatures, 12/12 h cycle) with a 12/12 h photoperiod to trigger germination. From these two stratification duration treatments, we selectively generated two ‘cold stratification subpopulations’ for each species. For the short subpopulation, we randomly selected 20 seeds that germinated within two weeks of moving to the incubator. A small proportion of seeds germinated while in cold stratification for V. sagittata and V. pedatifida, and none for V. lanceolata. For the long subpopulation, we discarded any individuals that germinated during the cold stratification, as we assumed they would belong to the short or an intermediate subpopulation. By discarding seeds that germinated while in cold stratification, we simulated a real-world application. Practitioners often stratify seeds in moist sand in the refrigerator during the winter to overcome dormancy prior to sowing seeds in flats in the spring; seeds that germinate in cold stratification are less likely to survive this process, as the radicle can easily be damaged when sowing. While the removal of some individuals that germinated during cold stratification may limit the full range of genotypes and, therefore, the genetic diversity observed in the long subpopulation, this experimental group should still sample a greater portion of the theoretical distribution of genotypes in the population than the short subpopulation and retain greater genetic diversity.
Following two weeks of warm stratification, more seeds had germinated in the long stratification treatment than in the short for V. pedatifida (ca. 30% germination in the short vs. 75% in the long) and V. sagittata (ca. 65-80% germination in the short vs. 80-90% in the long). Germination was similar between the short and long stratification treatments in V. lanceolata (at least 85%). The randomly selected germinated seeds were planted in individual plugs with peat media and transferred to a greenhouse with mist (day temperature of 21 °C and night temperature of 18 °C, with supplemental lighting from 6 AM to 10 PM to provide a long-day photoperiod and mist running 3 seconds every 20 minutes from 4 AM to 11 PM). After about one week under these conditions, the seedlings were transplanted to larger pots and transferred to a second greenhouse without mist (day temperature of 19 °C and night temperature of 17 °C, with supplemental lighting from 6 AM to 10 PM). Here, the plants were watered every one to two days and grown until seed production. From the twenty plants grown for each subpopulation, we randomly selected 13-14 individuals for the genetic analysis. The collection and experimental treatment of plants in this study followed relevant institutional, national, and international guidelines and legislation.
Genomic sequencing, data processing, and statistical analyses
We genotyped individuals using double-digestion restriction site-associated sequencing (ddRAD-seq) and the STACKS v 2.2 pipeline. First, we extracted genomic DNA from fresh leaf tissue using a DNeasy Plant Mini Kit (Qiagen, Venlo, Netherlands). Next, we used a modified ddRAD-seq protocol to sample the genome of each species. We used EcoRI and MspI restriction enzymes (New England Biolabs, Ipswich, MA, USA) to digest genomic DNA, after which we ligated adapters and used AMPureXP magnetic beads (Beckman Coulter, Indianapolis, IN, USA) to individually size-select fragments between 500 and 900 bp. We then amplified and cleaned up the libraries and sequenced them using paired-end 150-bp sequencing on an Illumina NovaSeq 6000. We used STACKS v 2.2 to call single nucleotide polymorphisms (SNPs) de novo for each species. We used a subset of individuals to optimize the STACKS parameters -m, -M, and -n. For different combinations of parameters, we compared patterns of genetic distance in multidimensional scaling (MDS) plots, the number of recovered variant sites, and measures of genetic diversity. Optimal parameters are those that recover many SNPs, retain high levels of genetic diversity, and do not drastically change genetic relationships between individuals. We selected the default STACKS parameters (-m 3, -M 2, and -n 2) for calling SNPs using the pipeline denovo_map.pl. For each species, all samples were assigned to a single population in the population map. We required that at least 35% of individuals in the population retain a locus for it to be processed (-r) and that SNPs have a minimum minor allele frequency of 0.05 (--min_maf), and we retained one SNP per stack to minimize linkage disequilibrium (--write-random-snp).
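The three population-level filters described above (a 35% call-rate threshold, a 0.05 minor allele frequency threshold, and one random SNP per locus) can be illustrated with a small Python sketch. This is not STACKS code and does not reproduce its internals. In particular, the sketch applies the call-rate threshold per SNP rather than per locus as a simplification, and the data layout, function names, and random seed are all assumptions made for illustration.

```python
import random


def minor_allele_freq(genotypes):
    """Minor allele frequency of a biallelic SNP; None marks missing data."""
    alleles = [a for g in genotypes if g is not None for a in g]
    if not alleles:
        return 0.0
    freq_alt = sum(alleles) / len(alleles)
    return min(freq_alt, 1.0 - freq_alt)


def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def filter_loci(loci, min_call_rate=0.35, min_maf=0.05, seed=0):
    """Keep the index of one randomly chosen passing SNP per locus.

    loci maps a locus name to a list of SNPs; each SNP is a list of
    diploid genotypes such as (0, 1), with None for a missing individual.
    """
    rng = random.Random(seed)
    kept = {}
    for name, snps in loci.items():
        passing = [
            i for i, snp in enumerate(snps)
            if call_rate(snp) >= min_call_rate
            and minor_allele_freq(snp) >= min_maf
        ]
        if passing:
            kept[name] = rng.choice(passing)
    return kept


# Hypothetical toy data: four individuals, three loci.
toy_loci = {
    "locus1": [[(0, 1), (0, 0), (1, 1), None],
               [(0, 0), (0, 0), (0, 1), (0, 0)]],
    "locus2": [[None, None, None, (0, 1)]],          # call rate 0.25
    "locus3": [[(0, 0), (0, 0), (0, 0), (0, 0)]],    # monomorphic
}
kept = filter_loci(toy_loci)
print(kept)
```

Only locus1 survives in the toy data: locus2 fails the call-rate threshold and locus3 has no minor allele.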
We then generated three unique datasets for each species. First, we used VCFtools 0.1.14 to obtain all loci that passed quality filtering. To quality-filter SNPs, we excluded the following from the analysis: individuals with more than 80% missing data, loci with more than 45% missing data (--max-missing), and loci with a mean depth of coverage below 10X (--min-meanDP). Overall, SNPs had low to moderate read depth coverage and a moderate to high percentage of missing data for individuals and sites for all species. This group of high-quality SNPs comprises the putatively ‘neutral SNPs’ dataset for each species, which we used to verify that individuals from the short and long subpopulations do not comprise distinct genetic clusters for each species. We evaluated population genetic structure within each species using the first two axes of a scaled and centered principal components analysis (PCA), run with the adegenet package in R v. 4.2.2 (R Core Team 2022). In addition, we used the pairwise.neifst() function in the hierfstat package in R v. 4.2.2 (R Core Team 2022) to measure Nei’s FST between the short and long subpopulations of each species.
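The quality-filtering thresholds above can be sketched in Python as follows. This is an illustration of the stated cut-offs, not a reimplementation of VCFtools. The order of the steps (individuals first, then loci), the choice to average depth over non-missing genotypes only, and the name `quality_filter` are assumptions.

```python
def quality_filter(depths, max_ind_missing=0.80,
                   max_locus_missing=0.45, min_mean_depth=10.0):
    """Return (kept individuals, kept locus indices).

    depths maps an individual to a list of per-locus read depths,
    with None marking a missing genotype.
    """
    if not depths:
        return [], []
    n_loci = len(next(iter(depths.values())))

    # Step 1: drop individuals with too much missing data.
    kept_inds = {
        ind: row for ind, row in depths.items()
        if sum(d is None for d in row) / n_loci <= max_ind_missing
    }
    if not kept_inds:
        return [], []

    # Step 2: drop loci with too much missing data or too little depth.
    kept_loci = []
    for j in range(n_loci):
        column = [row[j] for row in kept_inds.values()]
        observed = [d for d in column if d is not None]
        missing = 1 - len(observed) / len(column)
        if not observed or missing > max_locus_missing:
            continue
        if sum(observed) / len(observed) < min_mean_depth:
            continue
        kept_loci.append(j)
    return sorted(kept_inds), kept_loci


# Hypothetical toy depths for four individuals at four loci.
toy_depths = {
    "ind1": [20, 15, None, 4],
    "ind2": [18, None, None, 6],
    "ind3": [None, None, None, None],   # 100% missing, removed
    "ind4": [25, 12, 30, 5],
}
kept_inds, kept_loci = quality_filter(toy_depths)
print(kept_inds, kept_loci)
```

In the toy data, the third locus is dropped for missingness and the fourth for low depth.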
We then subset the neutral SNPs datasets to obtain two additional datasets. The second dataset that we generated for each species was the ‘annotated SNPs’ dataset. We generated a FASTA file for the neutral SNPs dataset, which we searched against the National Center for Biotechnology Information’s (NCBI) database using the MegaBLAST® search program, restricted to the order Violales (National Center for Biotechnology Information 1988). The annotated SNPs are those that had a match, or hit, with an annotated sequence in the NCBI database that was associated with a specific protein (i.e., hits to uncharacterized loci and to chloroplast or mitochondrial loci that were not related to a protein were excluded). If the top hit for a locus was not associated with a protein, the second-best hit that was associated with a protein was selected. The third dataset for each species was the ‘shared, annotated SNPs’ dataset. As we assumed physiological responses associated with short dormancy would be shared across taxa, we compared the NCBI database matches associated with proteins between species. For these shared annotated SNPs, we investigated the gene ontology for those proteins and determined the broad functional category of each protein based on the molecular function and biological processes listed in the PANTHER Classifications and UniProt databases as well as the literature. The three datasets used in these analyses are subsets of one another, meaning that each dataset represents a unique group of SNPs.
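The hit-selection rule and the cross-species comparison described above can be sketched in Python. The record layout (a best-first list of hits with a `protein` field), the function names, and the placeholder locus and protein names are all hypothetical; the real MegaBLAST output in *.blast.alignment.txt would need its own parser. The sketch takes the highest-ranked hit that is tied to a protein, which is one reading of the rule in the text.

```python
def best_protein_hit(hits):
    """Return the highest-ranked hit tied to a protein, or None.

    hits is a list of dicts sorted best-first; the hypothetical
    'protein' field is None for uncharacterized hits.
    """
    for hit in hits:
        if hit.get("protein"):
            return hit
    return None


def annotate(blast_results):
    """Map each locus to its selected protein, skipping loci with none."""
    annotated = {}
    for locus, hits in blast_results.items():
        hit = best_protein_hit(hits)
        if hit is not None:
            annotated[locus] = hit["protein"]
    return annotated


def shared_proteins(*annotations):
    """Proteins that appear in the annotations of every species."""
    protein_sets = [set(a.values()) for a in annotations]
    return set.intersection(*protein_sets)


# Hypothetical toy results for two species.
ped_hits = {
    "locusA": [{"protein": None}, {"protein": "protein_1"}],
    "locusB": [{"protein": None}],
}
sag_hits = {
    "locusC": [{"protein": "protein_1"}],
    "locusD": [{"protein": "protein_2"}],
}
ped_annotated = annotate(ped_hits)
sag_annotated = annotate(sag_hits)
shared = shared_proteins(ped_annotated, sag_annotated)
print(ped_annotated, shared)
```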
For each species, we evaluated differences in genetic diversity between the short and long subpopulations using both the neutral SNPs datasets and the annotated SNPs datasets. Here, we are specifically interested in the variation and diversity of alleles, rather than how alleles are arranged into genotypes. As such, for each locus, we measured gene diversity, or HS, which considers the frequency of alleles as well as their evenness in the population. We used the basic.stats() function in the hierfstat package in R v. 4.2.2 (R Core Team 2022) to measure gene diversity for the short and long subpopulations of each species. To evaluate how the selection of short versus long moist cold stratification influences gene diversity, we tested for differences in HS using a paired Wilcoxon signed-rank test with Bonferroni correction using the function wilcox_test() in the package rstatix v.0.7.2. For all comparisons, the percent change in HS was then calculated for each locus from the short subpopulation to the long moist cold stratification subpopulation (e.g., [HSlong − HSshort] / HSlong). Loci whose percent change was more than 2 standard deviations above the mean percent change were then identified, suggesting that the change is greater than would be expected by chance and is therefore putatively a significant increase in percent change. If a locus was fixed for homozygosity (i.e., HS = 0) in the short subpopulation but non-zero in the long subpopulation, then the percent change was considered a significant increase from short to long. If a locus was not present in both subpopulations of a species, it was excluded from all analyses.
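The per-locus calculations can be sketched in Python under explicit assumptions. Gene diversity is computed here with the simple form 1 − Σp², which is not the sample-size-corrected estimator that hierfstat's basic.stats() reports. The percent change is assumed to be the difference between long and short HS divided by the long value. Whether loci fixed in the short subpopulation contribute to the mean and standard deviation is not stated in the text, so this sketch includes them. All function names are hypothetical.

```python
from statistics import mean, stdev


def gene_diversity(allele_counts):
    """Simplified gene diversity: 1 - sum(p_i^2) from allele counts."""
    total = sum(allele_counts)
    return 1.0 - sum((c / total) ** 2 for c in allele_counts)


def percent_change(hs_short, hs_long):
    """(long - short) / long; None when Hs is zero in the long group."""
    if hs_long == 0:
        return None
    return (hs_long - hs_short) / hs_long


def flag_significant(pairs, n_sd=2.0):
    """Flag loci with a putatively significant increase from short to long.

    A locus is flagged when its change exceeds mean + n_sd * SD, or when
    Hs is zero in the short subpopulation but non-zero in the long one.
    """
    changes = [percent_change(s, l) for s, l in pairs]
    defined = [c for c in changes if c is not None]
    cutoff = mean(defined) + n_sd * stdev(defined)
    flags = []
    for (s, l), c in zip(pairs, changes):
        fixed_in_short = s == 0 and l > 0
        flags.append(fixed_in_short or (c is not None and c > cutoff))
    return flags


# Hypothetical (Hs short, Hs long) pairs for five loci.
toy_pairs = [(0.2, 0.25), (0.3, 0.3), (0.1, 0.1), (0.0, 0.4), (0.25, 0.2)]
flags = flag_significant(toy_pairs)
print(gene_diversity([5, 5]), flags)
```

In the toy data, only the fourth locus is flagged, because it is fixed in the short subpopulation and variable in the long one.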
In addition, we measured rarefied allelic richness (AR), or the rarefied count of alleles, for loci in each subpopulation using the allelic.richness() function in the hierfstat package. For each species, we rarefied AR according to the subpopulation with the smallest sample size. We tested for differences in AR using a paired Wilcoxon signed-rank test with Bonferroni correction using the function wilcox_test() in the package rstatix v.0.7.2. If a locus was not present in both subpopulations of a species, it was excluded from all analyses.
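Rarefaction of allelic richness can be illustrated with the standard expectation for the number of distinct alleles in a random draw of n gene copies: the sum over alleles of 1 − C(N − N_i, n) / C(N, n), where N is the total number of sampled gene copies and N_i is the count of allele i. The Python sketch below implements that formula directly. It is not the hierfstat code, the function name is hypothetical, and the choice of n (for example, the number of gene copies in the smallest subpopulation) is left to the caller.

```python
from math import comb


def rarefied_allelic_richness(allele_counts, n):
    """Expected number of distinct alleles in a draw of n gene copies.

    allele_counts holds the number of copies of each allele at one locus
    in one subpopulation; n is the rarefied sample size in gene copies.
    """
    total = sum(allele_counts)
    if n > total:
        raise ValueError("n cannot exceed the number of sampled gene copies")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when fewer than n copies lack allele i,
    # so an allele that cannot be missed contributes exactly 1.
    return sum(1 - comb(total - c, n) / denom
               for c in allele_counts if c > 0)


# Toy locus: three copies of one allele and one copy of another.
ar = rarefied_allelic_richness([3, 1], 2)
print(ar)
```

For the toy locus, a draw of two copies always contains the common allele and contains the rare one half of the time, giving an expected richness of 1.5.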