Dryad

Data and Code for: Reproductive strategies and their consequences for divergence, gene flow, and genetic diversity in three taxa of Clarkia

Cite this dataset

Diaz-Martin, Zoe et al. (2023). Data and Code for: Reproductive strategies and their consequences for divergence, gene flow, and genetic diversity in three taxa of Clarkia [Dataset]. Dryad. https://doi.org/10.5061/dryad.sxksn038b

Abstract

Differences in reproductive strategies can have important implications for macro- and micro-evolutionary processes. We used a comparative approach through a population genetics lens to evaluate how three distinct reproductive strategies shape patterns of divergence among as well as gene flow and genetic diversity within three closely related taxa in the genus Clarkia. One taxon is a predominantly autonomous self-fertilizer and the other two taxa are predominantly outcrossing but vary in the primary pollinator they attract. We genotyped populations using genotyping-by-sequencing and compared loci shared across taxa; our results suggest that differences in reproductive strategies in part promote evolutionary divergence among these closely related taxa. Contrary to expectations, we found that the selfing taxon had the highest levels of heterozygosity but a low rate of polymorphism. The high levels of fixed heterozygosity for a subset of loci suggest this pattern is driven by the presence of structural rearrangements in chromosomes common in other Clarkia taxa. In evaluating patterns within taxa, we found a complex interplay between reproductive strategy and geographic distribution. Differences in the mobility of primary pollinators did not translate to a difference in rates of genetic diversity and gene flow within taxa, a pattern likely due to one taxon having a patchier distribution and a less temporally and spatially reliable pollinator. Taken together, this work advances our understanding of the factors that shape gene flow and the distribution of genetic diversity within and among closely related taxa.

README

This README file was generated on 2023-09-06 by Zoe Diaz-Martin.

GENERAL INFORMATION

  1. Title of Dataset: Reproductive strategies and their consequences for divergence, gene flow, and genetic diversity in three taxa of Clarkia
  2. Author Information
    A. Corresponding Author Contact Information
      Name: Zoe Diaz-Martin
      Institution: Spelman College
      Address: Atlanta, GA USA
      Email: zoediazmartin@spelman.edu

SHARING/ACCESS INFORMATION

  1. Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
  2. Links to publications that cite or use the data:

Diaz-Martin, Zoe et al. Reproductive strategies and their consequences for divergence, gene flow, and genetic diversity in three taxa of Clarkia. Heredity.

  3. Links to other publicly accessible locations of the data: None
  4. Links/relationships to ancillary data sets: None
  5. Was data derived from another source? No
    A. If yes, list source(s): NA
  6. Recommended citation for this dataset:

Diaz-Martin, Zoe et al. (Forthcoming 2023). Data and Code for: Reproductive strategies and their consequences for divergence, gene flow, and genetic diversity in three taxa of Clarkia [Dataset]. Dryad. https://doi.org/10.5061/dryad.sxksn038b

DATA & FILE OVERVIEW
File List:

  • A) CB.Seperate.vcf
  • B) CCA.Seperate.vcf
  • C) CCC.Seperate.vcf
  • D) Combined.All.Taxa.vcf
  • E) CombinedDataset.gen
  • F) CombinedGenhet.txt
  • G) Combined.K3.txt
  • H) CombinedPHt.csv
  • I) Combined.SNPList.txt
  • J) CB_genepop.gen
  • K) CB_K4.txt
  • L) CCA_genepop.gen
  • M) CCA_K3.txt
  • N) CCC_genepop.gen
  • O) CCC_K2.txt
  • P) Neb_All.csv
  • Q) Neb_NoCCA.csv
  • R) PHt.F.Seperate.csv

Relationship between files, if important: The .vcf files were used to generate the data in subsequent files. For example, the file 'Combined.All.Taxa.vcf' is the cleaned, processed combined-taxa genetic data that was used to create all other files with the prefix "Combined".

Additional related data collected that was not included in the current data package: We have not included the raw, unprocessed data files.

Are there multiple versions of the dataset? No
  A. If yes, name of file(s) that was updated: NA
    i. Why was the file updated? NA
    ii. When was the file updated? NA

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CB.Seperate.vcf

This file contains the quality-filtered variant calls for only the Clarkia breweri individuals that were included in the analysis.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CCA.Seperate.vcf

This file contains the quality-filtered variant calls for only the Clarkia concinna subspecies automixa individuals that were included in the analysis.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CCC.Seperate.vcf

This file contains the quality-filtered variant calls for only the Clarkia concinna subspecies concinna individuals that were included in the analysis.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: Combined.All.Taxa.vcf

This file contains the quality-filtered variant calls for the Clarkia breweri, Clarkia concinna subspecies automixa, and Clarkia concinna subspecies concinna individuals that were included in the analysis.
To be included, loci must have been present in at least two of the three taxa.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CombinedDataset.gen

This is a genepop-formatted file with genotypes for all individuals of Clarkia breweri, Clarkia concinna subspecies automixa, and Clarkia concinna subspecies concinna that were included in the analysis.
To be included, loci must have been present in at least two of the three taxa.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CombinedGenhet.txt

This is a GenHet-formatted file with genotypes for all individuals of Clarkia breweri, Clarkia concinna subspecies automixa, and Clarkia concinna subspecies concinna that were included in the analysis.
To be included, loci must have been present in at least two of the three taxa.
This file is used by the GenHet package in R to produce individual-based measures of the proportion of heterozygous loci (PHt).

#########################################################################

DATA-SPECIFIC INFORMATION FOR: Combined.K3.txt

This file contains the output from the COANCESTRY .Q file for K = 3 for the Clarkia breweri, Clarkia concinna subspecies automixa, and Clarkia concinna subspecies concinna individuals that were included in the analysis.
Each row is an individual (individual IDs can be found in the .fam file). Each column is a distinct genetic cluster, and each value is the percent ancestry of that individual assigned to that cluster.
This file is used to visualize population genetic structure, similar to STRUCTURE outputs.
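
As a minimal sketch, a Q matrix like this one could be drawn as a stacked bar plot in base R. The sketch assumes the file is whitespace-delimited with no header row and no ID column, which is not stated in this README; the helper name `read_q` is hypothetical, and the toy values are invented for illustration.

```r
# Read a whitespace-delimited Q matrix (rows = individuals, columns = clusters).
# Assumption: no header row and no ID column.
read_q <- function(path) {
  q <- as.matrix(read.table(path, header = FALSE))
  colnames(q) <- paste0("Cluster", seq_len(ncol(q)))
  q
}

# Toy stand-in for Combined.K3.txt (values are made up).
toy_path <- tempfile(fileext = ".txt")
writeLines(c("0.98 0.01 0.01",
             "0.02 0.97 0.01",
             "0.10 0.15 0.75"), toy_path)

q <- read_q(toy_path)

# Sanity check: rows should sum to ~1 if the values are ancestry proportions.
row_totals <- rowSums(q)

# Stacked bar plot: one bar per individual, one colour per cluster.
barplot(t(q), col = c("steelblue", "orange", "darkgreen"),
        border = NA, space = 0,
        xlab = "Individual", ylab = "Ancestry")
```

To try this on the real data, replace `toy_path` with the path to Combined.K3.txt.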

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CombinedPHt.csv

  1. Number of variables: 4
  2. Number of cases/rows: 165
  3. Variable List:
    • Sample: the unique sample identifier
    • Taxa: the taxon of each sample, (CB = Clarkia breweri, CCA = Clarkia concinna subsp. automixa, CCC = Clarkia concinna subsp. concinna)
    • Site: the site from which the individual was collected
    • PHt: the proportion of heterozygous loci generated from the GenHet package in R.
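
A short sketch of how a table with these four columns might be loaded and summarised by taxon in base R. The function name `summarise_pht` is hypothetical, and the toy rows written to a temporary file are invented; only the column names come from the variable list above.

```r
# Summarise PHt by taxon from a table shaped like CombinedPHt.csv.
summarise_pht <- function(path) {
  pht <- read.csv(path, stringsAsFactors = FALSE)
  # Check that the expected columns are present.
  stopifnot(all(c("Sample", "Taxa", "Site", "PHt") %in% names(pht)))
  data.frame(
    Taxa       = sort(unique(pht$Taxa)),
    n          = as.vector(table(pht$Taxa)),
    median_PHt = as.vector(tapply(pht$PHt, pht$Taxa, median))
  )
}

# Toy stand-in for CombinedPHt.csv (values are made up).
toy_csv <- tempfile(fileext = ".csv")
writeLines(c("Sample,Taxa,Site,PHt",
             "s1,CB,A,0.10",
             "s2,CB,A,0.20",
             "s3,CCA,B,0.40",
             "s4,CCC,C,0.15",
             "s5,CCC,C,0.25"), toy_csv)

res <- summarise_pht(toy_csv)
print(res)
```

With the real file, `summarise_pht("CombinedPHt.csv")` should return one row per taxon code (CB, CCA, CCC).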

#########################################################################

DATA-SPECIFIC INFORMATION FOR: Combined.SNPList.txt

This is a single-column list of all the loci shared among the Clarkia breweri, Clarkia concinna subspecies automixa, and Clarkia concinna subspecies concinna individuals that were included in the analysis.
This file is used by the GenHet package in R to produce individual-based measures of the proportion of heterozygous loci (PHt).

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CB_genepop.gen

This is a genepop-formatted file with quality-filtered genotypes for all individuals of Clarkia breweri.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CB_K4.txt

This file contains the output from the COANCESTRY .Q file for K = 4 for the Clarkia breweri individuals that were included in the analysis.
Each row is an individual (individual IDs can be found in the .fam file). Each column is a distinct genetic cluster, and each value is the percent ancestry of that individual assigned to that cluster.
This file is used to visualize population genetic structure, similar to STRUCTURE outputs.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CCA_genepop.gen

This is a genepop-formatted file with quality-filtered genotypes for all individuals of Clarkia concinna subsp. automixa.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CCA_K3.txt

This file contains the output from the COANCESTRY .Q file for K = 3 for the Clarkia concinna subsp. automixa individuals that were included in the analysis.
Each row is an individual (individual IDs can be found in the .fam file). Each column is a distinct genetic cluster, and each value is the percent ancestry of that individual assigned to that cluster.
This file is used to visualize population genetic structure, similar to STRUCTURE outputs.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CCC_genepop.gen

This is a genepop-formatted file with quality-filtered genotypes for all individuals of Clarkia concinna subsp. concinna.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: CCC_K2.txt

This file contains the output from the COANCESTRY .Q file for K = 2 for the Clarkia concinna subsp. concinna individuals that were included in the analysis.
Each row is an individual (individual IDs can be found in the .fam file). Each column is a distinct genetic cluster, and each value is the percent ancestry of that individual assigned to that cluster.
This file is used to visualize population genetic structure, similar to STRUCTURE outputs.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: Neb_All.csv

  1. Number of variables: 3
  2. Number of cases/rows: 12
  3. Variable List:
    • Taxa: the taxon of each sample, (CB = Clarkia breweri, CCA = Clarkia concinna subsp. automixa, CCC = Clarkia concinna subsp. concinna)
    • Site: the site from which the individual was collected
    • Neb: the number of effective breeders estimated by the program NeEstimator.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: Neb_NoCCA.csv

  1. Number of variables: 3
  2. Number of cases/rows: 10
  3. Variable List:
    • Taxa: the taxon of each sample, (CB = Clarkia breweri, CCC = Clarkia concinna subsp. concinna)
    • Site: the site from which the individual was collected
    • Neb: the number of effective breeders estimated by the program NeEstimator.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: PHt.F.Seperate.csv

  1. Number of variables: 5
  2. Number of cases/rows: 165
  3. Variable List:
    • Sample: the unique sample identifier
    • Site: the site from which the individual was collected
    • Taxa: the taxon of each sample, (CB = Clarkia breweri, CCA = Clarkia concinna subsp. automixa, CCC = Clarkia concinna subsp. concinna)
    • PHt: the proportion of heterozygous loci generated from the GenHet package in R
    • F: the inbreeding coefficient generated from the program plink

CODE / SOFTWARE

CombinedCode.R: This is a script for the program R. This script takes the data files as input and generates the analyses used for testing differences among taxa.
This code will also generate figures in the manuscript.

WithinCode.R: This is a script for the program R. This script takes the data files as input and generates the analyses used for testing differences within taxa.
This code will also generate figures in the manuscript.

USAGE NOTES

The files containing genetic data in .vcf format can be viewed using the Integrative Genomics Viewer
(IGV; https://software.broadinstitute.org/software/igv/).

GENEPOP-formatted files (.gen) can be viewed in R using the package adegenet. To do so, open R or RStudio and run the following code:

    setwd("C:/path/to/your/wd")   # set your working directory to where your .gen file is located
    install.packages('adegenet')  # install the package (only needed once)
    library(adegenet)             # load the package you just installed
    data <- read.genepop('CombinedDataset.gen', ncode = 2)  # read the file and store it as a genind object that can be manipulated
    summary(data)                 # generate a summary of your data

You can also save .gen files as .txt files to be analyzed in the GENEPOP website (https://genepop.curtin.edu.au/), which also provides instructions on software download.

Methods

DNA extraction and sequencing

We extracted genomic DNA following a modified version of the cetyltrimethylammonium bromide (CTAB) protocol developed by Doyle and Doyle (1987). We used single nucleotide polymorphisms (SNPs) generated from genotyping-by-sequencing libraries, which were prepared following Elshire et al. (2011) using the restriction enzyme ApeKI to fragment the genome. To avoid any batch effect, the individuals from each population were split evenly between the two genomic libraries of 96 samples. Each polymerase chain reaction (PCR) was carried out independently for all samples, and each library was then quantified using a high-sensitivity Qubit™ assay (dsDNA HS Assay Kit, Thermo Fisher Scientific) and pooled in the final step before sequencing to ensure an equivalent amount of each sample was present in the final genomic library. Sequencing was performed on an Illumina HiSeq with 150 bp paired-end reads at the Center for Genetic Medicine at Northwestern Medicine.

Calling single nucleotide polymorphisms (SNPs)

We used STACKS v 2.2 (Catchen et al. 2011; 2013) to call single nucleotide polymorphisms (SNPs) and generate four distinct datasets: a combined dataset to compare measures of genetic diversity and divergence among taxa, and one dataset per taxon for comparisons between populations within taxa. To evaluate divergence among C. concinna subsp. automixa, C. concinna subsp. concinna, and C. breweri, we called SNPs that were shared among at least two taxa (i.e., the combined dataset). Because the combined dataset resulted in many loci being monomorphic within one taxon but polymorphic in the others, it was necessary to call SNPs for each taxon separately to assess genetic diversity, inbreeding, and population structure within taxa. For the combined and separate datasets, the parameters -m, -M, -n, -max-locus-stacks, and -bound-high were optimized using four samples from each population run across lanes and changing one parameter at a time. The ‘best’ parameters were those that maximized the number of SNPs while minimizing genetic distance between samples from the same population, as shown in a metric multi-dimensional scaling (MDS) plot generated using PLINK 2 (Purcell et al. 2007; Mastretta-Yanes et al. 2015). The parameters used to call SNPs varied for each dataset (Appendix A, Figures S2 – 5, Table S1a – b).

For the combined dataset, we built a catalog using all samples and labeled them by taxonomic assignment (C. concinna subsp. automixa, C. concinna subsp. concinna, and C. breweri) for the population map. We ran the ‘populations’ command in STACKS and only called loci that were present in at least two of the three taxa (-p 2) and in at least 50% of individuals in a taxon (-r 0.5), with a minor allele frequency greater than 0.05 (-maf 0.05) and one SNP per sequence. For the three datasets where SNPs were called separately for C. concinna subsp. automixa, C. concinna subsp. concinna, and C. breweri, we built catalogs with five samples from each population and included samples that had high numbers of reads and were collected across the population and sequenced on different plates. For the ‘populations’ command, we specified that loci needed to be in at least 80% of individuals (-r 0.80), the minor allele frequency needed to be greater than 0.05 (as suggested by Paris et al. 2017), and one SNP per sequence was allowed. All datasets were then quality filtered for read depth, missing data, and Hardy-Weinberg Equilibrium (Appendix A). In total, 16 individuals failed to pass quality filtering, leaving a total sample size of 165 individuals, with 52 individuals of C. concinna subsp. concinna, 29 individuals of C. concinna subsp. automixa, and 84 individuals of C. breweri.

Statistical analyses

Among taxa – Genetic divergence and diversity 

We used the combined dataset to determine the amount of divergence among taxa with distinct mating systems. We used the program ADMIXTURE 1.3.0 (Alexander, Novembre, and Lange 2009) to evaluate population genetic structure among taxa by considering genetic clusters, or K, from 1 to 10 and employing a cross-validation procedure. We considered the most appropriate number of K to be the one with the lowest cross-validation score or the K at the ‘knee’ of the cross-validation plot. We then used ADMIXTURE to calculate pairwise FST between the genetic clusters. In addition, we evaluated the divergence among groups by using the first two axes of a scaled and centered principal components analysis (PCA) generated with the R package adegenet (Jombart 2008). All analyses were conducted in R v. 4.0.2 (R Core Team 2020), unless noted otherwise.
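
The choice of K by cross-validation can be sketched as follows. This is an illustration only: it assumes the cross-validation errors have been collected as text lines of the form `CV error (K=3): 0.501`, and the error values shown are invented, not results from this study.

```r
# Example lines as they might be collected from ADMIXTURE logs.
# The error values are invented for illustration.
cv_lines <- c("CV error (K=1): 0.712",
              "CV error (K=2): 0.548",
              "CV error (K=3): 0.501",
              "CV error (K=4): 0.509")

# Pull out K and the cross-validation error from each line.
k  <- as.integer(sub(".*K=([0-9]+).*", "\\1", cv_lines))
cv <- as.numeric(sub(".*: *", "", cv_lines))

cv_table <- data.frame(K = k, cv_error = cv)

# K with the lowest cross-validation error.
best_k <- cv_table$K[which.min(cv_table$cv_error)]

# Plot the errors so the 'knee' can also be judged by eye.
plot(cv_table$K, cv_table$cv_error, type = "b",
     xlab = "K", ylab = "Cross-validation error")
```

The minimum gives one candidate K; the plot is kept because the text also allows choosing the K at the ‘knee’ of the curve, which is a visual judgment.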

We also used the combined dataset to evaluate patterns of genetic diversity of the loci shared between taxa. Using the STACKS populations output, we calculated the number and percent of polymorphic loci. The low population sample size precluded the use and testing of population-based measures of genetic diversity and inbreeding. However, robust sampling at the individual level enabled the use of the genhet() function in R to measure the individual-level proportion of heterozygous loci (PHt), or the number of heterozygous loci over the number of genotyped loci (Coulon 2010). We then used the stats package (R Core Team 2020) to test for taxon-based differences in PHt using a pairwise Wilcoxon rank sum test with a Bonferroni correction.
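
As a rough illustration of the two steps above, PHt can be computed directly from its definition and then compared among taxa with `pairwise.wilcox.test()` from the stats package. This sketch does not use the genhet() function itself; the 0/1/2 genotype coding and all values are assumptions made up for the example.

```r
# Toy genotype matrix: rows = individuals, columns = loci.
# Assumed coding: 0 or 2 = homozygous, 1 = heterozygous, NA = not genotyped.
geno <- rbind(
  ind1 = c(0, 1, 1, NA, 2),
  ind2 = c(0, 0, 1, 2, 2),
  ind3 = c(1, 1, 1, 1, NA)
)

# PHt = number of heterozygous loci / number of genotyped loci.
pht <- rowSums(geno == 1, na.rm = TRUE) / rowSums(!is.na(geno))

# Toy PHt values for three taxa (invented, not the study's results).
set.seed(1)
toy <- data.frame(
  Taxa = rep(c("CB", "CCA", "CCC"), each = 8),
  PHt  = c(runif(8, 0.05, 0.15), runif(8, 0.30, 0.40), runif(8, 0.10, 0.20))
)

# Pairwise Wilcoxon rank sum tests with a Bonferroni correction.
res <- pairwise.wilcox.test(toy$PHt, toy$Taxa, p.adjust.method = "bonferroni")
print(res)
```

With the real data, the `PHt` and `Taxa` columns of CombinedPHt.csv would take the place of the toy data frame.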

Within taxa – Genetic diversity, inbreeding, effective population size, and gene flow

Using the three datasets called for each taxon separately, we investigated inbreeding, genetic diversity, gene flow, and effective population size (NE). We again measured individual-level PHt as well as the inbreeding coefficient (F) using PLINK 2 and the --het command (Purcell et al. 2007). We tested for taxon-based differences in PHt and F using a pairwise Wilcoxon rank sum test with a Bonferroni correction. We then estimated NE of each population using the program NeEstimator v. 2.1 (Do et al. 2014). The linkage disequilibrium method, which calculates NE based on the amount of linkage disequilibrium within a population while correcting for sample size, was unable to estimate measures of NE and 95% confidence intervals (Waples and Do 2010). However, the heterozygote excess method estimated NEB, or the effective number of breeders, which gives reliable insight into NE when the effective population size is small (Zhdanova and Pudovkin 2008; Waples and Do 2010; Gilbert and Whitlock 2015). This method takes advantage of random differences in allele frequencies between parents in a small population, which results in an excess of heterozygote genotypes compared to expectations under Hardy-Weinberg Equilibrium (Pudovkin, Zaykin, and Hedgecock 1996). We tested for differences in NEB between C. concinna subsp. concinna and C. breweri using a Kruskal-Wallis rank sum test with the stats package but were unable to include C. concinna subsp. automixa in this assessment due to low sample size.
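
The NEB comparison between the two outcrossing taxa could look something like the sketch below. The data frame mimics the layout of Neb_NoCCA.csv (Taxa, Site, Neb) as described in the README, but the site names and Neb values are invented for illustration.

```r
# Toy table shaped like Neb_NoCCA.csv; values are made up.
neb <- data.frame(
  Taxa = rep(c("CB", "CCC"), each = 5),
  Site = paste0("site", 1:10),
  Neb  = c(12, 18, 25, 9, 30, 14, 22, 11, 27, 16)
)

# Kruskal-Wallis rank sum test for a difference in Neb between taxa.
kw <- kruskal.test(Neb ~ Taxa, data = neb)
print(kw)
```

To run this on the deposited data, `neb <- read.csv("Neb_NoCCA.csv")` would replace the toy data frame, assuming the column names match the variable list in the README.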

We again used ADMIXTURE and PCA plots as described above to compare population genetic structure and gene flow between populations within taxa. In addition, we used the program GENEPOP to calculate pairwise FST (Weir & Cockerham 1984) between populations for each taxon. We then used an independent two-group Mann-Whitney U test to test for differences in pairwise FST between C. concinna subsp. concinna and C. breweri. We were unable to include C. concinna subsp. automixa because only two populations were sampled. We also evaluated patterns of isolation by distance for C. concinna subsp. concinna and C. breweri, the two taxa with sufficient sampling. We calculated a pairwise matrix of FST / (1 – FST) (Rousset 1997) between populations as well as a pairwise matrix of the log of geographic distance between populations. We then used both in a Mantel test in the R package ade4 with 9999 replicates (Dray and Dufour 2007).
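
A minimal sketch of the isolation-by-distance setup described above. The population names, FST values, and distances are invented; the natural log and kilometre units are assumptions, since the text does not specify them. The Mantel test is run only if ade4 is installed.

```r
# Toy pairwise FST and geographic distance (km) among four populations.
# All numbers are invented for illustration.
pops <- c("P1", "P2", "P3", "P4")
fst <- matrix(c(0,    0.05, 0.10, 0.20,
                0.05, 0,    0.08, 0.15,
                0.10, 0.08, 0,    0.12,
                0.20, 0.15, 0.12, 0),
              nrow = 4, dimnames = list(pops, pops))
geo_km <- matrix(c(0,  5,  20, 60,
                   5,  0,  15, 50,
                   20, 15, 0,  30,
                   60, 50, 30, 0),
                 nrow = 4, dimnames = list(pops, pops))

# Linearise FST as FST / (1 - FST) and log-transform distance.
# as.dist() keeps only the lower triangle, so the diagonal is ignored.
gen_dist <- as.dist(fst / (1 - fst))
geo_dist <- as.dist(log(geo_km))

# Mantel test with 9999 permutations, if ade4 is available.
if (requireNamespace("ade4", quietly = TRUE)) {
  ibd <- ade4::mantel.rtest(gen_dist, geo_dist, nrepet = 9999)
  print(ibd)
}
```

In practice the FST matrix would come from the GENEPOP output and the distances from the site coordinates, neither of which is reproduced here.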

Usage notes

See README file. 

Funding

National Science Foundation, Award: DBI 1461007