Population genetics of Apostichopus californicus along the Northeastern Pacific Coast
Data files
Jan 09, 2023 version files 9.96 MB
-
ApoCal_filtered_SNPs_adap_211.gen
-
ApoCal_filtered_SNPs_all_2075.gen
-
ApoCal_filtered_SNPs_neu_1864.gen
-
collection_site_coords.csv
-
environ_data.csv
-
environ_var_info.csv
-
putadapt_SNPs_211.fasta
-
README.txt
Abstract
A growing body of evidence suggests that spatial population structure can develop in marine species despite large population sizes and high gene flow. Characterizing population structure is important for the effective management of exploited species, as it can be used to identify appropriate scales of management in fishery and aquaculture contexts. The California sea cucumber, Apostichopus californicus, is one such exploited species whose management could benefit from further characterization of population structure. Using restriction site-associated DNA (RAD) sequencing, we developed 2,075 single nucleotide polymorphisms (SNPs) to quantify genetic structure over a broad section of the species’ range along the North American west coast and within the Salish Sea, a region supporting the Washington State A. californicus fishery and developing aquaculture production of the species. We found evidence for population structure (global fixation index (FST) = 0.0068) with limited dispersal driving two patterns of differentiation: isolation-by-distance and a latitudinal gradient of differentiation. Notably, we found detectable population differences among collection sites within the Salish Sea (pairwise FST = 0.001–0.006). Using FST outlier detection and gene-environment association, we identified 10.2% of total SNPs as putatively adaptive. Environmental variables (e.g., temperature, salinity) from the sea surface were more correlated with genetic variation than those same variables measured near the benthos, suggesting that selection on pelagic larvae may drive adaptive differentiation to a greater degree than selection on adults. Our results were consistent with previous estimates of, and patterns in, population structure for this species in other extents of the range. Additionally, we found that patterns of neutral and adaptive differentiation co-varied, suggesting that adaptive barriers may limit dispersal. Our study provides guidance to decision-makers regarding the designation of management units for A. californicus and adds to the growing body of literature identifying genetic population differentiation in marine species despite large, nominally connected populations.
Methods
Approximately 50 adult A. californicus were collected by scuba divers from nine collection sites along the Pacific Coast of North America, ranging from Alaska to Oregon, including four collection sites within the southern Salish Sea. For each animal, a tissue sample was excised from a radial muscle band and stored in 100% ethanol.
DNA was extracted from tissue samples using the EZNA Mollusc DNA Kit (OMEGA Bio-tek, Norcross, GA, USA) and the Qiagen DNeasy Kit (Qiagen, Germantown, MD, USA). DNA was quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA) and DNA quality was checked by gel electrophoresis. DNA concentration was normalized to 500 ng in 20 μL of PCR-grade water. We selected samples with high DNA quality for restriction site associated DNA (RAD) sequencing and RAD libraries were prepared following standard protocols1. Briefly, DNA samples were barcoded with an individual six-base identifier sequence attached to an Illumina P1 adapter. Samples were then pooled into sub-libraries, containing approximately 12 individuals. Sub-libraries were sheared using a Bioruptor sonicator and size selected to 200-400 bp using a MinElute Gel Extraction Kit (Qiagen, Germantown, MD, USA). P2 adapters were ligated to DNA in sub-libraries and amplified with PCR using 12–18 cycles as in Etter et al. (2011). Finally, amplified sub-libraries were combined into pools of approximately 72 individuals. Paired-end 2 x 150-base pair sequencing was performed on an Illumina HiSeq4000 (San Diego, California, USA) at the Beijing Genomics Institute and the University of Oregon Genomics and Cell Characterization Core Facility. Only forward reads were used for analysis. To estimate genotyping error, 14 individuals were sequenced twice.
Raw RAD sequencing data were demultiplexed using the process_radtags module in the pipeline STACKS v.1.442. A threshold of 800,000 reads was used to exclude poorly sequenced individuals. Because a genome was not available for A. californicus, we aligned individual sequences to the genome of a closely related species, A. parvimensis (GenBank accession number = GCA_000934455.1). The A. parvimensis genome was 760,654,621 bp, with 21,559 scaffolds and an N50 size of 9,587. We retained reads with a minimum mapping quality score of 20. Then, we used dDocent v.2.7.8 to perform a reference-guided locus assembly using the filtered reads and default parameters3. Additionally, a parallel de novo assembly was performed, which produced nearly identical results for population structure and 1.8–2.8% lower mean expected heterozygosity, 0.9–1.8% higher mean observed heterozygosity, and 1.2–3.3% higher proportions of polymorphic SNPs than in the with-reference assembly, although with similar patterns across collection sites. The reference-guided assembly was retained for further analyses due to decreased confidence in identifying genotyping errors in the de novo assembly4.
We used vcftools v.0.1.165 to remove indels and to retain only single nucleotide polymorphisms (SNPs) with a minimum quality score of 20, minimum minor allele frequency of 0.05 and maximum missing data per locus of 30% across collection sites. Individuals with more than 30% missing data across SNPs were removed. In cases of multiple SNPs per RAD tag, we retained the SNP with the highest minor allele frequency6. SNPs that were not in Hardy Weinberg Equilibrium (HWE) were considered sequencing errors or poorly assembled loci and were removed from our data set, as selection and inbreeding are unlikely to cause significant deviations from HWE equilibrium at biallelic loci7. We tested SNPs for deviations from HWE using the R package genepop v.1.1.4 (Rousset, 2008). SNPs were identified as being out of HWE if they had a q-value below 0.05 in at least 2 of the collection sites after correcting for false discovery rate, following Waples (2015).
We used a suite of R packages, stand-alone software, and custom scripts in the programming language R v.3.5.0 9 to quantify genetic diversity and population structure. Mean expected heterozygosity, observed heterozygosity, and the inbreeding coefficient (FIS) per SNP were calculated using the R package genepop. The proportion of polymorphic SNPs per collection site was calculated using a custom R script.
To investigate population structure, we first calculated Weir-Cockerham fixation index (FST)10 to quantify population differentiation using the R packages genepop and hierfstat v.0.5.7. Exact G-tests11 were used to test for significant genic differentiation using the R package genepop. To investigate patterns of spatial differentiation among collection sites, the R package adegenet v.2.1.112 was used to conduct discriminant analyses of principal components (DAPC), a multivariate method that summarizes the between-group variation (i.e., population structure), while minimizing within-group variation13. The built-in optimization algorithm was used to retain the number of principal components that minimized over-fitting and under-fitting of the model. To determine the potential number of underlying populations, the program ADMIXTURE v.1.3.0 was used to conduct a clustering analysis14. Specifically, ADMIXTURE uses a maximum likelihood-based approach to estimate individual ancestries across different assumed numbers of populations, with the best fit selected using cross-validation. To examine the presence of hierarchical population structure, we conducted analyses of molecular variance (AMOVA) using the ade4 method of the R package poppr v.2.8.115. Significance of AMOVAs was determined using permutation tests with 1,000 iterations. Using AMOVA, we investigated whether the following oceanographic barriers limit dispersal: 1) the Victoria Sill (Victoria Sill grouping), 2) Admiralty Inlet (Admiralty Inlet grouping), and 3) the North Pacific Current (NPC grouping). Because AMOVAs for each oceanographic barrier include sites in an area with other potential oceanographic barriers, we added a fourth grouping of all three oceanographic barriers (All Barriers grouping) to investigate the relative role of oceanographic barriers compared to other factors. Additionally, we conducted an AMOVA by state or province (State grouping). Although not biologically meaningful, we included the State grouping to determine how much genetic variation is captured by regional management boundaries. Isolation-by-distance (IBD) was tested with Mantel tests16 in R as linear correlation between linearized FST17 using all SNPs and shortest Euclidean distance through water (in-water distance hereafter) approximated in Google Maps18 Following Xuereb et al (2018), we also tested IBD in the northern and southern population section separately. Following Buonaccorsi et al. (2005), we estimated mean dispersal distance from the slope of the regression of linearized FST and in-water distance. We used this 1-dimensional model because it is an appropriate approximation for coastal species with dispersal dimensions greater than one dimension of the habitat (e.g., dispersal distance likely greater than water depth for A. californicus). We estimated mean dispersal distance from a set of potential population density estimates as population density estimates were unavailable.
We used two approaches to investigate putatively adaptive SNPs: FST outlier detection and gene-environment association. FST outlier detection is used to identify loci potentially under spatial selection20,21, although this method does not identify the potential cause of selection. Although gene-environment association does not explicitly test whether such associations are adaptive, this method is used to identify locus-environment associations as evidence for potential local adaptation22. SNPs were classified as putatively adaptive if they were detected as FST outliers or if they were significantly correlated to environmental predictors using gene-environment association.
For FST outlier detection, two methods were used: the program Bayescan v.2.1 20 and the R package OutFLANK v.0.2 21. Bayescan first applies linear regression to decompose FST into a population- and a locus-specific component. Using these components as Bayesian priors, the program estimates the posterior probability that a locus is under selection20. OutFLANK detects FST outliers using a maximum likelihood approach. The program first infers a distribution of neutral FST from a trimmed distribution of empirically collected FST values and uses this neutral distribution to identify outliers. OutFLANK advances earlier FST outlier methods 23 by accounting for sampling error and non-independent sampling of populations, and has lower false positive rates compared to other FST outlier methods21. We used default parameters in both programs, including a false discovery rate of 0.05. SNPs were classified as FST outliers if they were detected with either program, to include SNPs under weak selection, which are likely the majority of SNPs under selection24.
Prior to investigating gene-environment association, we gathered estimates for oceanographic variables at each collection site using the Bio-Oracle and Bio-Oracle 2 databases25, which contain geophysical, biotic, and environmental data layers for marine realms, through the R package sdmpredictors v.0.2.826. We selected a broad suite of 29 oceanographic variables (including temperature, current velocity, salinity, and pH), including those used by Xuereb, Kimber, et al. (2018) for comparison. Where possible, oceanographic variables for both sea surface and mean bottom depth (near-bottom) were used to account for the conditions experienced by pelagic larvae and benthic adults respectively. The Eld Inlet, Washington collection site was removed from these analyses because environmental predictor data were not available. We also calculated correlation among predictor variables and interpreted results in light of these correlations.
To investigate genomic evidence for local adaptation, gene-environment associations were explored using a univariate association method, Bayenv222 and a multivariate method, redundancy analysis (RDA). We used both methods as each has an advantage: the interpretation of results for univariate methods like Bayenv2 can be clearer than multivariate methods for gene-environment association, and multivariate methods such as RDA have lower false positive rates and greater sensitivity for detecting weak and multi-locus selection27.
The program Bayenv2 (Günther & Coop, 2013) uses a univariate Bayesian framework to test for significant correlation between allele frequencies and environmental predictor variables, accounting for population structure by first estimating covariance among loci. Correlations with a minimum Bayes Factor of 10, or minimum “strong” support28, were retained in the analysis.
Redundancy analyses were performed in the R package vegan v.2.5-629. Redundancy analysis summarizes the variation in a set of response variables (here, the allele frequencies) due to a set of explanatory variables (here, the oceanographic variables), using an extension of multiple linear regression that allows regression of multiple response variables on multiple explanatory variables. Here, only biallelic SNPs were retained, and allele frequencies were Hellinger-transformed prior to RDA30. To avoid overdetermination of RDA models with many environmental predictors, we conducted multiple RDAs on subsets of environmental predictors and reduced the dimensionality of our environmental predictors within sets by first combining them into orthogonal principal components. Specifically, environmental predictors were grouped into four a priori sets: (1) all, (2) sea surface, (3) near-bottom and (4) current velocity and temperature predictors (measured at either sea surface or near-bottom). Sets 2 and 3 were chosen to investigate differences in putative adaptation at pelagic and benthic life history stages. Environmental predictor set 4 was chosen for comparison with the existing study on A. californicus in a different part of the species range, as these variables were strongly correlated with genetic population structure31. Predictors were standardized to a mean of 0 and standard deviation of 1 prior to PCA. PCA was performed on each set of predictors, and principal components were retained if their corresponding eigenvalue was above the mean eigenvalue across principal components.
To account for the potential confounding factor of neutral population structure in our RDA, we conducted a complementary partial RDA for each set of predictor variables, in which the effects of spatial variation were partialled out. We used Euclidean distances among collection sites to compute distance-based Morgan’s eigenvector maps (MEMs) using the R package codep v.0.9-132, to be used as conditioning variables in partial RDAs33. We report the variance inflation factors (VIF) to identify collinearity among spatial variables and environmental predictors, although we note that the effects of such collinearity are addressed in the removal of genetic variation explained by spatial variables in partial RDAs. ANOVA permutation tests for full models and per axis were used to assess the significance of RDA results. For significant models, we identified SNPs putatively involved in local adaption based on the loadings of SNPs in ordination space for significant axes. Specifically, SNPs were classified as putatively adaptive if their loading score was outside of 3 standard deviations of the mean27.
SNPs were classified as putatively adaptive if they were detected as FST outliers using either BayeScan or OutFLANK, or if they were significantly correlated to environmental predictors using Bayenv2 or RDA. Once putatively adaptive SNPs were identified, they were used to distinguish a putatively neutral SNP set and a putatively adaptive SNP set.
To address questions related to demographic and selective processes, we used the same methods on putatively neutral and putatively adaptive data sets separately in testing for IBD, estimating dispersal distance, and estimating effective population size per collection site.
To build hypotheses for the mechanisms underlying adaptive differentiation, potential biological processes associated with putatively adaptive SNPs were identified using blastx v.2.5.034 and the UniProt Knowledge Base (Swiss-Prot, manually annotated)35. We queried the 2000 bp region flanking the SNP, following alignment against the reference genome of P. parvimensis. Matches with a maximum e-value score of 10-10 were retained. Gene ontology slim terms for biological processes were retrieved using an adaptation of the Mouse Genome Informatics database, as developed in Gavery & Roberts (2012). Gene ontology slim terms “other biological processes” and “other metabolic processes” were excluded. A particular locus could be associated with multiple gene ontologies, and a particular gene ontology could be associated with multiple gene ontology slim terms for biological processes. All matches were retained.
To contextualize genetic connectivity in terms of migration rate for management, we developed a simulation model using simuPOP v.1.1.10.937 in Python v.3.7.3 to determine which population sizes, migration rates, and number of generations of drift reproduced our empirically derived pairwise FST results. From an infinitely large ancestral population with global allele frequencies based on our empirical data, we sampled two populations of a specified size and carried those through forward-time simulations with discrete generations, random mating, and no selection. Within each population, two parents were selected at random with replacement to produce one offspring, leading to a random distribution of reproductive success, and allowing for census size to approximate effective population size. The model was parameterized using empirical global allele frequencies for putatively neutral SNPs and simulations were run for each combination of effective population size (500, 2500, 10,000) and migration rate (0.01%, 0.03%, 0.1%, 0.3%, 1%, 3%, 10%, 30%). Additionally, simulations were run for 10 (short-term), 100 (medium-term), and 1,000 (long-term) generations of drift. The long-term option was chosen to approximate equilibrium conditions, although true equilibrium depends on population history and may not be reached in some wild populations within 1,000 generations. Pairwise FST was calculated after the pre-determined number of generations of drift had elapsed and averaged across 5 replicates for each parameter combination.
- Etter, P. D., Preston, J. L., Bassham, S., Cresko, W. A. & Johnson, E. A. Local de novo assembly of RAD paired-end contigs using short sequencing reads. PloS One 6, e18561 (2011).
- Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A. & Cresko, W. A. Stacks: an analysis tool set for population genomics. Mol. Ecol. 22, 3124–3140 (2013).
- Puritz, J. B., Hollenbeck, C. M. & Gold, J. R. dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms. PeerJ 2, e431 (2014).
- Shafer, A. B. A. et al. Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference. Methods Ecol. Evol. 8, 907–917 (2017).
- Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
- Larson, W. A. et al. Genotyping by sequencing resolves shallow population structure to inform conservation of Chinook salmon (Oncorhynchus tshawytscha). Evol. Appl. 7, 355–369 (2014).
- Waples, R. S. Testing for Hardy–Weinberg proportions: have we lost the plot? J. Hered. 106, 1–19 (2015).
- Rousset, F. genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol. Ecol. Resour. 8, 103–106 (2008).
- R Core Team. R: A Language and Environment for Statistical Computing. (R Foundation for Statistical Computing, 2020).
- Weir, B. S. & Cockerham, C. C. Estimating F-statistics for the analysis of population structure. evolution 38, 1358–1370 (1984).
- Goudet, J., Raymond, M., de Meeüs, T. & Rousset, F. Testing differentiation in diploid populations. Genetics 144, 1933–1940 (1996).
- Jombart, T. adegenet: a R package for the multivariate analysis of genetic markers. Bioinforma. Oxf. Engl. 24, 1403–1405 (2008).
- Jombart, T., Devillard, S. & Balloux, F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 11, 94 (2010).
- Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011).
- Kamvar, Z. N., Brooks, J. C. & Grünwald, N. J. Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Front. Genet. 6, 208 (2015).
- Mantel, N. & Valand, R. S. A technique of nonparametric multivariate analysis. Biometrics 547–558 (1970).
- Rousset, F. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics 145, 1219–1228 (1997).
- Google. 'Measure distance" feature. Google Maps maps.google.com (2019).
- Buonaccorsi, V. P., Kimbrell, C. A., Lynn, E. A. & Vetter, R. D. Limited realized dispersal and introgressive hybridization influence genetic structure and conservation strategies for brown rockfish, Sebastes auriculatus. Conserv. Genet. 6, 697–713 (2005).
- Foll, M. & Gaggiotti, O. E. A genome scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics (2008).
- Whitlock, M. C. & Lotterhos, K. E. Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of F ST. Am. Nat. 186, S24–S36 (2015).
- Günther, T. & Coop, G. Robust identification of local adaptation from allele frequencies. Genetics 195, 205–220 (2013).
- Lewontin, R. C. & Krakauer, J. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74, 175–195 (1973).
- Lotterhos, K. E. & Whitlock, M. C. The relative power of genome scans to detect local adaptation depends on sampling design and statistical method. Mol. Ecol. 24, 1031–1046 (2015).
- Assis, J. et al. Bio-ORACLE v2. 0: Extending marine data layers for bioclimatic modelling. Glob. Ecol. Biogeogr. 27, 277–284 (2018).
- Bosch, S., Tyberghein, L. & De Clerck, O. sdmpredictors: an R package for species distribution modelling predictor datasets. Mar. Species Distrib. Data Predict. Models 49 (2017).
- Forester, B. R., Lasky, J. R., Wagner, H. H. & Urban, D. L. Comparing methods for detecting multilocus adaptation with multivariate genotype–environment associations. Mol. Ecol. 27, 2215–2233 (2018).
- Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
- Oksanen, J. et al. The vegan package. Community Ecol. Package 10, 719 (2007).
- Legendre, P. & Gallagher, E. D. Ecologically meaningful transformations for ordination of species data. Oecologia 129, 271–280 (2001).
- Xuereb, A., Kimber, C. M., Curtis, J. M., Bernatchez, L. & Fortin, M.-J. Putatively adaptive genetic variation in the giant California sea cucumber (Parastichopus californicus) as revealed by environmental association analysis of restriction-site associated DNA sequencing data. Mol. Ecol. 27, 5035–5048 (2018).
- Guénard, G., Legendre, P., Boisclair, D. & Bilodeau, M. Multiscale codependence analysis: an integrated approach to analyze relationships across scales. Ecology 91, 2952–2964 (2010).
- Dray, S., Legendre, P. & Peres-Neto, P. R. Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecol. Model. 196, 483–493 (2006).
- Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
- The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
- Gavery, M. R. & Roberts, S. B. Characterizing short read sequencing for gene discovery and RNA-Seq analysis in Crassostrea gigas. Comp. Biochem. Physiol. Part D Genomics Proteomics 7, 94–99 (2012).
- Peng, B. & Kimmel, M. simuPOP: a forward-time population genetics simulation environment. Bioinformatics 21, 3686–3687 (2005).
Usage notes
Raw data files (*.fastq) are too large to open, but can be accessed with Unix commands and dDocent, an open source pipeline for genotyping restriction site-associated DNA sequencing data.
R (e.g., packages adegenet and genepop) and text editors (e.g., Text Wrangler) can be used to open the genepop files and fasta files of putatively adaptive loci.
R (e.g., tidyverse), Microsoft Excel, and text editors (e.g., Text Wrangler) can be used to open the environ_data.csv file, containing environmental data from each collection site.