Skip to main content

Data and scripts from: Balanced polymorphism fuels rapid selection in an invasive crab despite high gene flow and low genetic diversity

Cite this dataset

Tepolt, Carolyn (2021). Data and scripts from: Balanced polymorphism fuels rapid selection in an invasive crab despite high gene flow and low genetic diversity [Dataset]. Dryad.


Carcinus maenas is a globally invasive species which spreads and thrives across a range of temperate environments. In the northwestern Pacific, the species has spread across >12 degrees of latitude in 10 years from a single source, following its introduction <35 years ago. Using six locations spanning >1,500 km, we examined genetic structure and selection to temperature using 9,376 Single Nucleotide Polymorphisms (SNPs) derived from cardiac transcriptome sequencing.

Data in this repository includes information on sequenced samples (*.csv, *.txt), a cleaned transcriptome assembly after expression filtering (*.fasta), transcriptome annotation from EnTAP (*.tsv), list of transcripts removed from analysis after mapping (*.txt), high-quality SNPs identified from the transcriptome sequencing with GATK (seven files representing different SNP sets used in the analysis; *.vcf), and four custom scripts used in processing SNP data (*.py and *.R).

Raw sequence data is archived in GenBank's SRA. 2015-2016 samples: BioProject ID PRJNA690934 and BioSample IDs SAMN17267686–SAMN17267781. 2011 samples: BioProject ID PRJNA283611 and BioSample IDs SAMN03653390–SAMN03653413.


Methods reproduced and abridged from the associated paper, "Balanced polymorphism fuels rapid selection in an invasive crab despite high gene flow and low genetic diversity".

Sample Collection
Twelve crabs were sampled from each of six sites along the northeast Pacific range of C. maenas in 2015-2016: Elkhorn Slough, CA; San Francisco Bay, CA; Seadrift Lagoon, CA; Bodega Bay, CA; Tillamook Bay, OR; and Barkley Sound, BC. Two of these sites (Seadrift Lagoon and San Francisco Bay, CA, USA) were sampled in both years, while the remaining sites were sampled once. We also reanalyzed raw sequence data from a prior study of crabs collected in 2011 from two sites (Seadrift Lagoon, CA, USA and Barkley Sound, BC, Canada; Tepolt & Palumbi 2015). Crabs were collected by hand or trap, and hearts were dissected and stored in RNALater at -80°C.

Extraction & Sequencing
Total RNA was extracted from cardiac tissue using TRIzol (Invitrogen, Carlsbad, CA, USA) with 1-bromo-3-chloropropane (Simms, Cizdziel, & Chomczynski, 1993). RNA was quantified using the broad-range RNA assay on a Qubit 3.0 fluorometer (Invitrogen), and up to 4 μg of RNA was used to prepare individually-barcoded cDNA libraries with Ilumina's TruSeq Stranded mRNA Library Prep Kit (Illumina, San Diego, CA, USA). Libraries were sent to the University of California Berkeley’s Genomics Sequencing Laboratory, where they were quantified and pooled into groups of 16 multiplexed samples run on five lanes of an Illumina HiSeq 4000 in 50-bp single-end reads.

Sequence Processing and SNP Identification
Raw sequences were cleaned and trimmed using Trim Galore! v0.6.4 (, a wrapper for Cutadapt v2.6 (Martin 2011). A nucleotide call quality cutoff of Phred ≥20 was used, and reads ≤20bp after adapter removal and quality trimming were discarded. A published C. maenas cardiac transcriptome was used as a reference (Tepolt & Palumbi 2015), after an expression-based screening to remove poorly-supported contigs and reduce computational load. Briefly, we mapped trimmed and clipped reads from Tepolt & Palumbi 2015 back to the reference transcriptome using salmon v1.2.1 (Patro, Duggal, Love, Irizarry, & Kingsford, 2017). We retained only contigs with TPM >1, and re-annotated these contigs using EnTAP v 0.9.1 (Hart et al. 2020), comparing all 6 reading frames against the Swissprot, TrEMBL, and nr protein databases (downloaded March 2020). Annotations to Decapoda were prioritized with the program’s ‘--taxon’ flag. Contigs with a clear taxonomic mismatch to decapods (e.g., bacteria, green plants, fungi, etc.), as well as all likely mitochondrial and ribosomal contigs, were identified and removed from the project after alignment to minimize non-target mapping. This resulted in a clean reference transcriptome of 25,552 nuclear contigs.

Cleaned reads were mapped back to the C. maenas cardiac transcriptome using Bowtie2 v2.4.1 with default settings (Langmead & Salzberg 2013). Picard v2.22.0 was used to sort reads, identify and mark duplicate read sequences, and index the resulting bam files ( The Genome Analysis Toolkit (GATK) v4.1.7.0 was used to identify and genotype biallelic SNPs (McKenna et al. 2010; DePristo et al. 2011).

Across all 120 samples (24 from 2011 and 96 from 2015-2016), GATK identified 163,261 biallelic SNPs with Phred quality scores ≥20. We identified high-quality, well-supported SNPs for downstream analyses using a custom python script that retained only individual genotypes with Phred ≥20 and supported by ≥5 reads. We excluded SNPs missing high-quality genotypes at ≥4 individuals for any site-by-year samples of 12 individuals, SNPs with heterozygosity ≥0.7 (to screen out obvious paralogs), and SNPs where the alternate allele was observed only once across all individuals (to minimize bias by potential sequencing errors). In total, 9,376 SNPs had high-quality genotypes for ≥8 individuals per group and were retained for downstream processing. All 120 individuals had high-quality genotypes at >80% of these SNPs.

Identification of Putative Inversion Polymorphisms
We explored the relationship of the 9,376 high-quality SNPs to identify potential inversion polymorphisms and other disproportionately large groups of SNPs in linkage disequilibirum (LD). Pairwise R2 was calculated across all SNPs using the --geno-r2 and --interchrom-geno-r2 options in vcftools v0.1.16 (Danecek et al. 2011). We then used the R package LDna v0.64 to identify networks of SNPs in large, compact clusters, setting minimum edges to 45 (expected for 10 closely-linked SNPs) and phi to 15 (Kemppainen et al. 2015).

We identified one large outlier LD cluster, which contained 168 SNPs from 56 different contigs. Our analyses strongly suggested an inversion polymorphism (see paper Results), so for clarity we refer to this group of 168 SNPs as an “inferred inversion” throughout the rest of the manuscript.

Genetic Structure and Diversity
SNPs were separately screened to identify all sets of markers in linkage disequilibrium (LD) and generate a set of independent SNPs for population genomics. Pairwise R2 values (calculated above) were used to identify groups of two or more SNPs in LD at R2 ≥0.8 using the R package igraph v1.2.4.2 (Csardi & Nepusz 2006), and then all but one SNP in each group was removed, leaving a set of 6,848 independent SNPs. The one SNP retained from each group had the highest number of high-quality genotypes, with lower-coverage SNPs within an LD group preferentially removed. We use the term “independent” to indicate that these SNPs have been screened to remove those in strong LD, but note that these SNPs may be in LD at lower levels so are not all truly independent. Similarly, this set of 6,848 independent SNPs retained 54 of the 168 SNPs in the inferred inversion which were in lower levels of LD with each other (R2: 0.29-0.79).

To identify a subset of putatively neutral SNPs with which to examine population structure unconfounded by selection, we used BayPass v2.2 in the core model (Gautier et al. 2015), with each site-by-year sample treated as its own group. We assessed potential outliers using a simulated pseudo-observed data set of 6,848 SNPs with the parameters of the real data to set a 10% false discovery rate (FDR) threshold for SNP XtX. This conservative threshold was chosen to avoid retaining SNPs under weak selection, thus yielding a set of SNPs more likely to be truly neutral. This frequency-based approach removed all but one of the SNPs later identified as a candidate for environmental association (see below).

Markers under Selection
We identified SNPs potentially under environmental selection using BayPass and Redundancy Analysis (RDA). We ran these tests using only the five open sites, excluding Seadrift Lagoon, to identify markers potentially driving the observed signal of IBD and latitudinal structuring along the coast. For this testing we used a single sample, collected in 2015-2016, from each site: 6,662 SNPs of the full 6,848 SNP set were polymorphic in these five samples and were used for tests of selection, as well as for tests of relative migration. Tested covariates included site latitude (as Cartesian Y-values) and winter and summer sea surface temperature (SST). Temperature data were derived from NOAA’s OI SST V2 High Resolution Dataset provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their website at (Reynolds et al. 2007). Daily temperatures were averaged over the 2 years prior to sampling, and were determined for the nearest 0.25° grid location to each study site for January (winter) and July (summer).

In BayPass, we first ran a core model analysis to identify frequency outlier SNPs, then used the auxiliary covariate model to test for associations between allele frequencies and latitude and winter and summer SST (Gautier 2015). Covariates were scaled, and we used an Ising prior of 0 since the physical order of contigs is unknown. We considered any SNP with Bayes factor (BF) ≥15 dB to be a candidate for selection with respect to a given covariate.

We ran an RDA on individual genotypes using the R package ‘vegan’ v2.5-6 (Oksanen et al. 2019), with missing genotypes imputed to be the most common observed. Few individual genotypes were missing in our data set: genotyping was ≥90% complete for 97.1% of SNPs, and no single sampling site was missing >4 individual genotypes at any SNP. We used latitude and summer temperature as covariates (RDA formula = individual genotypes ~ Y + July SST); winter temperature was excluded as it was strongly correlated with latitude (R2 = 0.79; p = 0.03). We considered any SNP >3 SD outside the mean loading for its RDA axis to be a candidate for selection (e.g., Forester et al., 2018).

SNPs were considered strong candidates for selection if they were identified by both BayPass and RDA. Latitude is largely correlated with SST in this data set, due both to the strongly linear north-south arrangement of sites along the coast and to the relative coarseness of the satellite-derived temperature data. Because of this confounding, we treated all SNPs identified as candidates for association with either latitude or temperature interchangeably.

Potential Function of Candidate SNPs
Candidate SNPs were examined for their potential impact on protein structure, to provide an initial idea of SNPs that were particularly likely to affect organismal function. Open Reading Frames (ORFs) and corresponding coding sequences for all contigs were predicted using OrfPredictor v3.0 (Min, Butler, Storms, & Tsang, 2005); as many contigs could not be annotated, sequence data alone was used to predict ORFs. Predicted sequences were then used to class the impact of a given SNP on the resulting protein sequence (untranslated, synonymous, or non-synonymous) using a custom python script. We examined SNP substitution patterns for all 9,376 candidate SNPs prior to LD screening. While putatively linked SNPs were removed from the data set prior to selection analysis, they could potentially be driving any relationships we detected in SNPs with which they are in strong LD.

Usage notes


Cm_glob_assembly_1tpm.fasta: Expression-filtered reference transcriptome

final_annotations_lvl4.tsv: EnTAP annotation (runN) for expression-filtered transcriptome

Cm_glob_mito-and-contam_seqs.txt: Transcripts removed after mapping reads because they were putative contamination or mitochondrial

Cm-WC-SERC_sample_info.txt: Assignment of individual samples to groups

WC_env-params.csv: Metadata for groups (site, year, latitude, etc)

New-WC_joint-gt_Cm_glob_1tpm_filtered_snps.vcf: Pre-QC set of 163,261 biallelic SNPs identified across all samples

Cm-WC-SERC_full_9376_hq.vcf: Set of all 9,376 high-quality, high-coverage SNPs

Cm-WC-SERC_6848_hq-ul8.vcf: Set of 6,848 independent SNPs (LD: R2 < 0.8)

Cm-WC-SERC_6662_hq-ul8-5pop.vcf: Set of 6,662 independent SNPs polymorphic in 5 populations used for migration and selection analyses

Cm-WC-SERC_6311_hq-ul8-n9.vcf: Set of 6,311 independent, putatively neutral SNPs

Cm-WC-SERC_168_clust-snps.vcf: Set of 168 SNPs in the inferred inversion

Cm-WC-SERC_97_cand-snps.vcf: Set of all 97 candidate SNPs for environmental selection

Convert_vcftools_linkage_to_single_matrix.R: Script to convert pairwise LD values from vcftools to a matrix of values Script to pull LD cluster information from igraph output Script to identify high-quality, high-coverage SNPs from raw GATK output Script to predict amino acid substition changes from SNP and transcript ORF data


National Science Foundation, Award: OCE-RAPID #1514893