Skip to main content

Base-substitution mutation rate across the nuclear genome of Alpheus snapping shrimp and the timing of isolation by the Isthmus of Panama

Cite this dataset

Silliman, Katherine et al. (2021). Base-substitution mutation rate across the nuclear genome of Alpheus snapping shrimp and the timing of isolation by the Isthmus of Panama [Dataset]. Dryad.


The formation of the Isthmus of Panama and final closure of the Central American Seaway (CAS) provides an independent calibration point for examining the rate of DNA substitutions. This vicariant event has been widely used to estimate the substitution rate across mitochondrial genomes and to date evolutionary events in other taxonomic groups. Nuclear sequence data is increasingly being used to complement mitochondrial datasets for phylogenetic and evolutionary investigations; these studies would benefit from information regarding the rate and pattern of DNA substitutions derived from the nuclear genome.

To estimate the genome-wide neutral mutation rate (µ), genotype-by-sequencing (GBS) datasets were generated for three transisthmian species pairs in Alpheus snapping shrimp. A range of bioinformatic filtering parameters were evaluated in order to minimize potential bias in mutation rate estimates that may result from SNP filtering. Using a Bayesian coalescent approach (G-PhoCS) applied to 44,960 GBS loci, we estimated µ to be 2.64E−9 substitutions/site/year, when calibrated with the closure of the CAS at 3 Ma. Post-divergence gene flow was detected in one species pair. Failure to account for this post-split migration inflates our substitution rate estimates, emphasizing the importance of demographic methods that can accommodate gene flow.

Results from our study, both parameter estimates and bioinformatic explorations, have broad-ranging implications for phylogeographic studies in other non-model taxa using reduced representation datasets. Our best estimate of µ that accounts for coalescent and demographic processes is remarkably similar to experimentally derived mutation rates in model arthropod systems. These results contradicted recent suggestions that the closure of the Isthmus was completed much earlier (around 10 Ma), as mutation rates based on an early calibration resulted in uncharacteristically low genomic mutation rates. Also, stricter filtering parameters resulted in biased datasets that generated lower mutation rate estimates and influenced demographic parameters, serving as a cautionary tale for the adherence to conservative bioinformatic strategies when generating reduced-representation datasets at the species level. To our knowledge this is the first use of transisthmian species pairs to calibrate the rate of molecular evolution from GBS data.


Reduced representation genotype-by-sequencing datasets were generated for 24 individuals across three species pairs of snapping shrimp that span the Isthmus of Panama: A. malleator/A. wonkimi, A. formosus/A. panamensis, and A. colombiensis /A. estuariensis. Genomic DNA was extracted from 24 individuals using the DNeasy tissue kit (Qiagen Inc., Valencia, California), and samples were treated with RNase following the manufacturer’s protocol. Three to four replicate GBS libraries per individual were optimized, generated, and sequenced at the Cornell University Biotechnology Resource Center Genomic Diversity Facility following the protocol of Elshire et al. 2011), resulting in a total of 96 samples. Briefly, genomic DNA was digested with EcoT22I (A|TGCAT) and barcoded adapters were ligated onto resulting restriction fragments. Pooled libraries were sequenced on a single Illumina HiSeq 2000/2500 lane, obtaining 100 base pair, single-end sequencing reads. Sequence reads from replicate libraries were combined for downstream analyses. 

Raw sequencing reads were de-multiplexed, quality filtered, and de novo clustered using pyRAD v.3.0.2, a pipeline optimized to produce aligned orthologous loci across distantly related taxa using restriction-site associated DNA. Demultiplexing used sample-specific barcode sequences, allowing one mismatch in the barcode sequence. Base calls with a Phred quality score under 20 were converted to Ns, and reads containing more than 4 Ns were discarded. Adapter sequences, barcodes, and the cut site sequences were trimmed from reads passing filter, with only reads greater than 50 bp retained. For within-sample clustering, a minimum coverage cutoff of 5× was employed. Consensus sequences with more than eight heterozygous sites were discarded as potential paralogs. Clustered orthologs containing heterozygous sites that were shared by more than two samples were also discarded as putative paralogs. The same clustering threshold of 85% was used for both within- and across-sample clustering.

We generated 11 datasets that varied in included samples (10–24), the minimum number of samples (m) that had to be shared by each locus (3–6), and the minimum number of species (s) shared by each locus (1–3). In particular, one A. malleator individual had very few sequencing reads and therefore was excluded from some datasets.  The primary dataset, Am4s2, includes all samples (A), with at least four individuals (m4) and two species (s2) genotyped at each locus. Custom Python scripts were used to convert the pyRAD .loci file into .gphocs files, which are provided in this repository.

Usage notes

This repository includes the output of pyRAD, files for running analyses with G-PhoCS, and a table with the summarized results of those analyses. The ReadMe file provides specific info on each file.


National Science Foundation, Award: 0820128