Data from: Genomic signatures of spatially divergent selection at clownfish range margins

Cite this dataset

Clark, Rene et al. (2021). Data from: Genomic signatures of spatially divergent selection at clownfish range margins [Dataset]. Dryad. https://doi.org/10.5061/dryad.5x69p8d30

Abstract

Understanding how evolutionary forces interact to drive patterns of selection and distribute genetic variation across a species’ range is of great interest in ecology and evolution, especially in an era of global change. While theory predicts how and when populations at range margins are likely to undergo local adaptation, empirical evidence testing these models remains sparse. Here, we address this knowledge gap by investigating the relationship between selection, gene flow, and genetic drift in the yellowtail clownfish, Amphiprion clarkii, from the core to the northern periphery of the species range. Analyses reveal low genetic diversity at the range edge, gene flow from the core to the edge, and genomic signatures of local adaptation at 56 Single Nucleotide Polymorphisms (SNPs) in 25 candidate genes, most of which are significantly correlated with minimum annual sea surface temperature. Several of these candidate genes play a role in functions that are up-regulated during cold stress, including protein turnover, metabolism, and translation. Our results illustrate how spatially divergent selection spanning the range core to the periphery can occur despite the potential for strong genetic drift at the range edge and moderate gene flow from the core populations. 

Usage notes

Data and scripts for the analyses in "Genomic signatures of spatially divergent selection at clownfish range margins," which identifies the population structure and signatures of selection in populations of Amphiprion clarkii (the yellowtail clownfish) across the northern half of its range.

Data

A list of the files either read into R scripts or copied to Amarel for analyses. There are also files containing metadata and general output files. Files generated by scripts are not included, with a few exceptions.

Aclarkii_metadata.csv
This file contains metadata for the 25 clownfish used in the main analyses. Data include the individual ID used for sequencing, capture location (site name, latitude, and longitude), collector name, capture date, dissection method, sea surface salinity data at collection location (mean, minimum, maximum, range, and variation), and sea surface temperature data at collection location (mean, minimum, maximum, range, and variation). For some individuals, the standard length (mm), time of collection, time of incubation, time of dissection, and time spent in the freezer are also recorded.

Aclarkii_ref_transcriptome.fa
Fasta file containing the reference sequences that reads from each individual were mapped to.

allele_freqs_all.xlsx
This file contains the non-polarized allele frequencies for every SNP at each sampling site. The file contains four sheets. The first sheet (allele_freqs_all) contains the allele frequencies for all SNPs at all sampling sites and has six columns. The first column contains the contig and bp position (separated by "_") for each SNP. The second column contains the population ID, and the third column indicates whether or not the SNP is an outlier. Columns 5 and 6 contain the unfolded and folded allele frequency, respectively. Sheets 2-4 (allele_freqs_J_noout, etc.) contain the allele frequencies for only the non-outlier SNPs in a given population. In these sheets, columns 1-4 contain the same information as in the first sheet, and columns 5-6 contain the allele count and folded allele count, respectively. This file is used to create site frequency spectra for the Stairway Plot analyses.
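
As a rough illustration of how folded values relate to unfolded ones, the following Python sketch folds an allele frequency and tallies a folded site frequency spectrum from allele counts. It is not the authors' code (their scripts are in R); the function names and the example counts are hypothetical.

```python
from collections import Counter


def fold_frequency(freq):
    """Return the folded (minor-allele) frequency for a frequency in [0, 1]."""
    if not 0.0 <= freq <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return min(freq, 1.0 - freq)


def folded_sfs(allele_counts, n_chromosomes):
    """Tally a folded site frequency spectrum from per-SNP allele counts.

    Entry i of the result is the number of SNPs whose minor allele count
    equals i, for i = 0 .. n_chromosomes // 2.
    """
    tally = Counter(min(c, n_chromosomes - c) for c in allele_counts)
    return [tally.get(i, 0) for i in range(n_chromosomes // 2 + 1)]


# Hypothetical allele counts for five SNPs in 8 diploid fish (16 chromosomes).
example_counts = [1, 15, 8, 3, 13]
example_sfs = folded_sfs(example_counts, 16)
```

Here a count of 15 out of 16 folds to 1 and a count of 13 folds to 3, so the spectrum has two singletons, two SNPs at count 3, and one SNP at count 8.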

annotations_combined.xlsx
This file contains the functional and structural annotations for every outlier SNP. The first column is the contig each SNP is found on, and the second column is the bp position of the SNP. Columns 3 & 4 are the reference and alternate allele, respectively. Column 5 contains the gene name (if available), while column 6 contains the name of the protein coded by the gene. Column 7 contains the molecular function GO annotation and column 8 contains the biological process GO annotation. Column 9 is the structural annotation from SnpEff. Columns 10-14 are the Bayes Factor (deciban units) output from BayPass for sea surface salinity mean, sea surface temperature (SST) mean, SST minimum, SST maximum, and latitude, respectively. Column 15 is the XtX value from BayPass.
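
Because the Bayes Factors in columns 10-14 are in deciban units, it can help to convert them back to the raw scale. A deciban value is 10 times the base-10 logarithm of the Bayes Factor. The Python sketch below shows the conversion in both directions; the function names are hypothetical and no particular significance threshold is implied.

```python
import math


def bayes_factor_to_deciban(bayes_factor):
    """Convert a raw Bayes Factor to deciban units: 10 * log10(BF)."""
    if bayes_factor <= 0:
        raise ValueError("Bayes Factor must be positive")
    return 10.0 * math.log10(bayes_factor)


def deciban_to_bayes_factor(deciban):
    """Convert a value in deciban units back to a raw Bayes Factor."""
    return 10.0 ** (deciban / 10.0)
```

For example, a raw Bayes Factor of 100 corresponds to 20 deciban, and 10 deciban corresponds to a raw Bayes Factor of 10.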

apcl.est
est file that contains the prior distributions for each parameter estimated in the fastsimcoal2 model. This file, along with apcl.tpl, is read into fastsimcoal2.

apcl_raw_bestlhoods.csv
This file contains the parameter estimates & maximum likelihood (ML) from each of the 50 fastsimcoal2 replicate runs with the observed multi-dimensional site frequency spectrum. The first column contains the run number (1-50). Columns 2-7 contain the parameter estimates for each parameter estimated by the model (POPONE = effective population size (Ne) for Japan, POPTWO = Ne for Philippines, POPTHREE = Ne for Indonesia, DISPONE = Japan-Philippines migration rate, DISPTWO = Philippines-Indonesia migration rate, TDIVTWO = time (backwards in generations) of Philippines-Indonesia population split). Columns 8 & 9 contain the estimated ML and observed ML, respectively.
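
One common way to use a table like this is to pick the replicate run with the highest estimated likelihood. The Python sketch below does that for a small CSV. The header names for the run number and likelihood columns (Run, MaxEstLhood, MaxObsLhood) and all values are assumptions for illustration and may not match the actual file.

```python
import csv
import io


def best_run(csv_text, lhood_column="MaxEstLhood"):
    """Return the row (as a dict) with the highest estimated likelihood."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no runs found")
    return max(rows, key=lambda row: float(row[lhood_column]))


# Hypothetical three-run excerpt; real files would hold 50 runs.
EXAMPLE_CSV = """Run,POPONE,POPTWO,POPTHREE,DISPONE,DISPTWO,TDIVTWO,MaxEstLhood,MaxObsLhood
1,1000,5000,8000,0.001,0.002,20000,-2510.4,-2400.0
2,1200,4800,7900,0.002,0.001,18000,-2498.7,-2400.0
3,900,5100,8200,0.001,0.003,21000,-2530.1,-2400.0
"""

best = best_run(EXAMPLE_CSV)
```

In this made-up excerpt, run 2 has the highest estimated likelihood, so its parameter values would be taken as the point estimates.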

apcl.tpl
tpl file that contains the model structure (sampling scheme information, number/types of demographic events, etc.) for fastsimcoal2. This file, along with apcl.est, is read into fastsimcoal2.

Contig_length.csv
This file contains the length of each contig that contains a SNP. The first column is the contig name and the second column is the length of the contig (bp). This file is read into pi.R.

fsc_maxLhood_CI_summary.csv
This file contains the parameter values & maximum likelihood (ML) from the best ML run for each bootstrapped SFS from fastsimcoal2. Columns 1-6 contain the parameter estimates for each parameter estimated by the model (POPONE = effective population size (Ne) for Japan, POPTWO = Ne for Philippines, POPTHREE = Ne for Indonesia, DISPONE = Japan-Philippines migration rate, DISPTWO = Philippines-Indonesia migration rate, TDIVTWO = time (backwards in generations) of Philippines-Indonesia population split). Columns 7 & 8 contain the estimated ML and observed ML, respectively. This file is read into fsc_CIs.R.

output.hicov2.snps.only.mac2.vcf
VCF file that contains all SNPs that passed all bioinformatic filters. This file is read into Fst_script.R and is used for most upstream analyses.

polarized_allele_freqs.csv
This file contains the polarized allele frequencies for every SNP at each sampling site. The first column contains the contig & bp (separated by "_") for each SNP. The second column contains the polarized allele frequency (polarized to Japan). The third column indicates whether or not the SNP is an outlier and the fourth column contains the population ID.  This file is read into Allele Freq Line Plots.R.

relatedness_input_mac2.txt
This file contains the genotype information for every SNP. The first column contains the population ID and individual ID (separated by "_"). A population ID of JJ indicates Japan, NN indicates Indonesia, and PP indicates the Philippines. Columns 2-8425 are SNPs (each SNP has 2 columns) and contain values of 100-130 to indicate which nucleotide a particular individual has (A=100, T=110, G=120, C=130). This file is read into relatedness.R.
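
The coding scheme above can be decoded with a short lookup. The Python sketch below parses one row into a population name, an individual ID, and a list of per-SNP genotypes. It assumes whitespace-delimited fields, and the individual ID in the example is made up; any code outside 100-130 is returned as None because the missing-data convention is not stated here.

```python
NUCLEOTIDE_CODES = {"100": "A", "110": "T", "120": "G", "130": "C"}
POPULATIONS = {"JJ": "Japan", "NN": "Indonesia", "PP": "Philippines"}


def parse_genotype_row(line):
    """Decode one row: 'POP_ID code code ...' with two columns per SNP."""
    fields = line.split()
    pop_id, individual = fields[0].split("_", 1)
    alleles = [NUCLEOTIDE_CODES.get(code) for code in fields[1:]]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per SNP")
    # Pair consecutive columns into one genotype per SNP.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return POPULATIONS.get(pop_id, pop_id), individual, genotypes


# Hypothetical row with two SNPs.
population, fish, snps = parse_genotype_row("JJ_F01 100 110 120 120")
```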

STRUCTURE_mac2.str
This file is formatted for input into the program STRUCTURE. It contains 4212 SNPs across all 25 clownfish. The first column contains the fish ID (each individual has two rows). The second column contains the population ID. Columns 3-4215 are SNPs and contain values of 1-4 to indicate which nucleotide a particular individual has (A=1, T=2, G=3, C=4). This file is read into RDA.R and used for some STRUCTURE analyses.
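
Since each individual occupies two rows in this layout, a reader usually has to merge the rows before working with diploid genotypes. The Python sketch below shows one way to do that. It assumes whitespace-delimited fields and no header row, which may not hold for the actual file; the fish IDs and values are hypothetical, and codes other than 1-4 are returned as None.

```python
NUCLEOTIDES = {"1": "A", "2": "T", "3": "G", "4": "C"}


def read_structure_rows(lines):
    """Merge the two rows per individual into per-SNP diploid genotypes."""
    individuals = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        fish_id, pop_id, codes = fields[0], fields[1], fields[2:]
        alleles = [NUCLEOTIDES.get(code) for code in codes]
        entry = individuals.setdefault(fish_id, {"pop": pop_id, "rows": []})
        entry["rows"].append(alleles)

    genotypes = {}
    for fish_id, entry in individuals.items():
        if len(entry["rows"]) != 2:
            raise ValueError(f"{fish_id}: expected exactly two rows")
        first, second = entry["rows"]
        genotypes[fish_id] = (entry["pop"], list(zip(first, second)))
    return genotypes


# Hypothetical excerpt: two fish, three SNPs each.
example_rows = ["F01 1 1 3 4", "F01 1 2 3 4", "F02 2 1 1 4", "F02 2 1 3 4"]
merged = read_structure_rows(example_rows)
```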
 

R Scripts

A list of the R scripts. If the input files for a script were created by running code and/or calling programs on a remote workstation, details on the code used can be found in the Upstream Analyses section.

Allele Freq Line Plots.R
This script creates polarized allele frequency plots (outlier and non-outlier).

BF Correlation Plots.R
This script reads in a .csv file of Bayes Factors and XtX values outputted by BayPass and creates correlation plots by environmental variables.

Bootstrap_forSFS.R
This script creates bootstrapped VCF files. These files can then be read into easySFS.py to create bootstrapped SFS for reading into fastsimcoal2 and calculating the 95% confidence intervals.

Diversity_Script.R
This script calculates Ho, He, and Fis from the allele frequencies.
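
For readers unfamiliar with these statistics, the Python sketch below computes them for a single biallelic SNP using the textbook definitions: Ho is the proportion of heterozygous individuals, He is 2p(1 - p), and Fis is 1 - Ho/He. It is an illustration only and is not taken from Diversity_Script.R, which may apply sample-size corrections or average across loci differently.

```python
def diversity_stats(genotypes):
    """Return (Ho, He, Fis) for one biallelic SNP.

    genotypes is a list of (allele, allele) tuples, one per individual.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes supplied")
    reference = genotypes[0][0]
    # Frequency of the reference allele across all 2n chromosomes.
    p = sum(genotype.count(reference) for genotype in genotypes) / (2 * n)
    ho = sum(1 for a, b in genotypes if a != b) / n
    he = 2 * p * (1 - p)
    fis = 1 - ho / he if he > 0 else float("nan")
    return ho, he, fis


# Hypothetical genotypes for four fish at one SNP.
ho, he, fis = diversity_stats([("A", "A"), ("A", "T"), ("T", "T"), ("A", "T")])
```

In this example both allele frequencies are 0.5 and half the fish are heterozygous, so Ho and He are both 0.5 and Fis is 0.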

ECDFS for Sim v Real BFs.R
This script contains code for Mann-Whitney U-tests and plots of empirical cumulative distribution functions for Bayes Factors from raw data and permuted data.
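
The two ingredients named here can be sketched in a few lines of Python. The block below evaluates an empirical cumulative distribution function at a point and computes the Mann-Whitney U statistic by direct pair counting. It does not compute a p-value, and the sample values are hypothetical rather than Bayes Factors from this dataset.

```python
def ecdf_at(values, x):
    """Empirical CDF: the fraction of values less than or equal to x."""
    if not values:
        raise ValueError("no values supplied")
    return sum(1 for v in values if v <= x) / len(values)


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs (a, b) with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical values standing in for observed and permuted Bayes Factors.
observed = [3, 5, 8]
permuted = [1, 2, 4]
u_observed = mann_whitney_u(observed, permuted)
```

With three values in each sample there are nine pairs; the observed value is larger in eight of them, so U is 8.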

fsc_CIs.R
This script reads in the parameter estimates from the best maximum likelihood run for each bootstrapped SFS from fastsimcoal2 and calculates the 95% confidence intervals. 
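
A simple way to turn bootstrap estimates into a 95% confidence interval is the percentile method, sketched below in Python. Whether fsc_CIs.R uses the percentile method or another interval type is not stated here, so treat this as an assumption; the function name and example values are hypothetical.

```python
def percentile_interval(values, level=0.95):
    """Percentile confidence interval with linear interpolation between ranks."""
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    alpha = (1.0 - level) / 2.0

    def quantile(q):
        position = q * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])

    return quantile(alpha), quantile(1.0 - alpha)


# Hypothetical bootstrap estimates of one parameter (the integers 1 to 101).
ci_low, ci_high = percentile_interval(list(range(1, 102)))
```

The same function would be applied to each parameter column (POPONE, POPTWO, and so on) in turn.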

Fst_script.R
This script calculates the per-SNP Fst and the site pairwise Fst values from the VCF.
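
To show what a per-SNP Fst measures, the Python sketch below computes Wright's Fst, (HT - HS) / HT, for one biallelic SNP from the allele frequencies in two equally weighted populations. The estimator actually used in Fst_script.R is not stated here and may differ (for example, a sample-size-corrected estimator), so this is only a conceptual illustration.

```python
def wright_fst(p1, p2):
    """Wright's Fst for one biallelic SNP in two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity in the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0:
        return float("nan")  # SNP is monomorphic across both populations
    return (h_total - h_within) / h_total
```

Two populations fixed for different alleles give an Fst of 1, and two populations with identical allele frequencies give an Fst of 0.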

PCAs.R
This script reads in eigenvector information from plink and creates PCA plots.

pi.R
This script reads in the .csv files containing the site pi output from VCFtools and calculates the mean (+ bootstrapped 95% confidence interval) for pi in each site and with all sites pooled.

Pull_BFs.R
This script reads in raw summary_betai.txt files from BayPass and reorganizes them into one .csv file for easier downstream analysis. It also pulls out candidate SNPs. 

RDA.R
This script runs a redundancy analysis (RDA), creates bi-plots for visualization, and identifies outliers from RDA using two different methods.

relatedness.R
This script calculates the pairwise relatedness (point estimates for all possible pairs, mean within-site relatedness, and 95% confidence interval for within-site relatedness).

STRUCTURE_script.R
This script reads in STRUCTURE output files, runs CLUMPP, and creates output plots for visualization of STRUCTURE results. It also creates maximum likelihood (ML) and Evanno method plots to identify the "best" value of K.
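
The Evanno method ranks values of K by delta K, the absolute second-order change in the log probability of the data divided by its standard deviation across replicate runs. The Python sketch below uses a common formulation based on the mean Ln P(D) at each K. It is not the code in STRUCTURE_script.R, and the replicate values are made up.

```python
from statistics import mean, stdev


def evanno_delta_k(lnprob_by_k):
    """Return {K: delta K} for every K that has both neighbours K-1 and K+1.

    lnprob_by_k maps K to a list of Ln P(D) values from replicate runs.
    """
    means = {k: mean(values) for k, values in lnprob_by_k.items()}
    delta = {}
    for k in sorted(lnprob_by_k):
        if k - 1 in means and k + 1 in means:
            spread = stdev(lnprob_by_k[k])
            if spread > 0:
                second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
                delta[k] = second_diff / spread
    return delta


# Hypothetical Ln P(D) values from three replicate runs at each K.
example_runs = {
    1: [-1000, -1002, -998],
    2: [-800, -801, -799],
    3: [-790, -795, -785],
    4: [-788, -780, -796],
}
delta_k = evanno_delta_k(example_runs)
best_k = max(delta_k, key=delta_k.get)
```

Delta K is undefined for the smallest and largest K tested, so the method cannot select either of them.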

TajimaD_script.R
This script reads in the .csv files containing Tajima's D output from VCFtools and calculates the mean (+ standard error) Tajima's D in each site and with all sites pooled. It also contains code for Mann-Whitney U-tests and plots of empirical cumulative distribution functions (outlier vs. all transcripts).

TD_v_pi.R
This script creates plots of Tajima's D v. pi for each sampling site and for all sampling sites pooled.

Write Simulation Pop Data for BayPass.R
This script creates permuted datasets to run in BayPass and create distributions of Bayes Factors under the null hypothesis (no association between allele frequencies and environmental factors). 

Write_Sim_Data_Het_CIs.R
This script writes simulation populations and bootstraps them to create the 95% confidence intervals for Ho, He, and Fis.

XtX_Calibration.R
This script contains modified code from Gautier (2015) to create pseudo-observed datasets (PODs) to generate 95% significance thresholds for XtX values. Once the PODs are run in BayPass, it also contains modified code to calculate the XtX significance cut-off.
 

Upstream Analyses

This directory contains the code for upstream analyses that create the input files for some of the R scripts. All scripts are written to be run on Amarel, Rutgers' high-performance computing cluster.

BayPass_upstream.md
This file contains code for running BayPass. It creates input files for Pull_BFs.R & XtX_Calibration.R and takes input files from Write Simulation Pop Data for BayPass.R.

Fastsimcoal2_upstream.md
This file contains code for running fastsimcoal2. It also contains information on how to create SFS with easySFS.py. It creates best point estimates for demographic parameters and input files for fsc_CIs.R.

PCA_upstream.md
This file contains code for running plink to generate eigenvalue and eigenvector inputs for PCAs. It creates input files for PCAs.R.

pi_upstream.md
This file contains code for calculating site-pi using VCFtools. It creates input files for pi.R & TD_v_pi.R.

StairwayPlot_upstream.md
This file contains code for running Stairway Plot v.2. It creates figures with estimates of demographic histories for each sampling site.

STRUCTURE_upstream.md
This file contains code for running STRUCTURE. It creates input files for STRUCTURE_script.R.

TajimasD_upstream.md
This file contains code for running SnpEff and calculating Tajima's D using VCFtools. It also contains information on how mapping to Amphiprion frenatus was done. It creates input files for TajimaD_script.R & TD_v_pi.R.