ClinePlotR: Visualizing genomic clines and detecting outliers in R

Martin, Bradley T.1 ; Chafin, Tyler K.1 ; Douglas, Marlis R.1 ; Douglas, Michael E.1

Published Aug 17, 2020 on Dryad. https://doi.org/10.5061/dryad.b2rbnzsc8

Data files

Aug 17, 2020 version files 5.26 GB

exampleData.zip

5.26 GB

Abstract

Patterns of multi-locus differentiation (i.e., genomic clines) often extend broadly across hybrid zones and their quantification can help diagnose how species boundaries are shaped by adaptive processes, both intrinsic and extrinsic. In this sense, the transitioning of loci across admixed individuals can be contrasted as a function of the genome-wide trend, in turn allowing an expansion of clinal theory across a much wider array of biodiversity. However, computational tools that serve to interpret and consequently visualize ‘genomic clines’ are limited.

Here, we introduce the ClinePlotR R-package for visualizing genomic clines and detecting outlier loci using output generated by two popular software packages, bgc and Introgress.

ClinePlotR bundles both input generation (i.e, filtering datasets and creating specialized file formats) and output processing (e.g., MCMC thinning and burn-in) with functions that directly facilitate interpretation and hypothesis testing. Tools are also provided for post-hoc analyses that interface with external packages such as ENMeval and RIdeogram

Our package increases the reproducibility and accessibility of genomic cline methods, thus allowing an expanded user base and promoting these methods as mechanisms to address diverse evolutionary questions in both model and non-model organisms.

Transcriptomic alignment

Terrapene carolina carolina and Terrapene mexicana triunguis ddRADseq reads were mapped to the Terrapene mexicana triunguis reference transcriptome (GenBank Accession ID: GCA_002925995.2). Only mapped reads were retained to obtain a "transcriptome alignment."

Scaffold alignment

Terrapene carolina carolina and Terrapene mexicana triunguis ddRADseq reads were mapped to the Terrapene mexicana triunguis reference genome (GenBank Accession ID: GCA_002925995.2), which in this case consisted of unplaced scaffolds. Thus, this alignment is dubbed the "scaffold alignment."

Each alignment was done separately in ipyrad v0.7.30 and mapping was done with minimap2. The alignments were also required to have >50% minimum coverage per site and a minimum 20X coverage depth.

Running BGC

Using vcftools, VCF files for each of the above alignments were filtered to retain only bi-allelic SNPs, thinned to only one SNP per ddRAD locus, and subjected to a minimum minor allele frequency (MAF) filter of 5.0%.

vcf2bgc.py was then run to convert the ipyrad-style VCF files to BGC format, with the genotype uncertainty model incorporating read counts.

BGC (Bayesian Genomic Clines) was run on both alignments separately, with five independent runs each. 1.8 million MCMC iterations for the transcriptomic alignment and 1.0 million for the scaffold alignment were discarded as burn-in, followed by 200,000 post burn-in generations each. Samples were thinned to one every 50 iterations. The genotype uncertainty model was used with a sequencing error rate of 0.001-0.002 (determined using ipyrad).

ClinePlotR

bgcPlotter

For each alignment, the five independent runs were aggregated to yield 20,000 total MCMC samples, with LnL and parameter traces confirming appropriate mixing and convergence.

ClinePlotR was then used to identify significant outlier loci using two established criteria (Gompert and Buerkle, 2011, 2012). A Phi plot and an alpha X beta contour plot were made for each alignment.

For the chromosome plot, the transcriptome and scaffold alignments were mapped to the closely related, chromosome-level Trachemys scripta reference genome using minimap2 and PAFScaff (since the Terrapene reference genome is scaffold-level only). This allowed us to convert scaffold coordinates to chromosome coordinates.

The chromosome plot (ideogram) was made by joining GFF annotation information with the transcriptIds from the ddRAD dataset. Both these and the scaffoldIds that now contained chromosome coordinates were input into our plot_outlier_ideogram() function, which uses RIdeogram to make the chromosome plot. The plot is a karyotype-style chromosome plot with alpha and beta values represented as heatmap bands to visualize the distribution of outliers across the genome.

Introgress functions

The 19 BioClim raster layers plus mean annual solar radiation and mean annual wind speed were obtained from https://worldclim.org. A layer from the National Land-cover Database (2011) was also obtained. These layers were input into our prepare_rasters() function to crop them to the same sampling extent (plus a small buffer). ENMeval was then run to find the most informaive raster features. The raster values at each sampling locality were extracted using our extractPointValues() function.

Introgress was then run using our wrapper function, runIntrogress(). A minimum allele frequency differential (delta) value of 0.8 was required for a locus to be retained. The estimated genomic clines and hybrid indices were then input into the clinesXenvironment() function to plot genomic clines and hybrid indices X latitude, longitude, and the environmental variables (raster values).

exampleDataset.zip

Running Example Dataset

To run the example dataset, use the R scripts included in ClinePlotR/scripts (located on GitHub: https://github.com/btmartin721/ClinePlotR)

Directory structure

bgc/
- genes (transcriptomic alignment)
  - eatt_bgc_p0in.txt (parental 0 data from bgc analysis)
  - eatt_bgc_p1in.txt (parental 1 data from bgc analysis)
  - eatt_bgc_admixedin.txt (admixed data from bgc analysis)
- fulldataset (scaffold alignment)
  - eatt_bgc_p0in.txt (parental 0 data from bgc analysis)
  - eatt_bgc_p1in.txt (parental 1 data from bgc analysis
  - eatt_bgc_admixedin.txt (admixed data from bgc analysis)
  - bgc_lociFiles
    - genes
      - eatt_bgc_loci.txt (transcript_IDs\tSNP_position)
    - fulldataset
      - eatt_bgc_loci.txt (scaffold_IDs\tSNP_position)
  - bgc_outputFiles
    - genes (bgc output files from transcriptomic alignment)
    - fulldataset (bgc output files from scaffold alignment)
  - bgc_vcfFiles
    - filtered VCF files that were converted to BGC format using vcf2bgc.py (see ClinePlotR/scripts)
  - bgcPlotter_output
    - all output plots and tables from bgcPlotter functions
  - scafInfo
    - output table with scaffold info
ENMeval_bioclim/
- localityInfo
  - Sample locality info; sampleIds,popIds,lat,lon (coordinates in decimal degrees)
- outputFiles
  - All output plots and tables from ClinePlotR ENMeval functions
- rasterLayers
  - original
    - two example original raster files
    - only these two examples were used because including all of them would be too large to include here (>150GB)
  - cropped
    - raster layers cropped to sampling extent
    - these are the layers actually used in the ENMeval analysis
- Robjects
  - R objects that can be loaded into R with readRDS()
  - represent output from different steps in the pipeline
gff/
- GFF file from Terrapene mexicana triunguis reference genome (GenBank Accession ID: GCA_002925995.2)
introgress/
- extractedRasterValues
  - raster values extracted from each sample point
- inputFiles
  - INTROGRESS input files
    - EATT_p1data.txt
    - EATT_p2data.txt
    - EATT_admix.txt
    - EATT_loci.txt
    - eatt_inds.txt (individuals used for this analysis)
- outputFiles
  - INTROGRESS output tables and plots
- outputRobject
  - R objects saved from running INTROGRESS
PAFScaff/
- output *.tdt file from PAFScaff
- used to make the chromosome plot
popmaps/
- Population maps showing which individuals belong to which population
- two columns: indID\tpopID
- Separate popmaps for bgc and ENMeval