Data from: Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs)

Georges, Arthur

Research facility: Institute for Applied Ecology, University of Canberra

Published Jan 15, 2024 on Dryad. https://doi.org/10.5061/dryad.4b8gthtkn

Data files

Files in the Jan 15, 2024 version (90.81 MB in total):

Eulamprus_filtered.Rdata -- 1.78 MB
README.md -- 3.74 KB
ref_variables.csv -- 4.98 KB
Script_S1.R -- 5.40 KB
Script_S2.R -- 22.96 KB
seed_gl.Rdata -- 6.50 MB
silicodart_starting_data.Rdata -- 18.64 MB
sim_variables.csv -- 4.32 KB
snp_starting_data.Rdata -- 63.84 MB

Abstract

Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands if not hundreds of thousands of loci.

We examine the application of distance measures to SNP genotypes and sequence tag presence-absences (SilicoDArT) and use real datasets and simulated data to illustrate pitfalls in the application of genetic distances and their visualization.

The datasets used to illustrate points in the associated review are provided here together with the R script used to analyse the data. Data are either simulated within this script or are SNP data generated as part of other studies and included as compressed binary files, readily accessible by reading them into R with the base R function readRDS(). Refer to the analysis script for examples.

This Dryad entry contains the data files and associated R scripts used to generate the analyses presented in the companion review article. They include SNP datasets for Australian turtles and the Australian Blue Mountains skink, and the associated SilicoDArT data (null alleles matrix) for the turtles. The datasets are for illustration purposes and have been modified to meet the requirements of the analyses being presented.

Description of the data and file structure

The turtle SNP data comprise a matrix of entities (individuals) versus attributes (loci) taking on the states 0 for homozygous reference allele, 1 for the heterozygous state and 2 for homozygous alternate allele. The data are stored in compressed form as an adegenet genlight object with associated locus metadata (e.g. callrate, reproducibility) and individual metadata (e.g. latitude, longitude, population). The SNP scores can be examined in R with as.matrix(gl); the names of the locus metadata can be obtained with names(gl@other$loc.metrics); the names of the individual metadata can be obtained with names(gl@other$ind.metrics). The SilicoDArT data are similarly stored and accessed, but have states of 0 for absence and 1 for presence. Missing data for both SNP and SilicoDArT data are scored as NA.
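As a minimal sketch of how the 0/1/2 coding can be summarised, the base R example below builds a small hypothetical score matrix (the values and the names ind1, loc1 and so on are invented for illustration) and computes a per-locus call rate and alternate allele frequency. With the deposited data, the matrix would instead come from as.matrix(gl) as described above.

```r
# Toy SNP matrix (hypothetical data): rows are individuals, columns are loci.
# Scores: 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate,
# NA = missing.
snp <- matrix(
  c(0, 1, 2, NA,
    0, 2, 2, 1,
    1, 1, NA, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("ind", 1:3), paste0("loc", 1:4))
)

# Call rate per locus: proportion of individuals with a non-missing score.
call_rate <- colMeans(!is.na(snp))

# Alternate allele frequency per locus: each score counts alternate alleles
# out of two, so the mean score divided by 2 is the frequency.
alt_freq <- colMeans(snp, na.rm = TRUE) / 2

print(round(call_rate, 3))
print(round(alt_freq, 3))
```

Because each score is a count of alternate alleles, the same arithmetic does not apply to the SilicoDArT presence-absence scores, where the column mean is simply the proportion of individuals carrying the tag.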

snp_starting_data.Rdata -- a disk copy of the genlight object containing the SNP scores for the turtle SNP genotypes. Can be accessed in R with readRDS("filename"). Refer to the accompanying Script_S1.R.

silicodart_starting_data.Rdata -- a disk copy of the genlight object containing the presence-absence scores of the SilicoDArT dataset for the turtles. Can be accessed in R with readRDS("filename"). Refer to the accompanying Script_S1.R.

Eulamprus_filtered.Rdata -- a disk copy of the genlight object containing the SNP scores for the Blue Mountains skink SNP genotypes. Can be accessed in R with readRDS("filename"). Refer to the accompanying Script_S1.R.

seed_gl.Rdata -- a disk copy of the genlight object containing the SNP scores for the seed population for the simulation to demonstrate the impact on the PCA of closely related individuals (Figure 9 of the companion manuscript). Refer to the accompanying Script_S2.R.

Script_S1.R -- R script used to access the data and undertake analyses.

ref_variables.csv -- a reference table for the simulations associated with assessing the impact of tight linkage among SNP loci on a PCA. Refer to R script Script_S1.R.

sim_variables.csv -- a reference table for the simulations associated with assessing the impact of tight linkage among SNP loci on a PCA. Refer to R script Script_S1.R.

Script_S2.R -- a copy of the script used to generate the analyses associated with assessing the impact of closely related individuals on a PCA.

To access the datasets (those with .Rdata extension), use my.genlight.object <- readRDS("path/filename"). To interrogate the datasets after loading, use the accessors provided in the R package {adegenet}.
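The access pattern can be tried without the deposited files. The sketch below assumes the files were written with saveRDS() (the text only states that they are read with readRDS()) and uses a hypothetical stand-in object and file name in a temporary directory rather than a real genlight object.

```r
# Stand-in object (hypothetical); the deposited files hold adegenet genlight objects.
toy <- list(scores = matrix(c(0, 1, 2, NA), nrow = 2), note = "hypothetical")

# Write a single object with saveRDS(), using an .Rdata extension as the
# deposited files do.
path <- file.path(tempdir(), "toy_data.Rdata")
saveRDS(toy, path)

# readRDS() returns the object rather than restoring it by name,
# so the result must be assigned.
restored <- readRDS(path)
print(identical(toy, restored))
```

The point of the example is that the .Rdata extension does not imply load(): a file holding a single serialised object is read with readRDS() and assigned to a name of the user's choosing.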

Sharing/Access information

The data and scripts can be used for any purpose, preferably citing the Dryad URL provided.

Code/Software

The scripts used to undertake the analyses depend on the R package dartR, available on the Comprehensive R Archive Network (CRAN), which can be installed as dartR.verse. dartR works with adegenet genlight objects such as those listed above.

A dataset was constructed from a SNP matrix generated for the freshwater turtles in the genus Emydura, a recent radiation of Chelidae in Australasia. The dataset (snp_starting_data.Rdata) includes selected populations that vary in level of divergence to encompass variation within species and variation between closely related species. Sampling localities with evidence of admixture between species were removed. Monomorphic loci were removed, and the data were filtered on call rate (>95%), repeatability (>99.5%) and read depth (5x < read depth < 50x). Where there was more than one SNP per sequence tag, only one was retained at random. The resultant dataset had 18,196 SNP loci scored for 381 individuals from 7 sampling localities or populations: Emydura victoriae [Ord River, NT, n=15], E. tanybaraga [Holroyd River, Qld, n=10], E. subglobosa worrelli [Daly River, NT, n=25], E. subglobosa subglobosa [Fly River, PNG, n=55], E. macquarii macquarii [Murray Darling Basin north, NSW/Qld, n=152], E. macquarii krefftii [Fitzroy River, Qld, n=39] and E. macquarii emmotti [Cooper Creek, Qld, n=85]. The missing data rate was 1.7%; missing values were subsequently imputed by nearest neighbour to yield a fully populated data matrix. The data are a subset of those published by Georges et al. (2018, Molecular Ecology 27:5195-5213) and are for illustrative purposes only. A companion SilicoDArT dataset (silicodart_starting_data.Rdata) is also included.

The above manipulations were performed in R package dartR. Principal Components Analysis was undertaken using the glPCA function of the R adegenet package (as implemented in dartR). Principal Coordinates Analysis was undertaken using the pcoa function of the R package ape (as implemented in dartR).
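To show how the two ordinations relate, the sketch below substitutes base R functions for the ones named above: prcomp() stands in for glPCA and cmdscale() stands in for the ape pcoa function. It is not the analysis script. The data are a hypothetical random matrix with no missing values, and the example relies on the standard result that classical PCoA on Euclidean distances recovers the PCA scores up to the sign of each axis.

```r
set.seed(1)
# Hypothetical toy data: 20 individuals by 50 loci, scores 0/1/2, no missing values.
snp <- matrix(sample(0:2, 20 * 50, replace = TRUE), nrow = 20)

# PCA on the column-centred score matrix (prcomp centres columns by default).
pca <- prcomp(snp)

# Classical PCoA (metric scaling) on Euclidean distances between individuals.
pcoa <- cmdscale(dist(snp), k = 2)

# Compare absolute scores on the first two axes, since axis signs are arbitrary.
max_diff <- max(abs(abs(pca$x[, 1:2]) - abs(pcoa)))
print(max_diff < 1e-8)
```

The agreement holds only for Euclidean distance on complete data; with other distance measures or with missing values the two ordinations can differ.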

To exemplify the effect of missing values on SNP visualisation using PCA, we simulated ten populations that reproduced over 200 non-overlapping generations. Simulated populations were placed in a linear series with low dispersal between adjacent populations (one disperser every ten generations). Each population had 100 individuals, of which 50 individuals were sampled at random. Genotypes were generated for 1,000 neutral loci on one chromosome. We then randomly selected 50% of genotypes and set them as missing data. Principal Components Analysis was undertaken using the glPCA function of the R adegenet package. The R script to implement this is provided (Supplementary_script_for_ms.R).
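The masking step described above can be sketched in base R as follows. The genotype matrix here is hypothetical random data rather than output of the population simulation, and the dimensions (50 individuals by 1,000 loci) are chosen only to keep the example small.

```r
set.seed(42)
# Hypothetical complete genotype matrix: 50 individuals by 1,000 loci.
geno <- matrix(sample(0:2, 50 * 1000, replace = TRUE), nrow = 50)

# Choose exactly half of the cells at random and set them to missing.
n_cells <- length(geno)
missing_idx <- sample(n_cells, size = n_cells / 2)
geno_missing <- geno
geno_missing[missing_idx] <- NA

# Proportion of genotypes now missing.
print(mean(is.na(geno_missing)))
```

Sampling cell indices without replacement gives an exact 50% missing rate across the whole matrix, although the rate for any one individual or locus will vary around that value.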

The data for the Australian Blue Mountains skink Eulamprus leuraensis were generated for 372 individuals collected from 17 swamps isolated to varying degrees in the Blue Mountains region of New South Wales. Tail snips were collected and stored in 95% ethanol. The tissue samples were digested with proteinase K overnight and DNA was extracted using a NucleoMag 96 Tissue Kit (Macherey-Nagel, Düren, Germany) coupled with NucleoMag SEP (Ref. 744900) to allow automated separation of high-quality DNA on a Freedom Evo robotic liquid handler (TECAN, Männedorf, Switzerland). SNP data were generated by the commercial service of Diversity Arrays Technology Pty Ltd (Canberra, Australia) using published protocols. A total of 13,496 loci were scored, which reduced to 7,935 after filtering out secondary SNPs on the same sequence tag, filtering on reproducibility (threshold 0.99) and call rate (threshold 0.95), and removal of monomorphic loci. The resultant data (Eulamprus_filtered.Rdata) are used to demonstrate the impact of a substantial inversion on the outcomes of a PCA.
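Two of the filtering steps above, call rate and removal of monomorphic loci, can be illustrated on a plain score matrix. This is a base R sketch on hypothetical data, not the dartR filtering used to prepare the deposited file. Treating the 0.95 threshold as inclusive is an assumption.

```r
# Hypothetical toy matrix: 4 individuals by 4 loci, scores 0/1/2, NA = missing.
snp <- matrix(
  c(0, 1, 2, NA,
    0, 2, 2, NA,
    0, 1, 2, 1,
    0, 0, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("loc", 1:4))
)

# Keep loci whose call rate (proportion of non-missing scores) meets the
# threshold; inclusive comparison is an assumption of this sketch.
call_rate <- colMeans(!is.na(snp))
kept <- snp[, call_rate >= 0.95, drop = FALSE]

# Flag monomorphic loci: every non-missing score is 0, or every one is 2.
is_monomorphic <- apply(kept, 2, function(x) {
  x <- x[!is.na(x)]
  all(x == 0) || all(x == 2)
})
kept <- kept[, !is_monomorphic, drop = FALSE]

print(colnames(kept))
```

In this toy example loc4 is dropped on call rate (0.5) and loc1 is dropped as monomorphic, leaving loc2 and loc3.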

To test the effect of closely related individuals (parents and offspring) on the PCoA pattern, we ran a simulation using dartR in which two individuals were selected to become the parents of 2-8 offspring. We ran a PCoA for each of the simulated cases. The R code used is included in the R script uploaded here.
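The idea of generating offspring from two parents can be sketched with simple Mendelian inheritance on 0/1/2 scores. This is a hedged base R illustration, not the dartR simulation used for the analyses: the parental genotypes are invented, make_offspring is a hypothetical helper, and loci are assumed to be inherited independently.

```r
set.seed(7)
# Hypothetical parental genotypes at 6 loci (0/1/2 = count of alternate alleles).
mother <- c(0, 0, 1, 1, 2, 2)
father <- c(0, 2, 1, 2, 2, 0)

# Each parent passes on one allele per locus; the chance that it is the
# alternate allele is the parent's score divided by 2.
# Loci are treated as unlinked (an assumption of this sketch).
make_offspring <- function(mother, father) {
  n_loci <- length(mother)
  from_mother <- rbinom(n_loci, size = 1, prob = mother / 2)
  from_father <- rbinom(n_loci, size = 1, prob = father / 2)
  from_mother + from_father
}

# Four offspring, one per row.
offspring <- t(replicate(4, make_offspring(mother, father)))
print(offspring)
```

Loci where both parents are homozygous give the same offspring score every time (for example, a 0 by 2 cross always yields 1), which is part of why a family group shares so much of its genotype.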

Refer to the companion manuscript for links to the literature associated with the above techniques.