Data from: Lineage diversity within a widespread endemic Australian skink to better inform conservation in response to regional-scale disturbance
Data files
Feb 09, 2024 version files 492.19 MB
- bassiana_full.Rdata
- bassiana_ingroup.Rdata
- bassiana_raw.Rdata
- metadata_raw.txt
- R_Code_Final_Dryad.R
- README.md
- Report_DSk20-4927_1_moreOrders_SNP_2.csv
- Report_DSk20-4927_1_moreOrders_SNP_3.csv
Abstract
This dataset was used to examine the phylogeographic genetic structure of the Eastern three-lined skink Bassiana duperreyi. It comprises SNP data used for population genetics and phylogenetic reconstruction. The data provide foundational work for a detailed taxonomic re-evaluation of this species complex and reinforce the need for biodiversity assessment to include an examination of cryptic species and cryptic diversity below the level of species. Such information on lineage diversity within species, and on its distribution in the context of regional-scale disturbance, can be factored into conservation planning regardless of whether a decision is made to formally diagnose new species taxonomically and nomenclaturally.
README: Data from: Lineage diversity within a widespread endemic Australian skink to better inform conservation in response to regional-scale disturbance
https://doi.org/10.5061/dryad.tx95x69zc
This Dryad entry contains the data files and the associated R script that prepares the data for analysis and runs the analyses presented in the companion article. The files include SNP datasets for the Eastern three-lined skink Bassiana duperreyi.
Description of the data and file structure
The SNP data comprise a matrix of entities (individuals) versus attributes (loci) taking the states 0 for the homozygous reference allele, 2 for the homozygous alternate allele and 1 for the heterozygous state. As such, the data have no units of measurement, but each score conveniently counts copies of the alternate allele, so the scores for a locus, summed across individuals and divided by twice the number of scored individuals, give the frequency of the alternate allele.
The data are stored in compressed form as an adegenet genlight object with associated locus metadata (e.g. callrate, reproducibility) and individual metadata (e.g. latitude, longitude, population). The SNP scores can be examined in R with as.matrix(gl); the names of the locus metadata can be obtained with names(gl@other$loc.metrics); the names of the individual metadata can be obtained with names(gl@other$ind.metrics).
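As an illustration of the coding described above, the following base R sketch builds a small made-up genotype matrix of the kind returned by as.matrix(gl) and derives alternate allele frequencies from it. The matrix values and the names geno, alt_count and alt_freq are invented for this example and are not taken from the dataset.

```r
# Toy genotype matrix: rows are individuals, columns are loci.
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
geno <- matrix(
  c(0, 1, 2,
    1, 1, 2,
    2, 0, 2,
    1, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("loc", 1:3))
)

# Each score counts copies of the alternate allele, so a column sum is the
# alternate allele count; dividing by 2n (diploid individuals) gives its
# frequency.
alt_count <- colSums(geno)
alt_freq  <- alt_count / (2 * nrow(geno))

print(alt_count)
print(alt_freq)
```

This toy matrix has no missing scores; with real data, NA values would need to be excluded from both the sum and the denominator.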
Locus metadata are:
Field | Description |
---|---|
AlleleID | Unique identifier for the sequence in which the SNP marker occurs |
AlleleSequence | In 1 row format: the sequence of the Reference allele. In 2 rows format: the sequence of the Reference allele is in the Ref row, the sequence of the SNP allele in the SNP row |
AvgCountRef | The sum of the tag read counts for all samples, divided by the number of samples with non-zero tag read counts, for the Reference allele row |
AvgCountSnp | The sum of the tag read counts for all samples, divided by the number of samples with non-zero tag read counts, for the SNP allele row |
AvgPIC | The average of the polymorphism information content (PIC) of the Reference and SNP allele rows |
CallRate | The proportion of samples for which the genotype call is either "1" or "0", rather than "-" |
CloneID | Unique identifier for the sequence in which the SNP marker occurs |
FreqHets | The proportion of samples which score as heterozygous |
FreqHomRef | The proportion of samples which score as homozygous for the Reference allele |
FreqHomSnp | The proportion of samples which score as homozygous for the SNP allele |
OneRatioRef | The proportion of samples for which the genotype score is "1", in the Reference allele row |
OneRatioSnp | The proportion of samples for which the genotype score is "1", in the SNP allele row |
PICRef | The polymorphism information content (PIC) for the Reference allele row |
PICSnp | The polymorphism information content (PIC) for the SNP allele row |
RepAvg | The proportion of technical replicate assay pairs for which the marker score is consistent |
SNP | In 2 rows format: this column is blank in the Reference row, and contains the base position and base variant details in the SNP row. In 1 row format: contains the base position and base variant details |
SnpPosition | The position (zero indexed) in the sequence tag at which the defined SNP variant base occurs |
TrimmedSequence | Same as the full sequence, but with removed adapters in short marker tags |
Individual metadata are:
Field | Description |
---|---|
no | Specimen identifier |
id | Sample identifier |
pop | Population code |
popname | Population name |
legend | Location label useful for figure legends |
bioregion | Bioregion from which the animal was captured |
species | Species of lizard |
elevation | Elevation above sea level |
age | Age category, e.g. adult, neonate |
lat | Latitude of location of capture |
lon | Longitude of location of capture |
Missing data are scored as NA. Missing data arise either because the target sequence tag is present in the genome but missed by chance owing to finite read depth, or because the sequence tag is not amplified owing to a mutation at one or both of the restriction enzyme sites (null allele).
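A minimal base R sketch of how NA scores translate into call rates is given below, assuming a made-up genotype matrix; the object names (geno_na, call_rate_loc, call_rate_ind) are illustrative only. Note that the CallRate column in the locus metadata is supplied by DArT, so values recalculated this way after subsetting individuals may differ from it.

```r
# Toy genotype matrix with missing scores (NA).
geno_na <- matrix(
  c( 0, NA, 2,
     1,  1, 2,
    NA,  0, 2,
     1, NA, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("loc", 1:3))
)

# Call rate = proportion of non-missing scores,
# per locus (columns) and per individual (rows).
call_rate_loc <- colMeans(!is.na(geno_na))
call_rate_ind <- rowMeans(!is.na(geno_na))

# Alternate allele frequency using only the scored individuals at each locus.
alt_freq_na <- colSums(geno_na, na.rm = TRUE) / (2 * colSums(!is.na(geno_na)))

print(call_rate_loc)
print(call_rate_ind)
print(alt_freq_na)
```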
1. Report_DSk20-4927_1_moreOrders_SNP_2.csv
Raw data as provided by Diversity Arrays Technology in 2-row format. Refer to https://www.diversityarrays.com/ for details of this format, and to Georges et al. (2018) for an overview of how the data were generated.
2. Report_DSk20-4927_1_moreOrders_SNP_3.csv
Raw data as provided by Diversity Arrays Technology in 2-row format. Refer to https://www.diversityarrays.com/ for details of this format, and to Georges et al. (2018) for an overview of how the data were generated.
3. metadata_raw.csv
Metadata associated with each individual, including population assignments, sex, stage of maturity and location of capture.
4. bassiana_raw.Rdata
Contains the raw data, as per Report_DSk20-4927_1_moreOrders_SNP_2.csv and Report_DSk20-4927_1_moreOrders_SNP_3.csv, in binary format. Can be read into dartR using gl <- readRDS(file="bassiana_raw.Rdata")
5. bassiana_full.Rdata
Contains the data post filtering (see R_code_final_Dryad.r) for all individuals in the study. Can be read into dartR using gl <- readRDS(file="bassiana_full.Rdata")
6. bassiana_ingroup.Rdata
Contains the data post filtering (see R_code_final_Dryad.r) for the ingroup taxon *Bassiana duperreyi* only. Can be read into dartR using gl <- readRDS(file="bassiana_ingroup.Rdata")
7. R_code_final_Dryad.r
R code used to generate the results. The point of entry is indicated, and the script begins by reading in the data from one of bassiana_raw.Rdata, bassiana_full.Rdata or bassiana_ingroup.Rdata using the function readRDS(). To access the datasets (those with the .Rdata extension), use my.genlight.object <- readRDS("path/filename") or dartR.base::gl.load("path/filename"). To interrogate the datasets after loading, use the accessors provided in the R package {adegenet}. Refer to the accompanying ms_initial_read.R.
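The loading step can be sketched in base R as follows. The helper load_snp_object() is hypothetical (it is not part of dartR or adegenet), and the demonstration uses a stand-in object written to a temporary file so that the sketch runs without the Dryad files being present.

```r
# Hypothetical helper: read one of the .Rdata files (which the README says are
# read with readRDS()) and fail early with a clear message if the path is wrong.
load_snp_object <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  readRDS(path)
}

# Round-trip demonstration on a stand-in object saved to a temporary file.
tmp <- tempfile(fileext = ".Rdata")
saveRDS(list(note = "stand-in for a genlight object"), tmp)
obj <- load_snp_object(tmp)
str(obj)

# With the real data and the adegenet package installed, one might then run:
# gl <- load_snp_object("bassiana_ingroup.Rdata")
# adegenet::nInd(gl)   # number of individuals
# adegenet::nLoc(gl)   # number of loci
# adegenet::pop(gl)    # population assignment of each individual
```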
Code/Software
The script necessary to undertake the SNP analyses depends on the R software package dartR.base available on the Comprehensive R Archive Network (CRAN). dartR.base works with adegenet genlight objects such as those listed above.
Software for undertaking phylogenetic analysis on the SNPs includes SVDQuartets (Singular Value Decomposition Quartets; Chifman & Kubatko, 2014), available through PAUP* (Swofford, 2003).
Data availability
These data are freely and unconditionally available for download and use. Although some of the data were generated using a commercial service, the intellectual property associated with the data resides entirely with the authors, and there are no restrictions on their use.
References
Chifman, J., & Kubatko, L. (2014). Quartet inference from SNP data under the coalescent model. Bioinformatics, 30, 3317–3324. https://doi.org/10.1093/bioinformatics/btu530
Georges, A., Gruber, B., Pauly, G., Adams, M., White, D., Young, M., Kilian, A., Zhang, X., Shaffer, H. B., & Unmack, P. J. (2018). Genome-wide SNP markers breathe new life into phylogeography and species delimitation for the problematic short-necked turtles (Chelidae: Emydura) of eastern Australia. Molecular Ecology, 27, 5195–5213. https://doi.org/10.1111/mec.14925
Gruber, B., Unmack, P. J., Berry, O. F., & Georges, A. (2018). dartR: An R package to facilitate analysis of SNP data generated from reduced representation genome sequencing. Molecular Ecology Resources, 18(3), 691–699. https://doi.org/10.1111/1755-0998.12745
Kilian, A., Wenzl, P., Huttner, E., Carling, J., Xia, L., Blois, H., Caig, V., Heller-Uszynska, K., Jaccoud, D., Hopper, C., Aschenbrenner-Kilian, M., Evers, M., Peng, K., Cayla, C., Hok, P., & Uszynski, G. (2012). Diversity Arrays Technology: a generic genome profiling technology on open platforms. In F. Pompanon & A. Bonin (Eds.), Data Production and Analysis in Population Genomics: Methods and Protocols (pp. 67–89). Humana Press. https://doi.org/10.1007/978-1-61779-870-2_5
Mijangos, J., Gruber, B., Berry, O., Pacioni, C., & Georges, A. (2022). dartR v2: an accessible genetic analysis platform for conservation, ecology, and agriculture. Methods in Ecology and Evolution, 13, 2150–2158. https://doi.org/10.1111/2041-210X.13918
Swofford, D. L. (2003). PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates.
Methods
Briefly, tissue samples were collected from across the range of the species, Bassiana duperreyi, including from Australian museums. DNA was extracted, double digested and genotyped for SNP markers using the technology of Diversity Arrays Technology (DArT, Canberra). The data were analysed in the software package dartR, available on the CRAN repository, as per the script provided. Structure across the landscape was used to inform assessment of the impact of regional-scale disturbance.
Skin tissues and extracted DNA were provided to DArT for processing, sequencing and informative SNP marker identification using DArTseq™ (Kilian et al., 2012). DArT performed genome complexity reduction by double digestion of genomic DNA with two restriction endonucleases, PstI (5′-CTGCA|G-3′) and SphI (5′-GCATG|C-3′), followed by fragment-size selection and next-generation sequencing on an Illumina HiSeq2500 (CA, USA). Sequences were processed using proprietary DArT analytical pipelines (for full details refer to Georges et al., 2018). Initial filtering was based primarily on the average and variance of sequencing depth, average allele counts and call rate across samples. Approximately one-third of samples were sequenced twice as technical replicates, and scoring consistency between replicates was used to identify high-quality SNP markers with low error rates. We applied further quality control filtering using the R package dartR 2.7.2 (Gruber et al., 2018; Mijangos et al., 2022). These filters removed loci with reproducibility across technical replicates below 99%, loci and individuals with more than 5% missing data (call rate), loci with read depth below 8x or above 50x (to remove low-coverage SNPs and potential paralogs), and all but one SNP where a locus carried several.
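The logic of these locus filters can be sketched in base R on a made-up table of locus metrics. This is an illustration of the thresholds only, not the authors' script, which relies on dartR; the filtering of individuals on missing data is omitted here. RepAvg and CallRate follow the locus metadata described in the README, whereas rdepth (mean read depth) and tag (sequence tag identifier) are hypothetical column names introduced for this example.

```r
# Toy locus metrics. RepAvg and CallRate follow the README's locus metadata;
# rdepth and tag are hypothetical columns used only in this sketch.
loc <- data.frame(
  locus    = paste0("snp", 1:6),
  tag      = c("t1", "t1", "t2", "t3", "t4", "t5"),
  RepAvg   = c(1.00, 1.00, 0.98, 1.00, 0.995, 1.00),
  CallRate = c(0.99, 0.97, 1.00, 0.90, 0.96, 1.00),
  rdepth   = c(12, 15, 20, 10, 60, 6),
  stringsAsFactors = FALSE
)

keep <- loc$RepAvg >= 0.99 &           # reproducibility of at least 99%
  loc$CallRate >= 0.95 &               # no more than 5% missing data
  loc$rdepth >= 8 & loc$rdepth <= 50   # read depth between 8x and 50x
filtered <- loc[keep, ]

# Retain a single SNP per sequence tag (here, simply the first one listed).
filtered <- filtered[!duplicated(filtered$tag), ]
print(filtered$locus)
```

In this toy table only snp1 survives: snp3 fails on reproducibility, snp4 on call rate, snp5 and snp6 on read depth, and snp2 shares a sequence tag with snp1.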
Usage notes
The script necessary to undertake the SNP analyses depends on the R software package dartR.base available on the Comprehensive R Archive Network (CRAN). dartR.base works with adegenet genlight objects such as those in the file list provided.
Software for undertaking phylogenetic analysis on the SNPs includes SVDQuartets (Singular Value Decomposition Quartets; Chifman & Kubatko, 2014), available through PAUP* (Swofford, 2003).