Data from: Conservation prioritisation of genomic diversity to inform management of a declining mammal species

Data files

Feb 23, 2024 version files 36.29 GB

barcodes.zip
3.81 KB
filtered.zip
3.67 MB
rawseq.zip
36.23 GB
README.md
11.82 KB
unfiltered.zip
51.88 MB

Abstract

In our present age of extinction, conservation managers must use limited resources efficiently to conserve species and the genetic diversity within them. To conserve intraspecific variation, we must understand the geographic distribution of the variation and plan management actions that will cost-effectively maximise its retention. Here, we use a genome-wide single-nucleotide polymorphism (SNP) dataset consisting of 12,962 loci and 384 individuals to inform conservation management of the Endangered northern quoll (Dasyurus hallucatus), a carnivorous marsupial distributed patchily across northern Australia. Many northern quoll populations have declined or are currently declining, driven by the range-expanding cane toad (Rhinella marina). We (1) confirm population genomic structure, (2) investigate the contribution of each population to overall diversity, (3) conduct genomic prioritisation analyses at several spatial and hierarchical scales using popular conservation planning algorithms, and (4) investigate patterns of inbreeding. We find that the conservation of a single population, or even several populations, will not prevent the loss of substantial amounts of genomic variation and adaptive capacity. Rather, the conservation of at least eight populations from across the species distribution is necessary to retain 90% of SNP alleles. We also show that more geographically isolated populations, such as those on islands, have very small contributions to overall diversity and show relatively high levels of inbreeding compared to mainland populations. Our study highlights the importance of conserving multiple genetically distinct populations to effectively conserve genetic diversity in species undergoing widespread declines, and demonstrates the importance of using multiple criteria to inform and prioritise conservation management.

https://doi.org/10.5061/dryad.kwh70rzbf

Author/Principal Investigator Information
Name: Brenton von Takach
ORCID: https://orcid.org/0000-0002-7999-3521
Institution: Curtin University
Email: brenton.vontakach@curtin.edu.au

Co-investigator Information
Name: Sam Banks
ORCID: https://orcid.org/0000-0003-2415-0057
Institution: Charles Darwin University
Email: sam.banks@cdu.edu.au

Date of data collection
1993 to 2021

Geographic location of data collection
Queensland, Northern Territory and Western Australia, Australia

Description of the data and file structure
This project uses ddRAD sequencing data from three rounds of sequencing of 568 northern quoll tissue samples. The first two rounds of sequencing data are available via the BioPlatforms Australia data portal (links below).

The third round of sequencing data is included as part of this data upload. This third round was paired end sequencing but only the forward reads were used and are included here. For full details see the published manuscript.

The three rounds of sequencing data were demultiplexed in four stages. All commands, barcodes and information relating to these are also included below.

Individual quoll SNP genotypes were called using ANGSD, and primarily analysed in R. Descriptions of all genotyping and analysis commands and scripts are included as part of the file lists below.

Information about funding sources that supported the collection of the data:

Bioplatforms Australia is a non-profit organisation that supports Australian Life science research by investing in state-of-the-art infrastructure and expertise in genomics, proteomics, metabolomics and bioinformatics.
Investment funding for Bioplatforms Australia is provided by the Commonwealth Government National Collaborative Research Infrastructure Strategy.
The Bioplatforms Australia Oz Mammal Genomics Initiative is one of many national collaborative projects that generate high-impact data and knowledge resources to support some of Australia’s biggest scientific challenges (https://data.bioplatforms.com/about).
These challenges span agriculture, biomedicine, the environment and industry, as well as extending to relevant international endeavours.

See published manuscript for complete Acknowledgements.

Sharing/access information

Reference genome

The chromosome-length genome assembly used in the analysis is available here:
https://www.dnazoo.org/assemblies/Dasyurus_hallucatus
This genome assembly is a collaborative effort of DNA Zoo labs with Assoc. Prof. Ben Phillips and Dr. Stephen Frankenberg from the University of Melbourne along with Dr. Adnan Moussalli from Museums Victoria.
The chromosome-length assembly is based on a draft assembly produced using 10x Genomics Chromium linked-read sequencing of a male northern quoll fibroblast cell line, established from a tissue sample kindly provided by the Territory Wildlife Park, and assembled using the Supernova assembler with additional funding from the Hermon-Slade Foundation.

Raw sequencing data

The third round of sequencing data has been provided as part of this data package (see "rawseq" folder file list below). The first two rounds of sequencing are available at the following three links:
https://data.bioplatforms.com/dataset/bpa-omg-genomics-ddrad-102_100_100_52650-htyj5drxx
https://data.bioplatforms.com/dataset/bpa-omg-genomics-ddrad-102_100_100_52623-hnvthbgxf
https://data.bioplatforms.com/dataset/bpa-omg-genomics-ddrad-102_100_100_52623-h7mh7bgxf
These sequencing data are free and publicly accessible, but you may need to create a login profile to download the data.
See the "c_dasy_dplex.txt" file in the "scripts" folder and the barcode files in the "barcodes" folder for all information and commands used in demultiplexing the raw sequence data.

Data and file overview

File list

Folder 'rawseq':
This folder contains the two forward read raw sequence data files that make up the third round of sequencing data.

12730-1A_HMKFYDRX2_CTTGTA_L001_R1.fastq.gz = first half of the third round of the sequencing data. gzip format.
12730-1B_HMKFYDRX2_ACAGTG_L001_R1.fastq.gz = second half of the third round of the sequencing data. gzip format.

Folder 'barcodes':
This folder contains all the barcodes used to demultiplex all raw sequence data using the process_radtags function of STACKS.

dasy_index_barcodes_1.txt = barcodes used for first batch of demultiplexing (files beginning 52623_H7M)
dasy_index_barcodes_2.txt = barcodes used for second batch of demultiplexing (files beginning 52623_HNV)
dasy_index_barcodes_3.txt = barcodes used for third batch of demultiplexing (files beginning 52650_HTY)
dasy_index_barcodes_4_NEW_NEWER.txt = all barcodes used for fourth batch of demultiplexing (files beginning 12730-1A and 12730-1B)

Folder 'scripts':
This folder contains all scripts used to analyse the data and produce the results and outputs for the manuscript.

Subfolder 'commandline':
This folder contains all command line scripts and programs used in the analyses.

b_index_ref_genome.txt = command used to index the reference genome
c_dasy_dplex.txt = commands used to demultiplexing all raw fastq files using process_radtags
d_dasy_bwa_mapNEW.rb = Ruby script that runs mapping of all demultiplexed files to the reference genome, using BWA MEM command
e_dasy_sam_bamNEW.rb = Ruby script to convert SAM files to BAM files
f_dasy_filt_sort_indexNEW.rb = Ruby script that filters out unmapped reads from BAM files, sorts reads in BAM files, and indexes BAM files
g_angsd.txt = ANGSD commands used to call genotypes with information on each run
Metapop2_linux_code_quoll.txt = commands used to install and run metapop2 on the pawsey nimbus cloud instance

Subfolder 'R':
This folder contains all the R scripts used in the analyses.

dasy_genomics_20230620_publish.R = primary R script used to filter SNPs, output final genotypes, conduct some of the analyses, make figures, and prepare input files for other analyses.
nimbus_script_prioritisation.R = formal prioritisation analysis of allelic conservation for populations within whole species distribution. run on remote pawsey nimbus instance.
nimbus_script_prioritise_KIMB.R = formal prioritisation analysis of allelic conservation for populations within the Kimberley region. run on remote pawsey nimbus instance.
nimbus_script_prioritise_regions.R = formal prioritisation analysis of allelic conservation for regions across the whole species distribution. run on remote pawsey nimbus instance.
nimbus_script_prioritise_het.R = calculation of heterozygosities for all possible combinations of populations within species. run on remote pawsey nimbus instance.
Metapop2_quollsFINAL.R = script used to prepare all the subsets of populations and work with input and output files for MetaPop2 software.

Folder 'unfiltered':
This folder contains the unfiltered genotypes from ANGSD and the sample metadata prior to filtering.

sampinfo190623.csv = Metadata for all 501 samples used in ANGSD to call genotypes (a subset of all samples sent for sequencing - see "g_angsd.txt" file for details about ANGSD runs). Contact Brenton von Takach for locations of samples for the Yampi population. Headers include:
- sample - name given to individual
- state - Australian jurisdiction where the tissue sample was collected
- region - regional name of location where the tissue sample was collected
- popn - name given to locality in which a given set of tissue samples was collected.
- lat - latitude in decimal degrees of where an individual was captured
- long - longitude in decimal degrees of where an individual was captured
- treatment - category defining whether the population was on an 'island', had been exposed to cane toads ('post-toad'), or was naive to cane toads ('pre-toad').
- date - date that the tissue sample was collected
- sex - category of biological sex (i.e. 'M' for male, 'F' for female, and blank for unknown)

Subfolder 'angsd':
This folder contains the ANGSD filelist and output files, including called genotypes.

angsd_full_log_190623.txt = the full log file of all text output produced by running ANGSD
dasy_190623.arg = essentially a reduced log file output by ANGSD containing the arguments used in the function, to output genotypes and other files
dasy_190623.depthGlobal = The sequencing depth for all samples
dasy_190623.depthSample = This file contains nInd number of lines. Column1 is the number sites that has sequencing depth=0,Column2 is the number of sites that has sequencing depth=1 etc
dasy_190623.geno.gz = unfiltered SNP by sample matrix. gzip format.
dasy_190623.mafs.gz = minor allele frequencies for all loci. The allele frequency is obtained based on the genotype likelihoods. Headers: chromo, position, major, minor, knownEM, pK-EM, nInd. gzip format.
dasy_bam.filelist = full list of 501 samples used by ANGSD (a subset of all samples sent for sequencing - see "g_angsd.txt" file for details about ANGSD runs)

Folder 'filtered':
This folder contains the filtered genotypes and sample metadata, as well as all initial input genind files used in the MetaPop2 analysis.

dasy_genos_20231012.csv = filtered quoll genotypes used in primary R script (dasy_genomics_20230620_publish.R). The first column is the sample name and each column thereafter represents a single-nucleotide polymorphism (SNP). Headers for each SNP column are named by joining the scaffold number with the position on the scaffold, e.g. scaffold 1 at position 345 would be 'HiC_scaffold_1-345'. SNPs are defined as either 0 (homozygous reference allele), 1 (heterozygous), 2 (homozygous alternate allele), or NA for missing data.
samp_info_20231012.csv = filtered sample metadata used in primary R script (dasy_genomics_20230620_publish.R). Contact Brenton von Takach for locations of samples for the Yampi population. Headers include:
- sample - name given to individual
- state - Australian jurisdiction where the tissue sample was collected
- region - regional name of location where the tissue sample was collected
- popn - name given to locality in which a given set of tissue samples was collected.
- lat - latitude in decimal degrees of where an individual was captured
- long - longitude in decimal degrees of where an individual was captured
- treatment - category defining whether the population was on an 'island', had been exposed to cane toads ('post-toad'), or was naive to cane toads ('pre-toad').
- date - date that the tissue sample was collected
- sex - category of biological sex (i.e. 'M' for male, 'F' for female, and blank for unknown)

Subfolder 'metapop2':
This folder contains all input files used to run the metapop2 analysis at the three hierarchical levels used in the manuscript.

quolls_pop_in_kimb20231017.RData = 7 populations from the Kimberley region, using only polymorphic SNPs in the Kimberley. Minimum population size is 5 so was analysed at 5 individuals per population.
quolls_pop_in_species20231017.RData = 20 populations from across the species distribution, using all SNPs & all individuals. Minimum populations size is 5 so it was analysed at 5 individuals/population.
quolls_region_in_species20231017.RData = QLD, NT, Groote, Kimberley, Pilbara. Uses all SNPs, but subset to 5 individuals per population (to unbias sampling effects) and then pooled within regions. Analysed as 10 individuals per region.

In total, we processed 568 samples (including 36 technical replicates) for DNA extraction and sequencing. Tissue samples were obtained from researchers who were working across the entire distribution range of the northern quoll. Tissue collection spanned nearly three decades, from 1993 to 2021, and included the regional jurisdictions of Queensland, the Northern Territory, and Western Australia. To capture live animals, various trapping designs tailored to local conditions and research aims were used, but generally 10–20 cage traps were arranged in small grids (e.g. 1 ha) at intervals of about 1 km along roadsides or small vehicle tracks. In most cases, our samples consisted of small (2 mm diameter) ear tissue samples collected from live individuals. Individuals were sexed by visual inspection during processing, prior to release. Due to low capture rates of the target species in many areas, we assigned a single a priori population name (sampling locality) to all samples obtained from a given set of grids, even if the samples were collected over several years. Samples were typically stored in 70–100% ethanol, either in a freezer or at room temperature, until being sent to us for analysis.

Tissue samples were extracted in plate format using the standard protocol of the Qiagen DNeasy 96 Blood & Tissue Kit, with an extended lysis. This involved incubation at 56°C for 2 h, followed by a reduction in temperature to 37°C overnight. Following extraction, double-stranded DNA concentrations were quantified and normalised to 200 ng DNA in a total volume of 25 µL. These samples were then arranged in 96-well plates for double-digest restriction-associated DNA (ddRAD) sequencing at the Australian Genome Research Facility in Melbourne, Victoria. Each plate included several within-plate and among-plate technical replicates (for a total of 36 technical replicates) and a negative control (blank).

To determine the optimal combination of two restriction enzymes for ddRAD sequencing, three establishment samples (broadly representative of the species distribution) were used. PstI and NlaIII were deemed the most suitable enzymes for achieving the best genome representation while minimising repetitive sequences. The library preparation protocol included (1) digestion using PstI and NlaIII, (2) ligation with one of 48 unique inline barcoded adapters compatible with the restriction site overhang, (3) manual sample pooling, (4) DNA purification using the QIAquick PCR Purification Kit and SPRIselect paramagnetic beads, (5) size-selection targeting fragments of 280–375 bp in size using the BluePippin from Sage Science, and (6) a PCR amplification step where one of two multiplexing index primers was added (von Takach et al., 2022b). After indexing, libraries were pooled together and loaded onto flow cells for 150 bp single-end or paired-end sequencing (with only single-end reads used for analysis). Sequencing was performed on either an Illumina NextSeq 500 (three plates) or a NovaSeq 6000 (three plates) platform.

Our bioinformatics pipeline used a combination of tools and custom scripts to analyse sequencing data. Raw sequence data (2.271 billion reads) were first processed using the stacks process_radtags function for demultiplexing (Catchen et al., 2013), retaining 95.8% (2.176 billion) of reads. The demultiplexed files were then mapped to the published northern quoll chromosome-length genome assembly (https://www.dnazoo.org/assemblies/Dasyurus_hallucatus) using the bwa version 0.7.17 mem algorithm (Li, 2013), generating sequence alignment map (SAM) files for each sample (individuals and technical replicates), resulting in 2.25 billion alignments.

The SAM files were compressed to binary alignment map (BAM) files, and then each BAM file was filtered for unmapped alignments (retaining 98.8% = 2.223 billion alignments), then sorted and indexed using samtools v1.7-1 (Li et al., 2009). The filtered and sorted BAM files were then used to call single-nucleotide polymorphisms (SNPs) via the angsd (v0.93) software package (Korneliussen et al., 2014). The following filters were applied in angsd: minimum mapping quality of 20 (excluding reads that mapped poorly or mapped to repeat regions of the genome), minimum base quality of 20, minimum call rate of 0.5 (across samples), minimum depth per site of 1250, maximum depth per site of 53,200, minimum depth per individual of 6, maximum depth per individual of 100, allele balance ratio of 0.2, SNP likelihood p value ≤ 1×10^-5, and genotype posterior probability ≥ 0.98 (based on GATK genotype likelihood with a uniform prior). This retained 486,083 SNPs that were genotyped in ≥ 50% of samples.

All subsequent filters and analyses were conducted in a custom R (v4.3.1) (R Core Team, 2023) script that retained loci with < 5% missing data across samples and a minor allele count ≥ 3, as well as loci with observed heterozygosity < 0.6 (to exclude potential erroneously merged reads). Samples with > 35% missing data across SNPs were also excluded, to remove poorly genotyped individuals. To account for bias resulting from linkage disequilibrium (LD), we used the ‘SNPRelate’ package to prune SNPs in LD (Zheng et al., 2012), setting the LD threshold to 0.5 and the sliding window size to 500k bp. The gl.filter.sexlinked function in the ‘dartR’ package (Gruber et al., 2018; Mijangos et al., 2022) was then used to check for sex-linked SNPs. We also applied the filter.sex.linked function of Robledo-Ruiz et al. (2022), which checks for Y-linked loci, sex-biased loci, X-linked loci, and XY gametologs. No sex-linked loci were identified.

We then checked similarity between pairs of technical replicates, finding a mean similarity of 99.97%. Close pairing of technical replicates was also checked visually with a hierarchical clustering dendrogram (Supplementary Material Figure S1), after which we removed one individual from each pair of technical replicates, and one individual from each pair of close relatives (relatedness ≥ 0.25), with relatedness calculated using the method-of-moments technique in the ‘beta.dosage’ function of the hierfstat package (Goudet, 2005; Goudet et al., 2018). This approach estimates kinship values between pairs of individuals relative to the average kinship values of all pairs of individuals in the sampled population (i.e., within each locality). The final dataset contained 12,962 SNPs and 384 individuals.