Genomic insights into the critically endangered King Island scrubtit
Data files
Jun 03, 2024 version files 118.62 MB
Abstract
Small, fragmented or isolated populations are at risk of population decline due to fitness costs associated with inbreeding and genetic drift. The King Island scrubtit Acanthornis magna greeniana is a critically endangered subspecies of the nominate Tasmanian scrubtit A. m. magna, with an estimated population of < 100 individuals persisting in three patches of swamp forest. The Tasmanian scrubtit is widespread in wet forests on mainland Tasmania. We sequenced the scrubtit genome using PacBio HiFi and undertook a population genomic study of the King Island and Tasmanian scrubtits using a double-digest restriction site-associated DNA (ddRAD) dataset of 5,239 SNP loci. The genome was 1.48 Gb long, comprising 1,518 contigs with an N50 of 7.715 Mb. King Island scrubtits formed one of four overall genetic clusters, but separated into three distinct subpopulations when analysed independently of the Tasmanian scrubtit. Pairwise FST values were greater among the King Island scrubtit subpopulations than among most Tasmanian scrubtit subpopulations. Genetic diversity was lower and inbreeding coefficients were higher in the King Island scrubtit than all except one of the Tasmanian scrubtit subpopulations. We observed crown baldness in 8/15 King Island scrubtits, but 0/55 Tasmanian scrubtits. Six loci were significantly associated with baldness, including one within the DOCK11 gene which is linked to early feather development. Contemporary gene flow between King Island scrubtit subpopulations is unlikely, with further field monitoring required to quantify the fitness consequences of its small population size, low genetic diversity and high inbreeding. Evidence-based conservation actions can then be implemented before the taxon goes extinct.
README: Genomic insights into the critically endangered King Island scrubtit
https://doi.org/10.5061/dryad.12jm63z66
The dataset comprises:
1) the raw .vcf file "CAGRF220911987_GA.vcf"
2) a csv of the final filtered dataset used for genomic analysis within the manuscript: "scrubtit_genos_140623.csv". Rows contain samples and columns contain SNPs
3) a csv of sample metadata: "Sampinfo_ross_check.csv". Rows contain samples and columns contain metadata
4) an annotated R script for repeating the analysis
Description of the data and file structure
2) a csv of the final filtered dataset used for genomic analysis within the manuscript: "scrubtit_genos_140623.csv". Rows contain samples and columns contain SNPs
column 1: sample ID (matches 'sample1', column 2 of the sample metadata).
columns 2-5239: SNP ids
Missing data code: NA
3) a csv of sample metadata: "Sampinfo_ross_check.csv". Rows contain samples and columns contain metadata
column 1- sample = sequencing sample ID
column 2- sample1 = linked sample ID in the sample x SNP matrix
column 3- state = state in which sample was obtained
column 4- region = 2-level factor of Mainland (Tas) or King Island
column 5- popn= 11 level factor for sample subpopulation as shown in Figure 1 of the manuscript. Also includes 'Weinglangta' where individual was collected for genome sequencing
column 6- popn2 = 3 level factor for sampling area - mainland (Tas), east coast (Tas) or King Island
column 7 & 8 - WGS84 decimal lat/long of sampling location
column 9- treatment - all 'wild'
column 10- sampling date in dd/mm/yyyy
column 11- sex - molecular sex of sample
column 12- notes - whether or not individual showed baldness on crown
column 13- tech rep- whether or not sample was a technical replicate in the sequencing. no = 'blank', yes = 'Technical Replicate'
column 14- specimen_id- Threatened species initiative specimen ID
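The two csv files are linked through the sample ID, so a typical first step is to attach each sample's metadata to its genotype row. The following is a minimal Python sketch of that join using only the standard library. It runs on small in-memory toy tables rather than the real files. The genotype file's ID column header (here "id"), the SNP column names and the toy values are assumptions made for illustration; only the 'sample1' key and the NA missing-data code come from the description above.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_genotypes_to_metadata(geno_rows, meta_rows, geno_id_col, meta_id_col="sample1"):
    """Attach the matching metadata row to each genotype row via the sample ID."""
    meta_by_id = {row[meta_id_col]: row for row in meta_rows}
    joined = []
    for row in geno_rows:
        sample_id = row[geno_id_col]
        # Convert the README's missing-data code ("NA") to None.
        genotypes = {
            snp: (None if value == "NA" else value)
            for snp, value in row.items()
            if snp != geno_id_col
        }
        joined.append(
            {"id": sample_id, "meta": meta_by_id.get(sample_id), "genotypes": genotypes}
        )
    return joined


# Toy stand-ins for the two files (hypothetical headers and values).
geno_text = "id,snp_1,snp_2\nA1,0,NA\nB2,2,1\n"
meta_text = "sample,sample1,region\nS001,A1,King Island\nS002,B2,Mainland\n"

joined = join_genotypes_to_metadata(
    read_rows(geno_text), read_rows(meta_text), geno_id_col="id"
)
```

To use this with the real files, read their contents into strings and pass the actual header of the first genotype column as geno_id_col.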
Sharing/Access information
The genome assembly and raw transcriptome data will be made available under NCBI’s BioProject PRJNA1014961. The raw PacBio HiFi reads are publicly available from the Bioplatforms Australia Threatened Species Initiative: https://data.bioplatforms.com/organization/threatened-species. The assembled genome, global transcriptome, and genome annotation generated in this study are available on Amazon Web Services Australasian Genomes Open Data Store: https://awgg-lab.github.io/australasiangenomes/genomes.html.
Code/Software
All analysis was conducted in R using the package versions stated within the manuscript.
Methods
2.1 Sample collection
To obtain indicative genetic diversity metrics across mainland Tasmania, we sampled between five and eleven scrubtits from seven a priori subpopulations on mainland Tasmania (including Bruny Island) during the non-breeding season (January – March 2021). Due to small population sizes and licensing restrictions on King Island, we sampled five individuals from each of the three locations during the same non-breeding season (Table 1, Figure 1). We trapped scrubtits using a single 6 m mist net and one minute of scrubtit song broadcast using portable speakers (ANU animal ethics permit # A2021/33). We sampled blood (< 20 μl per individual) using the standard brachial venepuncture technique with a 0.7 mm needle into 70% ethanol. For two individuals from whom we were unable to safely obtain blood, we collected feathers shed during handling. One male Tasmanian scrubtit was collected under licence (see acknowledgements) for genome sequencing, from which organ tissue samples (heart, spleen, kidney, gonads, brain, liver) were taken (Table S1). For each individual we took standard morphometric measurements and scanned for any unusual physical features such as feather abnormalities or skin lesions that may be indicators of poor health. A single observer (CY) sampled and measured all birds, and the maximum capture time was 35 minutes. No birds showed adverse reactions to sampling and all flew off strongly upon release. The fifteen individuals sampled on King Island were the maximum permissible sample size under licence conditions.
2.2 DNA extraction, sexing and sequencing
High molecular weight DNA was extracted from flash frozen heart and kidney using the Nanobind Tissue Big DNA Kit v1.0 11/19 (Circulomics). A Qubit fluorometer (Thermo Fisher Scientific) was used to quantify DNA concentrations with the Qubit dsDNA BR assay kit (Thermo Fisher Scientific). RNA was extracted from heart, spleen, kidney, gonads, brain, and liver stored in RNAlater using the RNeasy Plus mini Kit (Qiagen) with RNAse-free DNAse (Qiagen) digestion. RNA quality was assessed via Nanodrop (Thermo Fisher Scientific). We extracted DNA for population genomics from blood and feather samples using the Monarch® Genomic DNA Purification Kit (New England BioLabs, Victoria, Australia). We quantified DNA concentrations using a Qubit 3.0 fluorometer (yield range 10.3 – 209 ng μl⁻¹, Table S1) and standardised the concentration of each sample to 10 – 30 ng μl⁻¹ DNA in 20 – 25 μl. We determined the sex of individuals using a polymerase chain reaction (PCR) protocol adapted from Fridolfsson and Ellegren (1999, Supplementary file S1). We arranged the samples on a single 96 well plate, containing five technical replicates of the samples with the highest DNA concentrations, an additional 21 non-technical replicates including all of the King Island samples, five extra samples from mainland Tasmania and one negative control.
Double-digest restriction site-associated DNA (ddRAD) sequencing following Peterson et al. (2012) was undertaken at the Australian Genome Research Facility, Melbourne on an Illumina NovaSeq 6000 platform using 150 bp paired-end reads. Samples were first quantified using Quantifluor and visualised on 1% agarose e-gel to ensure all samples exceeded the minimum input DNA quantity of 50 ng. Three establishment samples with at least 250 ng DNA that were representative of the distribution of the samples (2 Tasmanian scrubtits, 1 King Island scrubtit) were used to determine the optimal combination of restriction enzymes, which were EcoRI and HpyCH4IV. Further details on the library preparation protocol are provided in Supplementary file S1.
2.3 Genome sequencing and assembly
Full methodological details of the genome and transcriptome sequencing and assembly are provided in Supplementary file S2. In summary, high molecular weight DNA was sent for PacBio HiFi library preparation with Pippin Prep and sequencing on one single molecule real-time (SMRT) cell of the PacBio Sequel II (Australian Genome Research Facility, Brisbane, Australia). Total RNA was sequenced as 100 bp paired-end reads using Illumina NovaSeq 6000 with Illumina Stranded mRNA library preparation at the Ramaciotti Centre for Genomics (University of New South Wales, Sydney, Australia). Genome assembly was conducted on Galaxy Australia (The Galaxy Community, 2022) following the genome assembly guide (Price & Farquharson, 2022) using HiFiasm v0.16.1 with default parameters (Cheng et al., 2021; Cheng et al., 2022). Transcriptome assembly was conducted on the University of Sydney High Performance Computer, Artemis. Genome annotation was performed using FGENESH++ v7.2.2 (Softberry; Solovyev et al., 2006) on a Pawsey Supercomputing Centre Nimbus cloud machine (256 GB RAM, 64 vCPU, 3 TB storage) using the longest open reading frame predicted from the global transcriptome, non-mammalian settings, and optimised parameters supplied with the Corvus brachyrhynchos (American crow) gene-finding matrix. The mitochondrial genome was assembled using MitoHifi v3 (Uliano-Silva et al., 2023). Benchmarking universal single copy orthologs (BUSCO) was used to assess genome, transcriptome and annotation completeness (Manni et al., 2021).
2.4 Bioinformatics pipeline and SNP filtering
Raw sequence data were processed using Stacks v2.62 (Catchen et al., 2013) and aligned to the genome with BWA v0.7.17-r1188 (Li & Durbin, 2009). Full details of the bioinformatics pipeline, which produced a variant call format (VCF) file containing 45,488 variants for SNP filtering in R v4.0.3 (R Core team 2020) are provided in Supplementary file S1. We filtered genotyped variants using the “SNPfiltR” v1.0.0 package (DeRaad, 2022) based on (i) minimum read depth (≥ 5), (ii) genotype quality (≥ 20), (iii) maximum read depth (≤ 137), and (iv) allele balance ratio (0.2 – 0.8). Then, using a custom R script, we filtered SNPs based on (i) the level of missing data (< 5%); (ii) minor allele count (MAC ≥ 3), (iii) observed heterozygosity (< 0.6), and (iv) linkage disequilibrium (correlation < 0.5 among loci within 500,000 bp).
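The per-SNP thresholds in the second filtering stage can be illustrated with a short Python sketch. This is not the authors' custom R script. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with None for missing calls, which is an illustrative coding only, and it omits the linkage disequilibrium filter because that step needs genomic positions and pairwise correlations between loci.

```python
def snp_passes(genotypes, max_missing=0.05, min_mac=3, max_het=0.6):
    """Return True if one SNP passes the missing-data, MAC and heterozygosity filters.

    genotypes: list of alternate-allele counts per sample (0, 1 or 2),
    with None marking a missing call (an assumed coding).
    """
    n_samples = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # nothing genotyped at this locus

    missing_rate = (n_samples - len(called)) / n_samples

    # Minor allele count: the rarer of the two alleles among called genotypes.
    alt_count = sum(called)
    ref_count = 2 * len(called) - alt_count
    minor_allele_count = min(alt_count, ref_count)

    # Observed heterozygosity: proportion of called genotypes that are heterozygous.
    observed_het = sum(1 for g in called if g == 1) / len(called)

    return (
        missing_rate < max_missing
        and minor_allele_count >= min_mac
        and observed_het < max_het
    )


# Toy loci, each genotyped in 20 hypothetical samples.
keep_locus = [0] * 17 + [1] * 3          # MAC = 3, no missing data
rare_locus = [0] * 18 + [1] * 2          # MAC = 2, fails the MAC filter
paralog_like = [1] * 20                  # every sample heterozygous
sparse_locus = [0] * 15 + [1] * 3 + [None] * 2  # 10% missing
```

The defaults mirror the thresholds stated above (missing data < 5%, MAC ≥ 3, observed heterozygosity < 0.6).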
To ensure that relationships between individuals could be accurately inferred from the data, we used these SNPs and samples to construct a hierarchical clustering dendrogram based on genetic distance, with visual examination of the dendrogram confirming that all 24 replicates paired closely together on long branches (Figure S1). The percentage difference between called genotypes of technical replicates was also used to confirm that genotyping error rates were low after filtering (mean 99.91% ± 0.005% SE similarity between replicates). We therefore removed one of each replicate pair from all further analyses. We also made a higher-level bootstrapped dendrogram by using genetic distances among sampling localities instead of individuals (Figure S2).
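The replicate similarity reported above is a genotype concordance: the percentage of loci at which two replicates of the same sample received the same call. A minimal Python sketch follows. It assumes that loci missing in either replicate are excluded from the comparison, which the text does not specify, and it uses the same illustrative 0/1/2/None genotype coding.

```python
def replicate_similarity(geno_a, geno_b):
    """Percentage of loci with identical calls in two replicates of one sample.

    Loci where either replicate has a missing call (None) are skipped;
    this handling of missing data is an assumption.
    """
    compared = [
        (a, b) for a, b in zip(geno_a, geno_b) if a is not None and b is not None
    ]
    if not compared:
        raise ValueError("no loci were called in both replicates")
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


# Toy replicate pair: four loci, one mismatch, one locus missing in replicate A.
rep_a = [0, 1, 2, None]
rep_b = [0, 1, 1, 2]
similarity = replicate_similarity(rep_a, rep_b)
```

Averaging this value over all replicate pairs gives a summary comparable to the mean similarity quoted above.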
We used “tess3r” (Caye et al. 2016, 2018) to perform a genome scan for loci under selection, using the Benjamini-Hochberg algorithm (Benjamini & Hochberg, 1995), with a false discovery rate of 1 in 10,000 to correct for multiple testing. Because this method identified zero candidate loci under selection, we also used the gl.outflank function in “dartR” v2.0.4 to implement the OutFLANK method (Whitlock & Lotterhos 2015) to infer the distribution of FST for loci unlikely to be strongly affected by spatially diversifying selection. This method also identified zero putatively adaptive loci, leaving a final dataset for formal population genetic analysis containing all 70 originally sampled individuals, 5,239 biallelic SNPs, and an overall missing data level of 0.98 %. The number of SNPs and samples removed from the dataset at each filtering step is provided in Table S2.
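For readers unfamiliar with the Benjamini-Hochberg correction mentioned above, the step-up procedure can be sketched in a few lines of Python. This is a generic illustration, not the tess3r or dartR code: p-values are sorted, the largest rank k with p(k) ≤ (k / m) × q is found, and every test up to that rank is declared significant. The p-values below are invented, and q = 1e-4 corresponds to the stated false discovery rate of 1 in 10,000.

```python
def benjamini_hochberg(p_values, fdr=1e-4):
    """Return the indices of tests declared significant at the given FDR."""
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value falls under its step-up threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # All tests up to that rank are rejected, even if some missed their own threshold.
    return sorted(order[:cutoff_rank])


# Invented per-locus p-values for illustration.
toy_p_values = [1e-6, 0.5, 3e-5, 0.9]
candidate_loci = benjamini_hochberg(toy_p_values, fdr=1e-4)
```

An empty result corresponds to the outcome reported above, in which no candidate loci were identified.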
See the accompanying Supplementary File for further information on library preparation, molecular sexing, bioinformatics, genome sequencing, assembly and annotation. References cited above are provided in the main document.