Genome assembly of the Australian black tiger shrimp (Penaeus monodon) reveals a novel fragmented IHHNV EVE sequence

Huerlimann, Roger 1; Yinan, Wang 2; Kasinadhuni, Naga 2,3; Chon-Kit Kenneth, Chan 2; Jabbari, Jafar 2; Cowley, Jeff A 4; Wade, Nicholas M 4; Siemering, Kirby 2; Gordon, Lavinia 2; Tinning, Matthew 2; Montenegro, Juan D 2; Maes, Gregory E 5; Sellars, Melony J 4; Coman, Greg J 4; McWilliam, Sean 4; Zenger, Kyall R 1; Khatkar, Mehar S 6; Raadsma, Herman W 6; Donovan, Dallas 7; Gopala, Krishna 7; Jerry, Dean R 1; Wang, Yinan 2; Chan, Chon-Kit Kenneth 2; Jabbari, Jafar S 2; Krishna, Gopala 1

Published Dec 19, 2022 on Dryad. https://doi.org/10.5061/dryad.f4qrfj6xh

Data files

Dec 19, 2022 version files (4.87 GB total):

Pmono_genome.all.gff (4.87 GB)

README_.md (1.16 KB)

Abstract

Shrimp are a valuable aquaculture species globally; however, disease remains a major hindrance to shrimp aquaculture sustainability and growth. Mechanisms mediated by endogenous viral elements have been proposed as a means by which shrimp that encounter a new virus start to accommodate rather than succumb to infection over time. However, evidence on the nature of such endogenous viral elements and how they mediate viral accommodation is limited. More extensive genomic data on penaeid shrimp from different geographical locations should assist in exposing the diversity of endogenous viral elements. In this context, reported here is a PacBio Sequel-based draft genome assembly of an Australian black tiger shrimp (Penaeus monodon) inbred for one generation. The 1.89 Gbp draft genome comprises 31,922 scaffolds (N50: 496,398 bp) covering 85.9% of the projected genome size. The genome repeat content (61.8%, with 30% representing simple sequence repeats) is almost the highest identified for any species. The functional annotation identified 35,517 gene models, of which 25,809 were protein-coding and 17,158 were annotated using InterProScan. Scaffold scanning for specific endogenous viral elements identified an element comprising a 9,045-bp stretch of repeated, inverted, and jumbled genome fragments of infectious hypodermal and hematopoietic necrosis virus (IHHNV) bounded by a repeated 591/590 bp host sequence. As only near-complete linear ∼4 kb IHHNV genomes have previously been found integrated in the genome of P. monodon, this discovery has implications for the validity of PCR tests designed to specifically detect such linear endogenous viral element types. The existence of joined inverted IHHNV genome fragments also provides a means by which hairpin double-stranded RNA could be expressed and processed by the shrimp RNA interference machinery.

Shrimp breeding and selection for sequencing:

A second-generation (G2) male Penaeus monodon that had undergone a single cycle of inbreeding was selected for genomic sequencing. The original wild-caught broodstock were collected from a Queensland east coast location (approximately 17.3°S, 146.0°E) in September 2013. In October 2013, 14 first-generation (G1) families were produced from the broodstock at Seafarm Flying Fish Point hatchery (approximately 17.5°S, 146.1°E). In February 2015, pleopod tissue was sampled from 50 female and 50 male G1 broodstock. These tissues were genotyped using 2 x 60 SNP panels (Sellars et al. 2014) to identify the parental origin of each broodstock and to select related mating pairs to generate the inbred G2 progeny. In August 2015, groups of 50 juvenile males from 5 inbred G2 families were euthanized to collect muscle tissue from the first abdominal segment for sequencing and the second most anterior pair of pleopods for genotyping. These tissues, as well as the remainder of each shrimp (archived source of tissue for sequencing), were snap frozen under dry ice pellets and stored at -80°C. Each shrimp was then genotyped using the 120-SNP panel (Sellars et al. 2014) and a genome-wide SNP assay based on DArTSeq (Guppy et al. 2020). After ranking the 50 males based on inbreeding coefficient (F) and multi-locus heterozygosity (MLH) data from the 120-SNP panel, the individual (named Nigel) with the highest inbreeding coefficient was chosen for genomic sequencing. The choice was confirmed by applying the genome-wide DArTSeq SNP assay to the five most inbred shrimp identified by the 120-SNP panel, which recovered the same ranking (Nigel: MLH of 0.231 and F of 0.271).
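The two ranking statistics named above can be illustrated with a minimal sketch. The text does not state which estimators were used, so the definitions here are assumptions: MLH is taken as the fraction of typed loci that are heterozygous, expected heterozygosity as the mean of 2p(1 - p) over loci, and F as 1 - H_obs / H_exp. The function names and toy genotypes are hypothetical and are not data from this study.

```python
from typing import Sequence, Tuple

Genotype = Tuple[str, str]  # the two alleles called at one SNP locus


def multi_locus_heterozygosity(genotypes: Sequence[Genotype]) -> float:
    """Fraction of typed loci at which the two alleles differ (assumed MLH definition)."""
    if not genotypes:
        raise ValueError("no genotyped loci supplied")
    heterozygous = sum(1 for first, second in genotypes if first != second)
    return heterozygous / len(genotypes)


def expected_heterozygosity(allele_freqs: Sequence[float]) -> float:
    """Mean of 2p(1 - p) over biallelic loci, given one allele frequency p per locus."""
    if not allele_freqs:
        raise ValueError("no allele frequencies supplied")
    return sum(2.0 * p * (1.0 - p) for p in allele_freqs) / len(allele_freqs)


def inbreeding_coefficient(observed_het: float, expected_het: float) -> float:
    """F = 1 - H_obs / H_exp (one common estimator; assumed, not confirmed by the text)."""
    if expected_het <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - observed_het / expected_het


# Toy example: four loci, two of them heterozygous.
toy_genotypes = [("A", "A"), ("A", "G"), ("C", "C"), ("T", "C")]
toy_mlh = multi_locus_heterozygosity(toy_genotypes)
toy_f = inbreeding_coefficient(0.25, expected_heterozygosity([0.5, 0.5]))
```

Under these assumed definitions, ranking candidates by descending F (or ascending MLH) picks out the most inbred individual, which is the selection logic described in the paragraph.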

DNA extraction, library preparation and genome sequencing:

Multiple extraction methods were trialed to generate intact high-quality genomic DNA from stored muscle tissue of the single selected inbred shrimp. All DNA extractions and sequencing runs were carried out at the Australian Genome Research Facility (AGRF), Melbourne, Australia. For Illumina sequencing, the MagAttract HMW DNA kit (QIAGEN) was used and PCR-free fragment shotgun libraries were prepared using the ‘with-bead pond library’ construction protocol described by Fisher et al. (2011) with some modifications. The library was sequenced on two HiSeq 2500 lanes using a 250 bp PE Rapid sequencing kit (Illumina). The same DNA was also used to create a 10X Genomics Chromium library as per the manufacturer's instructions, which was sequenced on two HiSeq 2500 lanes using a 250 bp PE Rapid sequencing kit. For PacBio sequencing, the following DNA extraction methods were used with varying success: MagAttract HMW DNA kit (QIAGEN), Nanobind HMW Tissue DNA kit-alpha (Circulomics), and CTAB/Phenol/Chloroform. Libraries were prepared using the SMRTbell Template Prep Kit 1.0 (PacBio), loaded using either magbeads or diffusion, and sequenced using the Sequel Sequencing Kits versions 2.1 and 3.0 on a PacBio Sequel (Supplementary Table 1). The same muscle tissue was also used to prepare three Dovetail Hi-C libraries according to the manufacturer's instructions. Two libraries were sequenced on a shared lane of a NovaSeq S1 flow cell, and a third library was sequenced on one lane of a NovaSeq SP flow cell, with both sequencing runs generating 100 bp paired-end reads.

Genome assembly:

The quality of the initial short-read genome assemblies using either DISCOVAR de novo (Weisenfeld et al. 2014) with Illumina data, or Supernova (Weisenfeld et al. 2017) with 10X Genomics Chromium data, was poor. The most contiguous assembly was achieved using wtdbg2/redbean (Version 2.4, Ruan and Li 2019) with 75X coverage of PacBio data, setting the estimated genome size to 2.2 Gb, but without using the wtdbg2 inbuilt polishing. The raw assembly was subjected to two rounds of polishing using the PacBio subreads data in arrow (Version 2.3.3, github.com/PacificBiosciences/GenomicConsensus) and one round of polishing using the Illumina short-read data in pilon (Version 1.23, Walker et al. 2014). Scaffolds were constructed in two steps. Medium-range scaffolding was carried out using 10X Genomics Chromium data with longranger (Version 2.2.2, https://support.10xgenomics.com/genome-exome/software/downloads/latest) and ARCS (Version 1.0.6, Yeo et al. 2017), while long-range scaffolding was performed using Dovetail Hi-C data, and intra- and inter-chromosomal contact maps were built using HiC-Pro (Version 2.11.1, Servant et al. 2015) and SALSA (commit version 974589f, Ghurye et al. 2017). This genome assembly was then submitted to NCBI GenBank, which required the removal of two small scaffolds and the splitting of one scaffold. The overall quality of the final V1.0 genome was assessed using BUSCO and by mapping RNA-seq and Illumina short reads using HISAT2 (version 2.1.0, Kim et al. 2019).
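The contiguity and completeness figures quoted in the abstract (scaffold N50 and the percentage of the projected genome size covered) follow from simple arithmetic on scaffold lengths. The sketch below is illustrative only: the function names and toy lengths are hypothetical, and it assumes that the projected genome size equals the 2.2 Gb estimate supplied to wtdbg2, which the text does not state explicitly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that scaffolds of length >= L hold at least half the assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths supplied")
    half_total = sum(ordered) / 2
    running = 0
    result = ordered[-1]
    for length in ordered:
        running += length
        if running >= half_total:
            result = length
            break
    return result


def percent_of_estimate(assembled_bp: float, estimated_bp: float) -> float:
    """Assembled bases as a percentage of an estimated genome size."""
    if estimated_bp <= 0:
        raise ValueError("estimated genome size must be positive")
    return 100.0 * assembled_bp / estimated_bp


# Toy scaffold lengths (not from this assembly): total 290, half 145.
# Cumulative sums are 80, then 150, so the N50 is the second scaffold.
toy_n50 = n50([80, 70, 50, 40, 30, 20])

# Rounded figures from the text: 1.89 Gbp assembled against an assumed 2.2 Gb estimate.
coverage_percent = round(percent_of_estimate(1.89e9, 2.2e9), 1)
```

With the rounded inputs, the second calculation reproduces the 85.9% reported in the abstract, which is consistent with the 2.2 Gb assumption.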

Repeat annotation:

Repeat content was assessed with de novo searches using RepeatModeler (V2.0.1) and RepeatMasker (V4.1.0) via Dfam TE-Tools (V1.1, https://github.com/Dfam-consortium/TETools) within Singularity (V2.5.2, Kurtzer et al. 2017). Additionally, tandem repeat content was determined using Tandem Repeats Finder (V4.0.9, Benson 1999) within RepeatModeler. Analyses and plotting of interspersed repeats were carried out as per Cooke et al. (2020, github.com/iracooke/atenuis_wgs_pub/blob/master/09_repeats.md). For comparison, the genomes of the black tiger shrimp (Thai origin, www.biotec.or.th/pmonodon; Kim et al. 2019), whiteleg shrimp (P. vannamei, NCBI accession: QCYY00000000.1; Zhang et al. 2019), Japanese blue crab (Portunus trituberculatus, gigadb.org/dataset/100678; Tang et al. 2020), and Chinese mitten crab (Eriocheir japonica sinensis, NCBI accession: LQIF00000000.1) were run through the same analyses.

Gene prediction and annotation:

To generate an RNA-seq based transcriptome, raw data from a previous study (NCBI project PRJNA421400; Huerlimann et al. 2018) were mapped to the masked genome using STAR (Version 2.7.2b; Dobin et al. 2013), followed by transcript assembly with StringTie (Version 2.0.6; Pertea et al. 2015) (Supplementary Table 2). Additionally, the IsoSeq2 pipeline (PacBio) was used to process the Iso-Seq data generated in this study (Supplementary Table 2). Finally, the genome annotation was carried out in MAKER2 (v2.31.10; Campbell et al. 2014; Cantarel et al. 2008; Holt and Yandell 2011) using the assembled RNA-seq and Iso-Seq transcriptomes together with protein sequences of other arthropod species.
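As a starting point for working with the annotation file listed under Data files, the following minimal sketch tallies feature types in a GFF file. It assumes the file follows the standard nine-column, tab-separated GFF3 layout, possibly followed by an embedded ##FASTA section; the actual feature types and layout of Pmono_genome.all.gff have not been checked here. The function name and the toy records are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count values of the third (feature type) column of GFF3 records."""
    counts: Counter = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not annotation
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip lines that are not nine-column records
        counts[fields[2]] += 1
    return counts


# Toy records with made-up scaffold and gene names, for illustration only.
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1",
    "scaffold_1\tmaker\texon\t100\t400\t.\t+\t.\tParent=gene1-mRNA-1",
    "scaffold_1\tmaker\texon\t600\t900\t.\t+\t.\tParent=gene1-mRNA-1",
    "##FASTA",
    ">scaffold_1",
    "ACGT",
])

example_counts = count_feature_types(EXAMPLE_GFF.splitlines())
```

For the real 4.87 GB file, passing an open file handle (for example, `with open("Pmono_genome.all.gff") as handle: count_feature_types(handle)`) streams the records line by line rather than loading the whole file into memory.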
