Winter, Sven1; de Raad, Jordi1; Wolf, Magnus1; Coimbra, Raphael T. F.1; de Jong, Menno J.1; Schöneberg, Yannis1; Christoph, Maria2; von Klopotek, Hagen2; Bach, Katharina2; Pashm Foroush, Behgol2; Hanack, Wiebke2; Kauffeldt, Aaron Hagen2; Milz, Tim2; Ngetich, Emmanuel Kipruto2; Wenz, Christian2; Sonnewald, Moritz3; Nilsson, Maria A.1; Janke, Axel1

Published Dec 26, 2022; Updated Jan 13, 2023 on Dryad. https://doi.org/10.5061/dryad.7pvmcvdxv

Abstract

Despite increasing sequencing efforts, numerous fish families still lack a reference genome, which complicates genetic research. One such understudied family is the sand lances (Ammodytidae, literally: ‘sand burrower’), a globally distributed clade of over 30 fish species that tend to avoid tidal currents by burrowing into the sand. Here, we present the first annotated chromosome-level genome assembly of the great sand eel (Hyperoplus lanceolatus). The genome assembly was generated using Oxford Nanopore Technologies long sequencing reads and Illumina short reads for polishing. The final assembly has a total length of 808.5 Mbp, of which 97.1% was anchored into 24 chromosome-scale scaffolds using proximity-ligation scaffolding. The assembly is highly contiguous, with a scaffold N50 of 33.7 Mbp and a contig N50 of 31.3 Mbp, and has a BUSCO completeness score of 96.9%. The presented genome assembly is a valuable resource for future studies of sand lances, which are of great ecological and commercial importance. It may also contribute to studies aiming to resolve the suprafamiliar taxonomy of bony fishes.

Genome assembly

We assembled the genome of Hyperoplus lanceolatus from Oxford Nanopore (ONT) reads with WTDBG2 v.2.5 (Ruan & Li, 2019) using the preset for ONT reads (flag '-x ont'). The assembly was then polished with three iterations of long-read polishing with racon v.1.4.3 (Vaser et al., 2017), one iteration of polishing with Medaka v.0.11.5 (Oxford Nanopore Technologies LTD., 2018), and three iterations of short-read polishing with pilon v.1.23 (Walker et al., 2014). The assembly was scaffolded into chromosome-scale scaffolds with the Dovetail Genomics' HiRise pipeline (Putnam et al., 2016) using proximity-ligation data generated by the Dovetail Omni-C kit. Subsequently, gap-closing was performed using TGS-GapCloser v.1.1.1 (Xu et al., 2020), followed by the removal of haplotigs with purge_dups v.1.2.5 (Guan et al., 2020). The resulting final assembly, including the mitochondrial genome generated with MitoZ v.2.4 (Meng et al., 2019), can be found under the filename:

TBG_H_lanceolatus_asm_v1.1.fasta
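
The contiguity figures reported in the abstract (total length and scaffold N50) can be checked against the assembly FASTA. The following is a minimal Python sketch and not part of the authors' protocol. The function names `sequence_lengths` and `n50` are illustrative, and the code assumes an uncompressed FASTA file read line by line.

```python
from typing import Iterable, Iterator


def sequence_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each record in FASTA-formatted lines."""
    length = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # finish the previous record
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length  # finish the last record


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy example with made-up sequences (not data from this dataset).
toy_fasta = [">scaffold_a", "ACGTACGT", "AC", ">scaffold_b", "GGGG"]
toy_lengths = list(sequence_lengths(toy_fasta))
print(toy_lengths, n50(toy_lengths))
```

To apply it to the assembly, pass an open file handle instead of the toy list, for example `with open("TBG_H_lanceolatus_asm_v1.1.fasta") as handle: lengths = list(sequence_lengths(handle))`. Note that this yields scaffold-level values; a contig N50 would additionally require splitting scaffolds at gaps.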

Transcriptome

A transcriptome was assembled following the best-practice guidelines described at https://informatics.fas.harvard.edu/best-practices-for-de-novo-transcriptome-assembly-with-trinity.html from Illumina RNAseq data generated from brain, heart, gill, muscle, liver, gonad, and pyloric gland tissue. The transcriptome can be found under the filename:

Hlan002_transcriptome_cleaned_for_ncbi_final.fasta

Annotation

Prior to gene annotation, we used RepeatModeler v.2.0.1 (Flynn et al., 2020) to generate a de novo repeat library. This library was combined with an Actinopterygii-specific library from RepBase (Bao et al., 2015) and used as a custom repeat library for the masking of repeats with RepeatMasker v.4.1.0 (http://www.repeatmasker.org/RMDownload.html). First, we hard-masked all repeats in the assembly. In addition, we generated a masked assembly with hard-masked interspersed repeats and soft-masked simple repeats. The masked assembly files, the de novo repeat library, and the related RepeatMasker output files can be found under the filenames:

De novo repeat library:

consensi.fa.classified

All repeats hard-masked:

TBG_H_lanceolatus_asm_final.purged_mtgenome.fa.masked
TBG_H_lanceolatus_asm_final.purged_mtgenome.fa.tbl
TBG_H_lanceolatus_asm_final.purged_mtgenome.fa.out

Interspersed repeats hard-masked:

TBG_H_lanceolatus_asm_final_hardmaskedTEs.fasta
TBG_H_lanceolatus_asm_final_hardmaskedTEs.out
TBG_H_lanceolatus_asm_final_hardmaskedTEs.tbl

Interspersed repeats hard-masked (see above) and simple repeats soft-masked:

TBG_H_lanceolatus_asm_final_hardmaskedTEs_softmaskedSR.fasta
TBG_H_lanceolatus_asm_final_hardmaskedTEs_softmaskedSR.out
TBG_H_lanceolatus_asm_final_hardmaskedTEs_softmaskedSR.tbl
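
The three masked FASTA files differ in how repeats are encoded: hard-masked bases are replaced by `N`, while soft-masked bases are written in lowercase. The following Python sketch, which is not part of the authors' protocol, shows one way to summarize both fractions for a masked file. The function name `masking_summary` is illustrative. It assumes that hard-masking uses `N` (the RepeatMasker default), so gap bases in scaffolds are counted as hard-masked as well.

```python
from collections import Counter
from typing import Dict, Iterable


def masking_summary(lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of hard-masked (N/n), soft-masked (other
    lowercase) and unmasked bases in FASTA-formatted lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        hard = line.count("N") + line.count("n")
        # Lowercase letters other than 'n' are treated as soft-masked.
        soft = sum(map(str.islower, line)) - line.count("n")
        counts["hard"] += hard
        counts["soft"] += soft
        counts["unmasked"] += len(line) - hard - soft
    total = sum(counts.values())
    if total == 0:
        return {"hard": 0.0, "soft": 0.0, "unmasked": 0.0}
    return {key: counts[key] / total for key in ("hard", "soft", "unmasked")}


# Toy example with a made-up sequence (not data from this dataset).
print(masking_summary([">toy", "ACGTNNacgt"]))
```

Run on the file with hard-masked interspersed repeats and soft-masked simple repeats, the `hard` and `soft` fractions give a rough per-class masking estimate; the RepeatMasker `.tbl` files remain the authoritative summary.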

Homology-based gene prediction was performed with the GeMoMa pipeline v.1.7.1 with mapped RNAseq data as evidence and the following five references:

Acanthochromis polyacanthus (GCA_002109545.1), Perca fluviatilis (GCA_010015445.1), Gasterosteus aculeatus (GCA_016920845.1), Betta splendens (GCA_900634795.3), Acanthopagrus latus (GCA_904848185.1)

The predicted proteins were functionally annotated with InterProScan and BlastP against the Swiss-Prot database.

Annotation files:

H_lanceolatus_GeMoMa_all.fun.gff
H_lanceolatus_GeMoMa_proteins.fun.fasta
H_lanceolatus_GeMoMa_CDS.fun.fasta
H_lanceolatus_GeMoMa_summary
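
As a quick sanity check of the annotation, the feature types in the GFF file can be tallied. The sketch below is an illustration and not part of the authors' protocol. It assumes the standard nine-column, tab-separated GFF layout with the feature type in the third column; the function name `count_feature_types` is illustrative, and the feature types actually present depend on the file.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count GFF records per feature type (third column)."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # ignore malformed records
            counts[fields[2]] += 1
    return counts


# Toy records with made-up coordinates (not data from this dataset).
toy_gff = [
    "##gff-version 3",
    "chr1\ttoy\tmRNA\t1\t900\t.\t+\t.\tID=t1",
    "chr1\ttoy\tCDS\t1\t300\t.\t+\t0\tParent=t1",
    "chr1\ttoy\tCDS\t600\t900\t.\t+\t0\tParent=t1",
]
print(count_feature_types(toy_gff))
```

For the dataset, the same function can be called on an open handle to `H_lanceolatus_GeMoMa_all.fun.gff`.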

All commands used to generate the assembly, the annotation, and additional analyses are listed in the protocol file:

H_lanceolatus_assembly_commands.txt

A chromosome-scale reference genome assembly of the great sand eel, Hyperoplus lanceolatus
