A chromosome-scale genome assembly of the okapi (Okapia johnstoni)
Winter, Sven; Coimbra, Raphael T. F.; Helsen, Philippe; Janke, Axel (2022), A chromosome-scale genome assembly of the okapi (Okapia johnstoni), Dryad, Dataset, https://doi.org/10.5061/dryad.37pvmcvp3
The okapi (Okapia johnstoni), or forest giraffe, is the only species in its genus and the only extant sister group of the giraffe within the family Giraffidae. The species is one of the remaining large vertebrates surrounded by mystery because of its elusive behavior as well as the armed conflicts in the region where it occurs, making it difficult to study. Deforestation puts the okapi under constant anthropogenic pressure, and it is currently listed as “Endangered” on the IUCN Red List. Here, we present the first annotated de novo okapi genome assembly based on PacBio continuous long reads, polished with short reads, and anchored into chromosome-scale scaffolds using Hi-C proximity ligation sequencing. The final assembly (TBG_Okapi_asm_v1) has a length of 2.39 Gbp, of which 98% are represented by 28 scaffolds >3.9 Mbp. The contig N50 of 61 Mbp and scaffold N50 of 102 Mbp, together with a BUSCO score of 94.7%, and 23,412 annotated genes, underline the high quality of the assembly. This chromosome-scale genome assembly is a valuable resource for future conservation of the species and comparative genomic studies among the giraffids and other ruminants.
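For readers unfamiliar with the contiguity statistics quoted above, the following is a minimal sketch of how an N50 value is commonly computed: sequences are sorted from longest to shortest, and the N50 is the length of the sequence at which the running total first reaches half of the total assembly length. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the okapi assembly or from the authors' scripts.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical values, not from TBG_Okapi_asm_v1).
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

Applied separately to contig lengths and to scaffold lengths, the same calculation gives the contig N50 and the scaffold N50.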
We assembled the genome of the okapi from PacBio CLR reads with WTDBG2 v. 2.5 (Ruan & Li, 2019), using the preset for PacBio Sequel reads (flag '-x sq'), followed by three iterations of long-read polishing with racon v.1.4.3 (Vaser et al., 2017) and three iterations of short-read polishing with pilon v.1.23 (Walker et al., 2014). The assembly was anchored into chromosome-scale scaffolds with Dovetail Genomics' HiRise pipeline (Putnam et al., 2016) using publicly available Hi-C data (SRR8616855, SRR8616856). Subsequently, three iterations of gap-closing were performed using TGS-GapCloser v.1.1.1 (Xu et al., 2020). The resulting final assembly can be found under the filename:
Prior to gene annotation, we used RepeatModeler v. 2.0.1 (Flynn et al., 2020) to generate a de novo repeat library. This library was combined with a Cetartiodactyla-specific library from RepBase (Bao et al., 2015) and used as a custom repeat library for masking repeats with RepeatMasker v.4.1.0 (http://www.repeatmasker.org/RMDownload.html). Interspersed repeats were hard-masked, while simple repeats were soft-masked. The masked assembly file can be found under the filename:
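To illustrate the masking distinction described in the preceding paragraph, the sketch below applies both kinds of masking to a toy sequence: hard-masking replaces bases with 'N', whereas soft-masking converts them to lowercase and keeps the sequence information. This is not how RepeatMasker is implemented; the `mask` function, the toy sequence, and the repeat coordinates are hypothetical and serve only to show what to expect in the masked FASTA file.

```python
def mask(sequence, intervals, hard):
    """Mask 0-based, half-open (start, end) intervals in a sequence.

    hard=True replaces bases with 'N'; hard=False lowercases them.
    """
    bases = list(sequence)
    for start, end in intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = "N" if hard else bases[i].lower()
    return "".join(bases)


# Toy sequence and repeat coordinates (hypothetical, for illustration only).
seq = "ACGTACGTATATATATGGCC"
interspersed_repeats = [(2, 6)]   # assumed interspersed-repeat hit
simple_repeats = [(8, 16)]        # assumed simple-repeat hit

masked = mask(seq, interspersed_repeats, hard=True)
masked = mask(masked, simple_repeats, hard=False)
print(masked)
```

In a file masked this way, downstream tools that honor soft-masking can still align through the lowercase simple repeats, while the hard-masked interspersed repeats are no longer recoverable from the masked file alone.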
After repeat masking, we used the GeMoMa pipeline v.1.7.1 (Keilwagen et al., 2016, 2018) for homology-based gene prediction with the alignment tool MMseqs2 (Steinegger & Söding, 2017). As references, we used the assemblies and annotations of the following ten mammal species from GenBank: Bos taurus (GCF_002263795.1), Homo sapiens (GCF_000001405.39), Mus musculus (GCF_000001635.27), Sus scrofa (GCF_000003025.6), Camelus dromedarius (GCF_000803125.2), Equus caballus (GCF_002863925.1), Ovis aries (GCF_002742125.1), Tursiops truncatus (GCF_011762595.1), Cervus hanglu yarkandensis (GCA_010411085.1), and Capra hircus (GCF_001704415.1). Subsequently, the predicted genes were annotated by a BLASTP v.2.11.0+ (Zhang et al., 2000) search against the Swiss-Prot database (release 2021-01) with an e-value cutoff of 1e-6. We further annotated Gene Ontology terms, motifs, and domains using InterProScan v.5.50.84 (Jones et al., 2014; Quevillon et al., 2005). The annotation results (gff3, CDS fasta, proteins fasta) can be found under the filenames:
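As an illustration of the e-value cutoff mentioned above, the sketch below filters BLAST hits in the standard 12-column tabular layout (`-outfmt 6`), where the e-value is the 11th column. Several points are assumptions rather than statements from the dataset description: that the search output was tabular, that hits exactly at the cutoff are retained, and all query and subject identifiers in the toy input, which are hypothetical.

```python
EVALUE_CUTOFF = 1e-6  # cutoff stated in the description


def filter_hits(lines, cutoff=EVALUE_CUTOFF):
    """Return (query, subject, evalue) for tabular BLAST hits at or below the cutoff."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Default -outfmt 6 columns: qseqid sseqid pident length mismatch
        # gapopen qstart qend sstart send evalue bitscore
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= cutoff:
            kept.append((query, subject, evalue))
    return kept


# Toy hits with hypothetical identifiers (not real results).
example = [
    "geneA\tprotX\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-120\t400",
    "geneA\tprotY\t35.0\t80\t52\t0\t10\t90\t5\t85\t3e-04\t42",
    "geneB\tprotZ\t60.0\t150\t60\t0\t1\t150\t1\t150\t2e-30\t120",
]
for hit in filter_hits(example):
    print(hit)
```

Only the first and third toy hits pass the filter; the second has an e-value of 3e-04, which is above the cutoff.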
For detailed methods and additional results, please read the linked publication in the Journal of Heredity.
LOEWE Centre for Translational Biodiversity Genomics