A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus)
Data files
Jan 18, 2023 version files 11.11 GB
Ajub_assembly_commands.txt (20.87 KB)
consensi.fa.classified (651.50 KB)
README.md (3.96 KB)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta (2.43 GB)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.out (88.87 MB)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.tbl (2.37 KB)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta (2.43 GB)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.out (597.16 MB)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.tbl (2.36 KB)
VMU_Ajub_asm_v1.0.fasta (2.42 GB)
VMU_Ajub_asm_v1.0.fasta.masked (2.43 GB)
VMU_Ajub_asm_v1.0.fasta.out (733.86 MB)
VMU_Ajub_asm_v1.0.fasta.tbl (2.36 KB)
Abstract
The cheetah (Acinonyx jubatus, SCHREBER 1775) is a large felid and is considered the fastest land animal. Historically, it inhabited open grassland across Africa, the Arabian Peninsula, and southwestern Asia; however, only small and fragmented populations remain today. Here, we present a de novo genome assembly of the cheetah based on PacBio continuous long reads and Hi-C proximity ligation data. The final assembly (VMU_Ajub_asm_v1.0) has a total length of 2.38 Gb, of which 99.7% is anchored into the expected 19 chromosome-scale scaffolds. The contig and scaffold N50 values of 96.8 Mb and 144.4 Mb, respectively, a BUSCO completeness of 95.4%, and a k-mer completeness of 98.4% emphasize the high quality of the assembly. Furthermore, annotation of the assembly identified 23,622 genes and a repeat content of 40.4%. This new highly contiguous, chromosome-scale assembly will greatly benefit conservation and evolutionary genomic analyses and will be a valuable resource, e.g., for gaining a detailed understanding of the function and diversity of immune response genes in felids.
These data accompany the publication of the same name, "A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus)", soon to be published in the Journal of Heredity.
Any questions regarding this dataset or the publication can be addressed to the corresponding authors, Sven Winter (sven.winter@vetmeduni.ac.at) and Pamela Burger (pamela.burger@vetmeduni.ac.at).
Assembly:
The assembly was generated with Flye v.2.9 from one PacBio CLR library sequenced on one SMRT Cell on a Sequel IIe, including one iteration of long-read polishing. This was followed by one iteration of short-read polishing with Pilon v.1.23 using trimmed standard Illumina short reads generated on the Illumina NovaSeq 6000 platform. Subsequently, the contigs of the polished assembly were anchored into chromosome-scale scaffolds with YaHS v.1.1 using publicly available Hi-C data for the cheetah (SRR8616936, SRR8616937) that were prepared following the Arima Hi-C mapping pipeline (https://github.com/VGP/vgp-assembly/blob/master/pipeline/salsa/arima_mapping_pipeline.sh). Finally, two iterations of gap-closing were performed with TGS-GapCloser v.1.1.1 using a different random subset of PacBio reads (25%) for each iteration.
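The idea of using a different random 25% subset of reads for each gap-closing iteration can be illustrated with a minimal sketch. This is not the authors' procedure (the exact commands are listed in Ajub_assembly_commands.txt); the function name `subsample_reads`, the toy read IDs, and the seeds are hypothetical and chosen only for illustration.

```python
import random

def subsample_reads(read_ids, fraction=0.25, seed=0):
    """Return a reproducible random subset containing `fraction` of the reads.

    Hypothetical helper: a different seed yields a different subset,
    which mimics drawing a new 25% sample for each iteration.
    """
    rng = random.Random(seed)
    k = round(len(read_ids) * fraction)
    return sorted(rng.sample(read_ids, k))

# Toy stand-in for PacBio read identifiers (assumption, not real data).
reads = [f"read_{i:04d}" for i in range(1000)]

subset_round1 = subsample_reads(reads, fraction=0.25, seed=1)
subset_round2 = subsample_reads(reads, fraction=0.25, seed=2)

print(len(subset_round1), len(subset_round2))
```

Seeding the generator separately per iteration keeps each subset reproducible while still making the two subsets differ.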
Annotation:
Repeat annotation
To improve gene prediction, we first identified and masked the repeats in the assembly. A de novo repeat library was generated with RepeatModeler v.2.0.1 and combined with a Felidae-specific repeat library from RepBase. This custom repeat library was then used with RepeatMasker v.4.1.0 to hard-mask interspersed repeats and soft-mask simple repeats.
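The difference between hard-masking and soft-masking can be shown with a small sketch: hard-masked positions are replaced by N, while soft-masked positions are kept but written in lowercase. The toy sequence, the interval coordinates, and the helper `mask_sequence` are assumptions made for illustration; RepeatMasker itself performed the masking for the files in this dataset.

```python
def mask_sequence(seq, intervals, mode="hard"):
    """Mask 0-based, half-open intervals in a sequence.

    mode="hard": replace bases with N (sequence information is lost).
    mode="soft": convert bases to lowercase (sequence is preserved).
    """
    if mode not in ("hard", "soft"):
        raise ValueError("mode must be 'hard' or 'soft'")
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if mode == "hard" else chars[i].lower()
    return "".join(chars)

seq = "ACGTACGTATATATATGGCC"
interspersed = [(0, 4)]   # hypothetical interspersed-repeat hit
simple = [(8, 16)]        # hypothetical simple repeat, (AT)n

# Hard-mask interspersed repeats first, then soft-mask simple repeats.
hard_te = mask_sequence(seq, interspersed, mode="hard")
hard_te_soft_sr = mask_sequence(hard_te, simple, mode="soft")

print(hard_te)           # NNNNACGTATATATATGGCC
print(hard_te_soft_sr)   # NNNNACGTatatatatGGCC
```

Soft-masking keeps the underlying bases available to downstream tools, which is why it is often preferred for simple repeats that may overlap genes.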
Gene annotation
We predicted genes in the masked assembly based on homology using the GeMoMa pipeline v.1.7.1 and the following reference assemblies and annotation files: Homo sapiens (GCF_000001405.40), Mus musculus (GCF_000001635.27), Lynx canadensis (GCF_007474595.2), Canis lupus familiaris (GCF_014441545.1), Prionailurus bengalensis (GCF_016509475.1), Leopardus geoffroyi (GCF_018350155.1), Felis catus (GCF_018350175.1), Panthera tigris (GCF_018350195.1), and Panthera leo (GCF_018350215.1).
We functionally annotated the predicted proteins using InterProScan v.5.50.84 and a BLASTP v.2.11.0 search against the Swiss-Prot database (release 2021-02).
For more details on assembly quality assessment and comparative analyses to other Felidae assemblies, please read the original manuscript.
This dataset comprises the following files:
VMU_Ajub_asm_v1.0.fasta (final unmasked assembly, also available at GenBank under accession GCA_027475565.1)
VMU_Ajub_asm_v1.0.fasta.masked (final assembly with all repeats hard-masked)
VMU_Ajub_asm_v1.0.fasta.out (RepeatMasker annotation listing of individual repeat hits for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
VMU_Ajub_asm_v1.0.fasta.tbl (RepeatMasker summary table for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta (final assembly with all interspersed repeats hard-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.out (RepeatMasker annotation listing of individual repeat hits for VMU_Ajub_asm_v1.0_hardmaskedTE.fasta, the assembly with interspersed repeats hard-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.tbl (RepeatMasker summary table for VMU_Ajub_asm_v1.0_hardmaskedTE.fasta, the assembly with interspersed repeats hard-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta (final assembly with all interspersed repeats hard-masked and simple repeats soft-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.out (RepeatMasker annotation listing of individual repeat hits for VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta, the assembly with interspersed repeats hard-masked and simple repeats soft-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.tbl (RepeatMasker summary table for VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta, the assembly with interspersed repeats hard-masked and simple repeats soft-masked)
consensi.fa.classified (de novo repeat library for the final assembly VMU_Ajub_asm_v1.0.fasta generated with RepeatModeler2)
Ajub_assembly_commands.txt (list of all commands used to generate the assembly and all related analyses)
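
As a quick sanity check after downloading one of the FASTA files listed above, the following minimal sketch computes the total length, the sequence N50, and the number of N and lowercase bases. It is an illustration only, not part of the authors' workflow; the function names and the toy FASTA record are assumptions. Note that N counts in a hard-masked file include both masked repeats and any assembly gaps, so they are not a pure repeat measure.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def summarize(handle):
    """Collect basic length and masking statistics for a FASTA file."""
    lengths, n_bases, lowercase_bases = [], 0, 0
    for _, seq in read_fasta(handle):
        lengths.append(len(seq))
        n_bases += seq.count("N") + seq.count("n")
        lowercase_bases += sum(1 for c in seq if c.islower() and c != "n")
    return {
        "sequences": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "n_bases": n_bases,
        "lowercase_bases": lowercase_bases,
    }

# Toy FASTA standing in for a masked assembly file (assumption, not real data).
toy = io.StringIO(">chr1\nNNNNACGTat\natatatGGCC\n>chr2\nACGTAC\n>chr3\nACG\n")
stats = summarize(toy)
print(stats)
```

For a real file, the `io.StringIO` object would be replaced by an opened file, e.g. `open("VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta")`. The sketch holds one sequence in memory at a time, so the largest chromosome-scale scaffold determines the memory needed.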
