Chromosome-level genome assembly and annotation of Pterygoplichthys pardalis

Xia, Wangxiao1; Xu, Hao1; Liu, Yaowen2; Jiang, Hui3; Shi, Jing1; Wu, Yonghong1; Yu, Yameng1; Li, Xiaomin4; Fan, Wenbo4; Zhang, Yuanwei5; Xu, Lixian4

Published Nov 16, 2022; Updated May 14, 2025 on Dryad. https://doi.org/10.5061/dryad.bk3j9kdgh

Data files

Nov 16, 2022 version files 1.64 GB

Pterygoplichthys_pardalis_genome.fasta: 1.53 GB
Pterygoplichthys_pardalis.cds: 41.17 MB
Pterygoplichthys_pardalis.gff3: 53.21 MB
Pterygoplichthys_pardalis.pep: 14.71 MB
README.md: 683 B

May 14, 2025 version files 1.89 GB

Pterygoplichthys_pardalis_denovo.gff3: 52.12 MB
Pterygoplichthys_pardalis_genome.fasta: 1.53 GB
Pterygoplichthys_pardalis_homology.gff3: 194.14 MB
Pterygoplichthys_pardalis_transcripts.gff3: 9.86 MB
Pterygoplichthys_pardalis.cds: 41.17 MB
Pterygoplichthys_pardalis.gff3: 53.21 MB
Pterygoplichthys_pardalis.pep: 14.71 MB
README.md: 3.85 KB

Abstract

Suckermouth catfishes, owing to their powerful evolved traits, have become notorious invasive species that cause significant damage to aquatic ecosystems. However, the lack of high-quality genomes severely restricts research on this group. In this study, we generated a de novo chromosome-level genome assembly of Pterygoplichthys pardalis from multiple types of sequencing data, including Illumina short reads, Nanopore long reads, and Hi-C sequencing reads, resulting in a 1.51 Gb assembly. Multiple evaluations, including the read mapping ratio (98.52%), transcript mapping ratio (99.61%), conserved BUSCO gene set (98.8%), and N50 (49.47 Mb), indicated the high continuity and accuracy of the assembly. Genome annotation identified 0.97 Gb of repetitive sequences, accounting for 64.47% of the genome assembly. Further, 23,859 protein-coding genes were predicted, 92.92% of which could be annotated in functional databases. This high-quality genome assembly provides a valuable resource for understanding the genetic underpinnings of the invasive success of P. pardalis and offers critical data for future fisheries research and management.

Dataset DOI: 10.5061/dryad.bk3j9kdgh

Description of the data and file structure

The catfish samples used in this study were purchased from an ornamental fish wholesale market in Xi'an, China. The remaining samples of this specimen (Catfish_01) have been cryopreserved at -80°C in the Biodiversity Repository of the Institute of Basic and Translational Medicine at Xi'an Medical University. All animal specimens were collected legally and in accordance with the institution's animal care and use ethics policy. Genomic DNA was extracted from the muscle tissue of one suckermouth catfish (P. pardalis) using the Blood & Cell Culture DNA Mini Kit (Qiagen).

Files and variables

File: Pterygoplichthys\_pardalis.cds

Description: Coding sequences (CDS) of the predicted genes

File: Pterygoplichthys\_pardalis.pep

Description: Protein sequences of the predicted genes

File: Pterygoplichthys\_pardalis.gff3

Description: Final gene annotation, integrating the results of three prediction methods (de novo, transcript-based, and homology-based)

File: Pterygoplichthys\_pardalis\_genome.fasta

Description: Genome assembly file

The following files were added in the new version:

File: Pterygoplichthys\_pardalis\_denovo.gff3

Description: The de novo-based prediction of protein-coding genes

File: Pterygoplichthys\_pardalis\_transcripts.gff3

Description: The transcript-based prediction of protein-coding genes

File: Pterygoplichthys\_pardalis\_homology.gff3

Description: The homology-based annotation of protein-coding genes
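A quick way to inspect any of the GFF3 files listed above is to tally records by feature type (column 3 of the tab-separated format). The following is a minimal Python sketch, not part of the original dataset tooling: the function name `count_feature_types` and the example records are made up for illustration, and it is an assumption that the distributed files use standard nine-column GFF3 records with feature types such as `gene` or `CDS`.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count GFF3 records by their feature type (column 3)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF3 record has exactly nine tab-separated columns; ignore anything else.
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


# Tiny made-up example; these records are NOT taken from the dataset.
example = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tdemo\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tdemo\tCDS\t1\t300\t.\t+\t0\tID=g1.t1.cds1;Parent=g1.t1",
    "chr1\tdemo\tCDS\t601\t900\t.\t+\t0\tID=g1.t1.cds2;Parent=g1.t1",
    "chr2\tdemo\tgene\t50\t400\t.\t-\t.\tID=g2",
]
counts = count_feature_types(example)
print(dict(counts))
```

To apply it to a downloaded file, pass an open file handle, for example `count_feature_types(open("Pterygoplichthys_pardalis.gff3"))`. Whether the resulting gene count matches the number of protein-coding genes reported in the abstract depends on how the annotation labels its features.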

Code/software

Genome assembly

The genome was assembled in three steps:

1) Long reads from the Nanopore platform were assembled into contigs using NextDenovo (v2.2). Key parameters included a read cutoff of 1k, a seed cutoff of 59754, and a blocksize of 5g.
2) Cleaned short reads generated from the Illumina short-insert library were mapped onto the assembled contigs using BWA (v0.7.17). To improve the single-base accuracy of the assembly, we performed two iterations of correction using Pilon (v1.22).
3) Hi-C sequencing reads were mapped to the corrected contigs, and Juicer (v1.5.7) and 3D de novo assembly (3D-DNA, v180922) were used to build the chromosome-level assembly.
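Contiguity statistics such as N50 can be recomputed directly from the genome FASTA file. The sketch below is illustrative only and is not the evaluation procedure used by the authors: the helper names `fasta_lengths` and `n50` and the toy sequences are assumptions. N50 is defined here as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
from typing import Iterable, Iterator, List


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each sequence in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # finish the previous record
            length = 0
        elif line and length is not None:
            length += len(line)  # sequences may span several lines
    if length is not None:
        yield length


def n50(lengths: List[int]) -> int:
    """Return N50: walk from longest to shortest until half the total is covered."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy FASTA (made up): sequence lengths are 8, 6, 4 and 2.
demo_fasta = ">a\nACGT\nACGT\n>b\nACGTAC\n>c\nACGT\n>d\nAC\n"
demo_lengths = list(fasta_lengths(demo_fasta.splitlines()))
demo_n50 = n50(demo_lengths)
print(demo_lengths, demo_n50)
```

For the real assembly, the same functions can be fed an open handle to `Pterygoplichthys_pardalis_genome.fasta`. The result is computed over whatever sequences the file contains, so it may differ from a contig-level figure.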

Genome annotation

Tandem repetitive sequences within the genome were identified using Tandem Repeats Finder (v4.07). Non-interspersed repeats in the genome were annotated using RepeatMasker (v4.1.0). Transposable elements (TEs) in the genome were annotated at both the DNA and protein levels. A *de novo* repeat library at the DNA level was constructed using RepeatModeler (v1.0.4), enabling the identification of potential novel repetitive sequences. The genome assembly was searched against Repbase (v23.06) using RepeatMasker to detect homologous repetitive sequences, providing a more comprehensive picture of the repetitive sequence content. RM-BLASTX within RepeatProteinMask (v4.1.0) was employed to query the TE protein database at the protein level.
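When repeat annotations from several tools are combined, their intervals can overlap, so a total repeat length is usually taken over the union of intervals rather than their sum. The following Python sketch shows that interval-merging idea under stated assumptions: it is not the authors' pipeline, the function name `union_length` and the coordinates are hypothetical, and intervals are assumed to be 1-based and inclusive, as in GFF3.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (sequence name, start, end), 1-based inclusive


def union_length(intervals: Iterable[Interval]) -> int:
    """Total number of bases covered by at least one interval."""
    by_seq: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for seq, start, end in intervals:
        by_seq[seq].append((start, end))

    total = 0
    for spans in by_seq.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current block and start a new one.
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


# Made-up intervals, as if reported by different repeat-finding tools.
demo_repeats = [
    ("chr1", 1, 100),
    ("chr1", 51, 150),   # overlaps the first interval
    ("chr1", 201, 250),
    ("chr2", 10, 19),
]
covered = union_length(demo_repeats)
hypothetical_genome_length = 1000  # illustrative value only
repeat_fraction = covered / hypothetical_genome_length
print(covered, repeat_fraction)
```

Dividing the covered length by the assembly length gives a repeat fraction comparable in form to the percentage reported in the abstract, although the exact value depends on which annotation tracks are included.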

Access information

Other publicly accessible locations of the data:

Not applicable

Data was derived from the following sources:

Not applicable

Version changes

29-April-2025:

Added Pterygoplichthys_pardalis_denovo.gff3 file which contains the de novo-based prediction of protein-coding genes;
Added Pterygoplichthys_pardalis_transcripts.gff3 file which contains the transcript-based prediction of protein-coding genes;
Added Pterygoplichthys_pardalis_homology.gff3 file which contains the homology-based annotation of protein-coding genes.