Data from: Comparative genomics of Pinna rudis and Pinna nobilis reveals conserved and divergent features of the bivalve defensome
Data files
May 12, 2026 version files (132.54 MB total)
- Pinna_nobilis_cds_description.txt (8.79 MB)
- Pinna_nobilis_longest_cds.fasta (60.74 MB)
- Pinna_predicted_TLR.fasta (145.62 KB)
- Pinna_rudis_cds_description.txt (7.85 MB)
- Pinna_rudis_longest_cds.fasta (55.01 MB)
- README.md (3.22 KB)
Abstract
This dataset contains the first high-quality annotated genome assembly of Pinna rudis and an updated genome assembly of Pinna nobilis. The assemblies were generated to facilitate studies on genetic factors associated with mass mortality events that have affected P. nobilis but not P. rudis. The dataset includes predicted coding sequences for both genomes and their corresponding functional annotations. These resources support future analyses of genomic variation, immune-related gene families, and potential mechanisms of disease resilience in Pinna species.
Dataset DOI: 10.5061/dryad.0rxwdbsf2
Description of the data and file structure
README – Pinna rudis and Pinna nobilis Genome Assemblies and Annotations
This dataset contains the first high-quality annotated genome assembly of Pinna rudis (NCBI SUB15717246) and an updated genome assembly of Pinna nobilis (NCBI SUB15718621). The resources include predicted coding sequences (CDS) and functional annotation files. These files support investigations into genomic variation and potential mechanisms of disease resilience in Pinna species.
Files and variables
The dataset is organized into the following components:
- Coding DNA sequences: Pinna_rudis_longest_cds.fasta, Pinna_nobilis_longest_cds.fasta
- Annotations: Pinna_rudis_cds_description.txt, Pinna_nobilis_cds_description.txt
- Predicted TLR (Toll-like receptor) proteins: Pinna_predicted_TLR.fasta
The genome assemblies have been deposited at NCBI and are currently being processed.
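For readers who want to inspect the FASTA files listed above without extra dependencies, a minimal Python sketch follows. The function names `read_fasta` and `summarize` are illustrative, and taking the first whitespace-delimited token of each header as the record ID is an assumption; the README does not describe the header layout.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dictionary.

    Assumption: the record ID is the first whitespace-delimited token of the
    header line. The dataset README does not specify the header layout.
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


def summarize(records: Dict[str, str]) -> Dict[str, int]:
    """Return the record count and the minimum, maximum, and total length."""
    lengths = [len(seq) for seq in records.values()]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "total": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "total": sum(lengths),
    }
```

Because an open file handle is an iterable of lines, `read_fasta(open("Pinna_rudis_longest_cds.fasta"))` should work directly, and `summarize` then gives a quick sanity check on the number and length of the sequences.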
Code/software
The genome assemblies and annotations in this dataset were generated using a combination of long-read and transcriptome-based workflows.

Raw PacBio CCS reads were quality-trimmed with Fastp v0.23.4, and contaminant sequences were screened with Kraken2 (confidence 0.2 for long reads). Preliminary de novo assemblies were produced with Flye v2.9.3-b1797, followed by removal of haplotypic duplications with purge_dups.

RNA-seq reads used for scaffolding and gene prediction were filtered with Trimmomatic v0.39 (LEADING:6, TRAILING:8, SLIDINGWINDOW:10:30, MINLEN:120) and screened with Kraken2 (confidence 0.1). Transcriptomes were assembled with Trinity (Galaxy Version 2.15.1) and evaluated with BUSCO.

Genome completeness was assessed using BUSCO v6 with metaeuk_odb12 and molluscan_odb12, and assembly statistics were obtained with QUAST v5.2.0. Repeat families were identified with RepeatModeler and soft-masked with RepeatMasker v4.1.3.

Gene prediction was performed with BRAKER3 (GeneMark-EX) on the unmasked assemblies using RNA-seq evidence. The most probable transcripts from gene prediction were subsequently selected with TSEBRA.

Predicted transcripts containing an identifiable CDS and longer than 90 nucleotides were translated into their corresponding proteins and functionally annotated with Diamond v2.1.11 against either the NCBI non-redundant (nr) protein database (April 2025 release), with the query-cover and e-value parameters set at 50% and 1e-10, respectively, or a reference set of bivalve proteins (UniProt; Taxon ID 6544; November 2025 release) or RefSeq Protein, with the e-value set at 1e-5.

A final consolidated annotation set was generated iteratively, starting from the annotations obtained with the molluscan dataset. For sequences initially annotated as "uncharacterized proteins," the annotation was progressively refined by replacing it with a match from RefSeq Protein and, if necessary, from the nr database.
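The iterative consolidation of annotations described above can be illustrated with a short Python sketch. It is not the authors' script. The function names, the dictionary-based input, and the matching rule (a case-insensitive search for "uncharacterized protein") are assumptions made for illustration. The sketch also assumes that a sequence with no hit in the first dataset falls through to the next one, which the README does not state.

```python
from typing import Dict, Iterable, Optional, Sequence

# Assumption: an annotation counts as uninformative when its description
# contains this phrase (case-insensitive).
PLACEHOLDER = "uncharacterized protein"


def is_informative(description: Optional[str]) -> bool:
    """Return True if the description is non-empty and not a placeholder."""
    if description is None:
        return False
    text = description.strip().lower()
    return bool(text) and PLACEHOLDER not in text


def consolidate(
    gene_ids: Iterable[str],
    sources: Sequence[Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Pick one description per gene from annotation sources in priority order.

    `sources` is ordered from highest to lowest priority, for example
    [first_dataset_hits, refseq_hits, nr_hits]. The first informative
    description wins. If no source is informative, the first available
    description is kept as a fallback; genes with no hit at all map to None.
    """
    final: Dict[str, Optional[str]] = {}
    for gene_id in gene_ids:
        chosen: Optional[str] = None
        for hits in sources:
            candidate = hits.get(gene_id)
            if candidate is None:
                continue
            if chosen is None:
                chosen = candidate  # remember the first hit as a fallback
            if is_informative(candidate):
                chosen = candidate
                break
        final[gene_id] = chosen
    return final
```

Passing the sources as an ordered sequence keeps the priority explicit, so the same function covers a different ordering or an additional database without code changes.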
Gene Ontology identifiers (GO IDs) were obtained using InterProScan 5 (interproscan-5.74-105.0-64).
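The step in which predicted transcripts longer than 90 nucleotides were translated into proteins can be sketched as follows. This is a minimal illustration, not the tool used by the authors: it assumes the standard genetic code (NCBI translation table 1), assumes each CDS starts in frame at its first base, and uses the hypothetical function names `translate` and `translate_long_cds`.

```python
from itertools import product
from typing import Dict

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Length threshold stated in the methods: strictly longer than 90 nucleotides.
MIN_LENGTH_NT = 90


def translate(cds: str) -> str:
    """Translate a CDS in frame 0; unknown codons become 'X'.

    Terminal stop codons are stripped; internal stops remain as '*'.
    Any trailing bases that do not form a full codon are ignored.
    """
    seq = cds.upper().replace("U", "T")
    protein = [
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    ]
    return "".join(protein).rstrip("*")


def translate_long_cds(
    records: Dict[str, str],
    min_length: int = MIN_LENGTH_NT,
) -> Dict[str, str]:
    """Translate only the sequences strictly longer than `min_length`."""
    return {
        record_id: translate(seq)
        for record_id, seq in records.items()
        if len(seq) > min_length
    }
```

The input is a plain `{record_id: sequence}` dictionary, so the output of any FASTA parser can be passed in directly.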
