Data from: Comparative genomics of Pinna rudis and Pinna nobilis reveals conserved and divergent features of the bivalve defensome

Coupé, Stéphane1; Foulquié, Mathieu1,2; Vazquez Luis, Maite3; Alvarez Perez, Elvira3; Prévot, Jean-Marc4; Vicente, Nardo5; Bunet, Robert2

Published May 12, 2026 on Dryad. https://doi.org/10.5061/dryad.0rxwdbsf2

Data files

May 12, 2026 version files (132.54 MB total):

Pinna_nobilis_cds_description.txt (8.79 MB)
Pinna_nobilis_longest_cds.fasta (60.74 MB)
Pinna_predicted_TLR.fasta (145.62 KB)
Pinna_rudis_cds_description.txt (7.85 MB)
Pinna_rudis_longest_cds.fasta (55.01 MB)
README.md (3.22 KB)

Abstract

This dataset contains the first high-quality annotated genome assembly of Pinna rudis and an updated genome assembly of Pinna nobilis. The assemblies were generated to facilitate studies on genetic factors associated with mass mortality events that have affected P. nobilis but not P. rudis. The dataset includes predicted coding sequences for both genomes and their corresponding functional annotations. These resources support future analyses of genomic variation, immune-related gene families, and potential mechanisms of disease resilience in Pinna species.

Dataset DOI: 10.5061/dryad.0rxwdbsf2

Description of the data and file structure

README – Pinna rudis and Pinna nobilis Genome Assemblies and Annotations

This dataset contains the first high-quality annotated genome assembly of Pinna rudis (NCBI SUB15717246) and an updated genome assembly of Pinna nobilis (NCBI SUB15718621). The resources include predicted coding sequences (CDS) and functional annotation files. These files support investigations into genomic variation and potential mechanisms of disease resilience in Pinna species.

Files and variables

The dataset is organized into the following components:

Coding DNA sequences (CDS): Pinna_rudis_longest_cds.fasta, Pinna_nobilis_longest_cds.fasta
Annotations: Pinna_rudis_cds_description.txt, Pinna_nobilis_cds_description.txt
Predicted Toll-like receptor (TLR) proteins: Pinna_predicted_TLR.fasta
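
As a starting point for working with the FASTA files listed above, the following is a minimal sketch of a dependency-free FASTA reader. It assumes only the generic FASTA convention (a `>` header line followed by one or more sequence lines) and takes the first whitespace-delimited token of each header as the record ID; the actual header layout of these files is not described here, so that choice is an assumption. The function name `parse_fasta` is hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines.

    Assumption: the record ID is the first whitespace-delimited token
    after the '>' character; the rest of the header is ignored.
    """
    record_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            tokens = line[1:].split()
            record_id = tokens[0] if tokens else ""
            chunks = []
        else:
            chunks.append(line)
    # Emit the last record in the file.
    if record_id is not None:
        yield record_id, "".join(chunks)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for instance `dict(parse_fasta(open("Pinna_rudis_longest_cds.fasta")))` to build an ID-to-sequence mapping in memory.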

The genome assemblies have been deposited at NCBI and are currently being processed.

Code/software

The genome assemblies and annotations in this dataset were generated using a combination of long-read and transcriptome-based workflows. Raw PacBio CCS reads were quality-trimmed with fastp v0.23.4 and screened for contaminant sequences with Kraken2 (confidence 0.2 for long reads). Preliminary de novo assemblies were produced with Flye v2.9.3-b1797, and haplotypic duplications were then removed with purge_dups.

RNA-seq reads used for scaffolding and gene prediction were filtered with Trimmomatic v0.39 (LEADING:6, TRAILING:8, SLIDINGWINDOW:10:30, MINLEN:120) and screened with Kraken2 (confidence 0.1). Transcriptomes were assembled with Trinity (Galaxy Version 2.15.1) and evaluated with BUSCO.

Genome completeness was assessed using BUSCO v6 with metaeuk_odb12 and molluscan_odb12, and assembly statistics were obtained with QUAST v5.2.0. Repeat families were identified with RepeatModeler and soft-masked with RepeatMasker v4.1.3. Gene prediction was performed with BRAKER3 (GeneMark-EX) on the unmasked assemblies using RNA-seq evidence, and the most probable transcripts were subsequently selected with TSEBRA.

Predicted transcripts that contained an identifiable CDS and were longer than 90 nucleotides were translated into their corresponding proteins and functionally annotated with DIAMOND v2.1.11 against either the NCBI non-redundant (nr) protein database (April 2025 release), with the query-cover and e-value thresholds set to 50% and 1e-10, respectively, or a reference set of bivalve proteins (UniProt; Taxon ID 6544; November 2025 release), or RefSeq Protein, with the e-value threshold set to 1e-5.

A final consolidated annotation set was generated iteratively, starting from the annotations obtained with the molluscan dataset. Sequences initially annotated as “uncharacterized proteins” were progressively re-annotated with matches from RefSeq Protein and, if necessary, from the nr database.
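
The iterative consolidation of annotations can be illustrated with a short sketch. This is not the authors' script: it assumes each database search has already been reduced to a plain mapping from sequence ID to best-hit description, and the names `consolidate` and `is_uninformative` are hypothetical. Two further assumptions go beyond what the text states: a sequence with no hit in the first dataset is treated like an "uncharacterized" one, and a fallback hit replaces the current annotation only if it is itself informative.

```python
from typing import Dict, Iterable, Mapping, Optional

# Marker used to recognise uninformative descriptions (case-insensitive).
PLACEHOLDER = "uncharacterized protein"


def is_uninformative(description: Optional[str]) -> bool:
    """True if there is no hit or the hit is an 'uncharacterized protein'."""
    return description is None or PLACEHOLDER in description.lower()


def consolidate(
    seq_ids: Iterable[str],
    primary: Mapping[str, str],
    refseq: Mapping[str, str],
    nr: Mapping[str, str],
) -> Dict[str, Optional[str]]:
    """Start from the primary annotations, then fall back to RefSeq, then nr."""
    final: Dict[str, Optional[str]] = {}
    for seq_id in seq_ids:
        description = primary.get(seq_id)
        for fallback in (refseq, nr):
            if not is_uninformative(description):
                break  # already informative, no refinement needed
            candidate = fallback.get(seq_id)
            # Assumption: only an informative hit replaces the current one.
            if not is_uninformative(candidate):
                description = candidate
        final[seq_id] = description
    return final
```

Keeping the fallback order as a tuple makes the priority explicit and easy to change if a different database ordering is preferred.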
Gene Ontology (GO) identifiers were obtained with InterProScan 5 (interproscan-5.74-105.0-64).
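
The length filter and translation step mentioned in the workflow can be sketched in plain Python as follows. This is an illustration only, under several assumptions: the standard genetic code (NCBI translation table 1) is used, the length threshold is applied to the nucleotide sequence passed in, translation starts at the first base and stops at the first stop codon, and any codon containing an ambiguous base is written as `X`. The function names `translate` and `translate_long_cds` are hypothetical.

```python
from itertools import product
from typing import Dict, Mapping

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a CDS in frame 0, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    # Ignore a trailing partial codon, if any.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def translate_long_cds(records: Mapping[str, str], min_length: int = 90) -> Dict[str, str]:
    """Translate only sequences strictly longer than min_length nucleotides."""
    return {
        seq_id: translate(seq)
        for seq_id, seq in records.items()
        if len(seq) > min_length
    }
```

The comparison is strict (`>`), mirroring the "longer than 90 nucleotides" wording, so a sequence of exactly 90 nucleotides is excluded.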