Metagenomic and genomic data of symbiotic bacteria from wood-boring shipworms Lyrodus pedicellatus and Teredo bartschi

Flatau, Ron (1); Gasser, Mark (2); Altamia, Marvin (3); Distel, Dan (1)

Published Dec 22, 2025; Updated Feb 17, 2026 on Dryad. https://doi.org/10.5061/dryad.ksn02v7jd

Data files

Dec 22, 2025 version files (120.50 MB)

README.md: 9.57 KB
Suplementary_Table_S3_2025-09-30.pdf: 86.31 KB
Suplementary_Table_S3_2025-09-30.txt: 7.34 KB
Supplementary_data_description.docx: 14.32 KB
Supplementary_data_files.zip: 120.24 MB
Supplementary_Table_S1_2025-09-29.pdf: 45.99 KB
Supplementary_Table_S1_2025-09-29.txt: 1.39 KB
Supplementary_Table_S2_2025-09-29.pdf: 93.04 KB
Supplementary_Table_S2_2025-09-29.txt: 2.06 KB

Feb 08, 2026 version files (120.52 MB)

README.md: 10.81 KB
Suplementary_Table_S3_2025-09-30.pdf: 86.31 KB
Suplementary_Table_S3_2025-09-30.txt: 7.34 KB
Supplementary_data_description.docx: 14.32 KB
Supplementary_data_files.zip: 120.24 MB
Supplementary_Table_S1_2025-09-29.pdf: 45.99 KB
Supplementary_Table_S1_2025-09-29.txt: 1.39 KB
Supplementary_Table_S2_2026-01-25.pdf: 108.79 KB
Supplementary_Table_S2_2026-01-25.txt: 3.07 KB

Feb 17, 2026 version files (120.52 MB)

README.md: 10.81 KB
Suplementary_Table_S3_2025-09-30.pdf: 86.31 KB
Suplementary_Table_S3_2025-09-30.txt: 7.34 KB
Supplementary_data_description.docx: 14.32 KB
Supplementary_data_files.zip: 120.24 MB
Supplementary_Table_S1_2026-02-15.pdf: 72.97 KB
Supplementary_Table_S1_2026-02-15.txt: 1.85 KB
Supplementary_Table_S2_2026-02-12.pdf: 87.34 KB
Supplementary_Table_S2_2026-02-12.txt: 3.69 KB

Abstract

Shipworms (Bivalvia: Teredinidae) are the most prolific wood consumers in marine environments. These wormlike marine bivalves digest wood using carbohydrate-active enzymes (CAZymes) produced by intracellular bacterial endosymbionts housed within their gills. Although several shipworm species are known to host multiple co-occurring symbiont species, the factors that influence symbiont community assembly, including the phylogenetic identity and metabolic capabilities of the symbionts, remain poorly understood. We sequenced gill symbiont metagenomes from multiple specimens of two shipworm species, Teredo bartschi (22 specimens) and Lyrodus pedicellatus (14 specimens), which were reared together in laboratory co-culture. From these metagenomes, we assembled 90 metagenome-assembled genomes (MAGs) representing seven distinct symbiont species. The metagenome of each host specimen contained between 1 and 5 symbiont species, with each including at least one nitrogen-fixing symbiont. Six of the seven identified symbiont species were found in both host species, demonstrating a lack of host species specificity in these symbioses. We identified patterns of symbiont occurrence and co-occurrence in these two hosts and used these to constrain the core set of CAZyme and nitrogen fixation gene classes necessary to support host survival. Our results indicate that, in these two host species, symbiont community composition reflects the symbionts' capabilities for carbohydrate degradation and nitrogen fixation, rather than strict species-specific mechanisms of host and symbiont sorting.

This dataset supports experimental investigations into the microbiomes of wood-boring shipworms (Lyrodus pedicellatus and Teredo bartschi). Shipworm specimens were collected between 2023 and 2025 and processed for metagenomic sequencing. The dataset includes sample metadata, mitochondrial DNA–based host identification, genome quality metrics for bacterial symbiont isolates and metagenome-assembled genomes (MAGs), pairwise genome similarity metrics, predicted carbohydrate-active enzyme (CAZyme) functions, and genome assemblies in FASTA format.

All supplementary files are intended to support interpretation, reuse, and reanalysis of the data presented in the associated publication.

File Inventory

Supplementary_data_files.zip: Compressed archive containing three Excel datasets and a directory of FASTA genome assemblies.

Contents

CheckM_isolates_results.xlsx
CheckM_MAGs_results.xlsx
ANI_AF_results.xlsx
Isolates_and_MAGs_fasta/ (directory of FASTA files)

Supplementary_Table_S1_2025-09-29.pdf: Sample metadata for all shipworm specimens.
Supplementary_Table_S2_2026-01-25.pdf: Host species identification based on mitochondrial DNA recovered from metagenomes.
Suplementary_Table_S3_2025-09-30.pdf: Core CAZyme module subfamilies with predicted substrates and enzymatic activities.

Additional machine-readable table files (TXT format)

Supplementary_Table_S1_2025-09-29.txt: Plain-text version of Supplementary Table S1 containing sample metadata.
Supplementary_Table_S2_2026-01-25.txt: Plain-text version of Supplementary Table S2 containing mitochondrial DNA–based host identification results.
Suplementary_Table_S3_2025-09-30.txt: Plain-text version of Supplementary Table S3 containing core CAZyme module subfamilies and predicted functions.
Supplementary_Figure_S1.pdf: Schematic representation of the experimental design and bioinformatic workflow.
Supplementary_data_description.docx: Narrative description of the experimental context and purpose of the dataset.

Note on duplicate content (PDF and TXT formats)

Supplementary Tables S1, S2, and S3 are each provided in both PDF and TXT formats. The PDF files are intended for human reading and visual inspection; the TXT files are machine-readable versions suitable for computational reuse. Both formats contain the same underlying data, so users may choose either depending on their intended use.

Detailed Description of Files and Variables

Supplementary_Table_S1_2025-09-29.pdf and Supplementary_Table_S1_2025-09-29.txt

Sample metadata table

Variables

Sample ID: Unique identifier for each shipworm specimen. Type: alphanumeric string.
Host species: Morphological identification of the host species. Values: Lyrodus pedicellatus or Teredo bartschi.
Collection and processing date: Date of specimen collection and processing. Format: dd/mm/yyyy.

Missing values
No missing values reported.
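
Day-first dates such as dd/mm/yyyy are easily misread as mm/dd/yyyy when a table is opened with default settings, so parsing them explicitly may be safer. The following is a minimal sketch, assuming each date cell is a string in exactly the documented format; the function name and the example value are hypothetical and not taken from the table.

```python
from datetime import date, datetime


def parse_collection_date(text: str) -> date:
    """Parse a day-first 'dd/mm/yyyy' string into a date object."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


# Illustrative value only; it does not come from Supplementary Table S1.
example = parse_collection_date("03/11/2024")
print(example.isoformat())  # 2024-11-03 (3 November, not 11 March)
```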

Supplementary_Table_S2_2026-01-25.pdf and Supplementary_Table_S2_2026-01-25.txt

Host species identification based on mitochondrial DNA

This table reports host species identification based on mitochondrial DNA sequences recovered from each sample metagenome. Identification is supported by pairwise sequence identity and recovered fraction relative to reference mitochondrial genomes and mitochondrial cytochrome c oxidase subunit I (COI) sequences.

Variables

Sample ID: Unique identifier for each sample. Type: alphanumeric string.
Morphological identification: Host species identification based on morphology. Values: Lyrodus pedicellatus or Teredo bartschi.
Comparison to GenBank reference mitogenome: L. pedicellatus (OM910820).
Pairwise nucleotide identity and recovered fraction of the mitochondrial genome relative to the L. pedicellatus reference.
Format: identity/recovered portion, reported as proportions between 0 and 1.
Comparison to GenBank reference mitogenome: T. bartschi (OM910823).
Pairwise nucleotide identity and recovered fraction of the mitochondrial genome relative to the T. bartschi reference.
Format: identity/recovered portion, reported as proportions between 0 and 1.
Comparison to GenBank reference COI: L. pedicellatus (OM910820).
Pairwise nucleotide identity and recovered fraction of the COI gene relative to the L. pedicellatus reference.
Format: identity/recovered portion, reported as proportions between 0 and 1.
Comparison to GenBank reference COI: T. bartschi (OM910823).
Pairwise nucleotide identity and recovered fraction of the COI gene relative to the T. bartschi reference.
Format: identity/recovered portion, reported as proportions between 0 and 1.

Interpretation key

Values approaching 1 indicate high sequence identity and near-complete recovery of the corresponding mitochondrial region.
Consistent high identity to one species reference and lower identity to the alternative reference supports host species assignment.
Values reported as 0/0 indicate that no mitochondrial DNA alignment was recovered for that reference sequence.

Missing values

Missing mitochondrial or COI matches are explicitly reported as 0/0 or NA, indicating absence of detectable alignment rather than missing data.
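
The identity/recovered-portion cells described above can be split into two numbers before any comparison between references. The sketch below is one possible way to do this, assuming each cell is a string such as "0.998/0.97", "0/0", or "NA"; the function names and example values are hypothetical. Comparing identity alone is a simplification made for illustration and is not the authors' assignment procedure.

```python
from typing import Optional, Tuple

LP = "Lyrodus pedicellatus"
TB = "Teredo bartschi"


def parse_identity_fraction(cell: str) -> Optional[Tuple[float, float]]:
    """Return (identity, recovered fraction), or None if no alignment was recovered."""
    value = cell.strip()
    if value.upper() == "NA":
        return None
    identity_text, fraction_text = value.split("/")
    identity, fraction = float(identity_text), float(fraction_text)
    if identity == 0.0 and fraction == 0.0:
        # 0/0 means no detectable alignment, not a measured identity of zero.
        return None
    return identity, fraction


def better_supported_host(lp_cell: str, tb_cell: str) -> Optional[str]:
    """Name the reference with the higher identity; None if undecidable."""
    lp = parse_identity_fraction(lp_cell)
    tb = parse_identity_fraction(tb_cell)
    if lp is None and tb is None:
        return None
    if tb is None:
        return LP
    if lp is None:
        return TB
    if lp[0] == tb[0]:
        return None
    return LP if lp[0] > tb[0] else TB


# Illustrative cells only; these values are not taken from Table S2.
print(better_supported_host("0.998/0.97", "0.81/0.42"))
```

In practice the recovered fraction should be inspected alongside identity, since a high identity over a very small recovered portion is weak evidence.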

Suplementary_Table_S3_2025-09-30.pdf and Suplementary_Table_S3_2025-09-30.txt

Core CAZyme module subfamilies and predicted functions

Variables

Subfamily name: Identifier for the CAZyme subfamily. Format: CAZy or dbCAN subfamily nomenclature.
Predicted substrate: Primary carbohydrate or polymer targeted by the enzyme module.
Lignocellulose component: Structural component of lignocellulose associated with the substrate. Values: cellulose, hemicellulose, lignin.
Predicted activity (EC): Enzyme Commission number describing the predicted catalytic activity. Format: EC number or NA.
Notes: Additional functional or biochemical details.
Reference source: Database supporting the annotation. Values include CAZy, BRENDA, and dbCAN.

Missing values

NA indicates cases where no EC number is assigned, typically for binding-only modules or unclassified activities.

CheckM_isolates_results.xlsx

Genome quality assessment of symbiont isolate assemblies

Sheet: CheckM_isolates_results

Variables

Bin Id: Isolate genome identifier.
Marker lineage: CheckM marker lineage assignment.
Genomes: Number of genomes used in marker set definition. Units: count.
Markers: Number of markers in the lineage set. Units: count.
Marker sets: Number of marker sets. Units: count.
Completeness: Estimated genome completeness. Units: percent (%).
Contamination: Estimated genome contamination. Units: percent (%).
Strain heterogeneity: Estimated strain heterogeneity. Units: percent (%).
Genome size (bp): Assembly size. Units: base pairs (bp).
Ambiguous bases: Number of ambiguous bases. Units: count.
Scaffolds: Number of scaffolds. Units: count.
Contigs: Number of contigs. Units: count.
N50 (scaffolds): N50 scaffold length. Units: bp.
N50 (contigs): N50 contig length. Units: bp.
Mean scaffold length (bp): Mean scaffold length. Units: bp.
Mean contig length (bp): Mean contig length. Units: bp.
Longest scaffold (bp): Maximum scaffold length. Units: bp.
Longest contig (bp): Maximum contig length. Units: bp.
GC: GC content. Units: percent (%).
GC std (scaffolds >1 kbp): Standard deviation of GC content among scaffolds longer than 1 kbp. Units: percent (%).
Coding density: Coding density. Units: percent (%).
Translation table: Genetic code translation table identifier. Units: integer.
Predicted genes: Number of predicted genes. Units: count.
Marker occurrence distribution: Counts of markers occurring 0, 1, 2, 3, 4, or 5+ times. Units: count.

Missing values: Blank or NA-like entries where CheckM metrics could not be computed.

CheckM_MAGs_results.xlsx

Genome quality assessment of metagenome-assembled genomes (MAGs)

Sheet: CheckM_MAGs_results

Variables

Sample: Sample identifier.
Species number: Numeric species or group identifier.
Host: Host species label.
Completeness: Estimated genome completeness. Units: percent (%).
Contamination: Estimated genome contamination. Units: percent (%).
Strain heterogeneity: Estimated strain heterogeneity. Units: percent (%).
Genome size (bp): Assembly size. Units: bp.
Contigs: Number of contigs. Units: count.
N50 (contigs): N50 contig length. Units: bp.
Mean contig length (bp): Mean contig length. Units: bp.
GC: GC content. Units: percent (%).
Marker occurrence distribution: Counts of markers occurring 0, 1, 2, 3, 4, or 5+ times. Units: count.
Closest isolate or MAG: Identifier of the closest comparison genome.
ANI (1->2): Average Nucleotide Identity for the reported comparison. Units: percent (%).
AF (1->2): Alignment fraction for the reported comparison. Units: percent (%).

Missing values: Blank or NA-like entries where metrics could not be computed.
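
The Completeness and Contamination columns are commonly used to screen genomes before downstream analysis. The following sketch shows one way to do that, assuming each row has already been loaded as a dictionary keyed by the column names listed above. The function names, the example rows, and the thresholds (90% and 5%) are illustrative assumptions and are not the criteria used by the authors.

```python
from typing import Dict, List, Optional


def to_percent(value) -> Optional[float]:
    """Convert a cell to float; blank or NA-like cells become None."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text.upper() in {"NA", "N/A", "NAN"}:
        return None
    return float(text)


def filter_by_quality(rows: List[Dict], min_completeness: float,
                      max_contamination: float) -> List[Dict]:
    """Keep rows meeting both thresholds; rows with missing metrics are dropped."""
    kept = []
    for row in rows:
        completeness = to_percent(row.get("Completeness"))
        contamination = to_percent(row.get("Contamination"))
        if completeness is None or contamination is None:
            continue
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row)
    return kept


# Made-up rows for illustration; they are not values from the dataset.
rows = [
    {"Sample": "A", "Completeness": 98.2, "Contamination": 0.7},
    {"Sample": "B", "Completeness": 71.5, "Contamination": 1.2},
    {"Sample": "C", "Completeness": "NA", "Contamination": 0.3},
]
passing = filter_by_quality(rows, min_completeness=90.0, max_contamination=5.0)
print([row["Sample"] for row in passing])  # ['A']
```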

ANI_AF_results.xlsx

Pairwise genome similarity metrics

Sheet: Index
Purpose: Interpretation key for the ANI_AF matrix.
Contents: Text indicating that ANI values are reported above the diagonal and alignment fraction values are reported below the diagonal.

Sheet: ANI_AF
Purpose: Square matrix of pairwise genome similarity metrics.

Structure and variables

Row labels: Genome IDs.
Column labels: Genome IDs.

Matrix values:

Above diagonal: Average Nucleotide Identity (ANI). Units: percent (%).
Below diagonal: Alignment fraction (AF). Units: percent (%).
Diagonal: Self-comparisons, typically 100 percent.

Missing values: Blank cells or NA-like entries where ANI or AF could not be computed.
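
Because the ANI_AF sheet stores two metrics in one square matrix, it usually needs to be separated before analysis. The sketch below assumes the matrix has already been read into a list of rows of numbers with the labels removed; the function name and the example values are hypothetical. Mirroring each triangle to fill a full matrix is a convenience of this sketch: the sheet holds only one AF value per genome pair, so the mirrored AF matrix should not be read as giving both alignment directions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_ani_af(combined: Matrix) -> Tuple[Matrix, Matrix]:
    """Split a combined matrix into ANI (upper triangle) and AF (lower triangle).

    Each triangle is mirrored across the diagonal; diagonal values are kept.
    """
    n = len(combined)
    if any(len(row) != n for row in combined):
        raise ValueError("matrix must be square")
    ani = [[combined[min(i, j)][max(i, j)] for j in range(n)] for i in range(n)]
    af = [[combined[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]
    return ani, af


# Made-up 3 x 3 example: ANI above the diagonal, AF below it.
combined = [
    [100.0, 98.5, 80.1],
    [91.0, 100.0, 79.9],
    [40.2, 38.7, 100.0],
]
ani, af = split_ani_af(combined)
print(ani[1][0], af[0][1])  # 98.5 91.0
```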

Isolates_and_MAGs_fasta

FASTA genome assemblies

Description: Directory containing nucleotide genome assemblies of symbiont isolates and metagenome-assembled genomes.

Variables

FASTA header: Assembly or contig identifier.
Sequence: Nucleotide sequence composed of standard IUPAC DNA characters.

Missing values: Not applicable.
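
Basic assembly statistics such as genome size, contig count, N50, and GC content can be recomputed directly from the FASTA files as a consistency check against the CheckM tables. The following is a minimal standard-library sketch; the function names and the toy records are hypothetical. Computing GC over unambiguous bases (A, C, G, T) only is an assumption of this sketch, so small differences from the reported values are possible.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}


def n50(lengths: List[int]) -> int:
    """Length of the sequence at which the sorted running total reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_summary(records: Dict[str, str]) -> Dict[str, float]:
    """Summarize size (bp), contig count, N50 (bp), and GC content (%)."""
    lengths = [len(seq) for seq in records.values()]
    gc = sum(seq.count("G") + seq.count("C") for seq in records.values())
    acgt = sum(seq.count(base) for seq in records.values() for base in "ACGT")
    return {
        "size_bp": sum(lengths),
        "contigs": len(lengths),
        "n50": n50(lengths),
        "gc_percent": 100.0 * gc / acgt if acgt else 0.0,
    }


# Toy records for illustration; real files would be read with open(path).
toy = [">c1", "ACGTACGTAC", ">c2", "GGCC", ">c3", "AT"]
summary = assembly_summary(parse_fasta(toy))
print(summary)
```

To apply this to a file from the Isolates_and_MAGs_fasta directory, pass an open file handle to parse_fasta in place of the toy list. Records with identical headers would overwrite one another in this sketch.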

Experimental Workflow

A schematic overview of the experimental design and bioinformatic workflow is provided in Supplementary_Figure_S1.pdf. The figure illustrates shipworm cultivation, gill dissection, DNA extraction, metagenomic sequencing, read processing, genome assembly, binning, quality control, taxonomic assignment, and downstream comparative analyses.

Software Requirements

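TXT and FASTA files can be opened with any standard text editor or bioinformatics software. The XLSX files require spreadsheet software or a library that reads the Excel format, and the PDF and DOCX files require a PDF viewer and a word processor, respectively.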
