Metagenomic and genomic data of symbiotic bacteria from wood-boring shipworms Lyrodus pedicellatus and Teredo bartschi
Data files
Dec 22, 2025 version files 120.50 MB
- README.md (9.57 KB)
- Suplementary_Table_S3_2025-09-30.pdf (86.31 KB)
- Suplementary_Table_S3_2025-09-30.txt (7.34 KB)
- Supplementary_data_description.docx (14.32 KB)
- Supplementary_data_files.zip (120.24 MB)
- Supplementary_Table_S1_2025-09-29.pdf (45.99 KB)
- Supplementary_Table_S1_2025-09-29.txt (1.39 KB)
- Supplementary_Table_S2_2025-09-29.pdf (93.04 KB)
- Supplementary_Table_S2_2025-09-29.txt (2.06 KB)
Feb 08, 2026 version files 120.52 MB
- README.md (10.81 KB)
- Suplementary_Table_S3_2025-09-30.pdf (86.31 KB)
- Suplementary_Table_S3_2025-09-30.txt (7.34 KB)
- Supplementary_data_description.docx (14.32 KB)
- Supplementary_data_files.zip (120.24 MB)
- Supplementary_Table_S1_2025-09-29.pdf (45.99 KB)
- Supplementary_Table_S1_2025-09-29.txt (1.39 KB)
- Supplementary_Table_S2_2026-01-25.pdf (108.79 KB)
- Supplementary_Table_S2_2026-01-25.txt (3.07 KB)
Feb 17, 2026 version files 120.52 MB
- README.md (10.81 KB)
- Suplementary_Table_S3_2025-09-30.pdf (86.31 KB)
- Suplementary_Table_S3_2025-09-30.txt (7.34 KB)
- Supplementary_data_description.docx (14.32 KB)
- Supplementary_data_files.zip (120.24 MB)
- Supplementary_Table_S1_2026-02-15.pdf (72.97 KB)
- Supplementary_Table_S1_2026-02-15.txt (1.85 KB)
- Supplementary_Table_S2_2026-02-12.pdf (87.34 KB)
- Supplementary_Table_S2_2026-02-12.txt (3.69 KB)
Abstract
Shipworms (Bivalvia: Teredinidae) are the most prolific wood consumers in marine environments. These wormlike marine bivalves digest wood using carbohydrate-active enzymes (CAZymes) produced by intracellular bacterial endosymbionts housed within their gills. Although several shipworm species are known to host multiple co-occurring symbiont species, the factors that influence symbiont community assembly, including the phylogenetic identity and metabolic capabilities of the symbionts, remain poorly understood. We sequenced gill symbiont metagenomes from multiple specimens of two shipworm species, Teredo bartschi (22 specimens) and Lyrodus pedicellatus (14 specimens), which were reared together in laboratory co-culture. From these metagenomes, we assembled 90 metagenome-assembled genomes (MAGs) representing seven distinct symbiont species. The metagenome of each host specimen contained between 1 and 5 symbiont species, with each including at least one nitrogen-fixing symbiont. Six of the seven identified symbiont species were found in both host species, demonstrating a lack of host species specificity in these symbioses. We identified patterns of symbiont occurrence and co-occurrence in these two hosts and used these to constrain the core set of CAZyme and nitrogen fixation gene classes necessary to support host survival. Our results indicate that, in these two host species, symbiont community composition reflects the symbionts' capabilities for carbohydrate degradation and nitrogen fixation, rather than strict species-specific mechanisms of host and symbiont sorting.
This dataset supports experimental investigations into the microbiomes of wood-boring shipworms (Lyrodus pedicellatus and Teredo bartschi). Shipworm specimens were collected between 2023 and 2025 and processed for metagenomic sequencing. The dataset includes sample metadata, mitochondrial DNA–based host identification, genome quality metrics for bacterial symbiont isolates and metagenome-assembled genomes (MAGs), pairwise genome similarity metrics, predicted carbohydrate-active enzyme (CAZyme) functions, and genome assemblies in FASTA format.
All supplementary files are intended to support interpretation, reuse, and reanalysis of the data presented in the associated publication.
File Inventory
Supplementary_data_files.zip
Compressed archive containing three Excel datasets and a directory of FASTA genome assemblies.
Contents
- CheckM_isolates_results.xlsx
- CheckM_MAGs_results.xlsx
- ANI_AF_results.xlsx
- Isolates_and_MAGs_fasta/ (directory of FASTA files)
- Supplementary_Table_S1_2025-09-29.pdf: Sample metadata for all shipworm specimens.
- Supplementary_Table_S2_2026-01-25.pdf: Host species identification based on mitochondrial DNA recovered from metagenomes.
- Suplementary_Table_S3_2025-09-30.pdf: Core CAZyme module subfamilies with predicted substrates and enzymatic activities.
Additional machine-readable table files (TXT format)
- Supplementary_Table_S1_2025-09-29.txt: Plain-text version of Supplementary Table S1 containing sample metadata.
- Supplementary_Table_S2_2026-01-25.txt: Plain-text version of Supplementary Table S2 containing mitochondrial DNA–based host identification results.
- Suplementary_Table_S3_2025-09-30.txt: Plain-text version of Supplementary Table S3 containing core CAZyme module subfamilies and predicted functions.
- Supplementary_Figure_S1.pdf: Schematic representation of the experimental design and bioinformatic workflow.
- Supplementary_data_description.docx: Narrative description of the experimental context and purpose of the dataset.
Note on duplicate content (PDF and TXT formats)
Supplementary Tables S1, S2, and S3 are each provided in both PDF and TXT formats. The PDF files are included for human readability and visual inspection, while the TXT files provide machine-readable versions suitable for computational reuse. The PDF and TXT versions contain identical underlying data, with no differences in content. Users may choose either format depending on their intended use.
Detailed Description of Files and Variables
Supplementary_Table_S1_2025-09-29.pdf and Supplementary_Table_S1_2025-09-29.txt
Sample metadata table
Variables
- Sample ID: Unique identifier for each shipworm specimen. Type: alphanumeric string.
- Host species: Morphological identification of the host species. Values: Lyrodus pedicellatus or Teredo bartschi.
- Collection and processing date: Date of specimen collection and processing. Format: dd/mm/yyyy.
Missing values
No missing values reported.
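For computational reuse, the dd/mm/yyyy dates can be converted to unambiguous date objects. The following is a minimal sketch; the function name and the example value are hypothetical and are not taken from the table.

```python
from datetime import date, datetime


def parse_collection_date(text: str) -> date:
    """Parse a dd/mm/yyyy string (the format stated for Table S1) into a date."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


# Hypothetical value, used only to illustrate day-first parsing.
example = parse_collection_date("07/03/2024")
print(example.isoformat())  # 2024-03-07 (7 March, not 3 July)
```

Parsing with an explicit day-first format avoids the month/day swap that can occur when a tool guesses the date format.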
Supplementary_Table_S2_2026-01-25.pdf and Supplementary_Table_S2_2026-01-25.txt
Host species identification based on mitochondrial DNA
This table reports host species identification based on mitochondrial DNA sequences recovered from each sample metagenome. Identification is supported by pairwise sequence identity and recovered fraction relative to reference mitochondrial genomes and mitochondrial cytochrome c oxidase subunit I (COI) sequences.
Variables
- Sample ID: Unique identifier for each sample. Type: alphanumeric string.
- Morphological identification: Host species identification based on morphology. Values: Lyrodus pedicellatus or Teredo bartschi.
- Comparison to GenBank reference mitogenome: L. pedicellatus (OM910820). Pairwise nucleotide identity and recovered fraction of the mitochondrial genome relative to the L. pedicellatus reference. Format: identity/recovered portion, reported as proportions between 0 and 1.
- Comparison to GenBank reference mitogenome: T. bartschi (OM910823). Pairwise nucleotide identity and recovered fraction of the mitochondrial genome relative to the T. bartschi reference. Format: identity/recovered portion, reported as proportions between 0 and 1.
- Comparison to GenBank reference COI: L. pedicellatus (OM910820). Pairwise nucleotide identity and recovered fraction of the COI gene relative to the L. pedicellatus reference. Format: identity/recovered portion, reported as proportions between 0 and 1.
- Comparison to GenBank reference COI: T. bartschi (OM910823). Pairwise nucleotide identity and recovered fraction of the COI gene relative to the T. bartschi reference. Format: identity/recovered portion, reported as proportions between 0 and 1.
Interpretation key
- Values approaching 1 indicate high sequence identity and near-complete recovery of the corresponding mitochondrial region.
- Consistent high identity to one species reference and lower identity to the alternative reference supports host species assignment.
- Values reported as 0/0 indicate that no mitochondrial DNA alignment was recovered for that reference sequence.
Missing values
- Missing mitochondrial or COI matches are explicitly reported as 0/0 or NA, indicating absence of detectable alignment rather than missing data.
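The identity/recovered cells described above can be split into two numbers before analysis. The sketch below assumes each cell is a string such as "0.99/0.95", "0/0", or "NA"; the names `MitoMatch`, `parse_match`, and `better_reference` are hypothetical, and the comparison rule is a simplified reading of the interpretation key, not the authors' stated assignment criteria.

```python
from typing import NamedTuple, Optional


class MitoMatch(NamedTuple):
    identity: float   # pairwise nucleotide identity, proportion 0 to 1
    recovered: float  # recovered fraction of the reference, proportion 0 to 1


def parse_match(cell: str) -> Optional[MitoMatch]:
    """Parse an 'identity/recovered' cell; return None for NA, blank, or 0/0."""
    text = cell.strip()
    if text == "" or text.upper() == "NA":
        return None
    identity_text, recovered_text = text.split("/")
    match = MitoMatch(float(identity_text), float(recovered_text))
    if match.identity == 0 and match.recovered == 0:
        return None  # 0/0 means no alignment was recovered
    return match


def better_reference(lp_cell: str, tb_cell: str) -> Optional[str]:
    """Return the species whose reference shows higher identity, if any.

    Simplifying assumption: only identity is compared, and ties or
    missing alignments on both sides yield None.
    """
    lp, tb = parse_match(lp_cell), parse_match(tb_cell)
    if lp is None and tb is None:
        return None
    if tb is None:
        return "Lyrodus pedicellatus"
    if lp is None:
        return "Teredo bartschi"
    if lp.identity == tb.identity:
        return None
    return "Lyrodus pedicellatus" if lp.identity > tb.identity else "Teredo bartschi"
```

With illustrative (not real) values, `better_reference("0.99/0.95", "0.81/0.40")` returns "Lyrodus pedicellatus". A real reanalysis would likely also weigh the recovered fraction, since a high identity over a very short alignment is weak evidence.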
Suplementary_Table_S3_2025-09-30.pdf and Suplementary_Table_S3_2025-09-30.txt
Core CAZyme module subfamilies and predicted functions
Variables
- Subfamily name: Identifier for the CAZyme subfamily. Format: CAZy or dbCAN subfamily nomenclature.
- Predicted substrate: Primary carbohydrate or polymer targeted by the enzyme module.
- Lignocellulose component: Structural component of lignocellulose associated with the substrate. Values: cellulose, hemicellulose, lignin.
- Predicted activity (EC): Enzyme Commission number describing the predicted catalytic activity. Format: EC number or NA.
- Notes: Additional functional or biochemical details.
- Reference source: Database supporting the annotation. Values include CAZy, BRENDA, and dbCAN.
Missing values
- NA indicates cases where no EC number is assigned, typically for binding-only modules or unclassified activities.
CheckM_isolates_results.xlsx
Genome quality assessment of symbiont isolate assemblies
Sheet: CheckM_isolates_results
Variables
- Bin Id: Isolate genome identifier.
- Marker lineage: CheckM marker lineage assignment.
- Genomes: Number of genomes used in marker set definition. Units: count.
- Markers: Number of markers in the lineage set. Units: count.
- Marker sets: Number of marker sets. Units: count.
- Completeness: Estimated genome completeness. Units: percent (%).
- Contamination: Estimated genome contamination. Units: percent (%).
- Strain heterogeneity: Estimated strain heterogeneity. Units: percent (%).
- Genome size (bp): Assembly size. Units: base pairs (bp).
- Ambiguous bases: Number of ambiguous bases. Units: count.
- Scaffolds: Number of scaffolds. Units: count.
- Contigs: Number of contigs. Units: count.
- N50 (scaffolds): N50 scaffold length. Units: bp.
- N50 (contigs): N50 contig length. Units: bp.
- Mean scaffold length (bp): Mean scaffold length. Units: bp.
- Mean contig length (bp): Mean contig length. Units: bp.
- Longest scaffold (bp): Maximum scaffold length. Units: bp.
- Longest contig (bp): Maximum contig length. Units: bp.
- GC: GC content. Units: percent (%).
- GC std (scaffolds >1 kbp): Standard deviation of GC content among scaffolds longer than 1 kbp. Units: percent (%).
- Coding density: Coding density. Units: percent (%).
- Translation table: Genetic code translation table identifier. Units: integer.
- Predicted genes: Number of predicted genes. Units: count.
- Marker occurrence distribution: Counts of markers occurring 0, 1, 2, 3, 4, or 5+ times. Units: count.
Missing values: Blank or NA-like entries where CheckM metrics could not be computed.
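For readers unfamiliar with the N50 columns, the statistic can be recomputed from a list of scaffold or contig lengths. This is a sketch of the usual definition (the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size); tools may differ slightly in edge-case conventions, and the function name `n50` is hypothetical.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


# Illustrative lengths only: total 1000 bp, half 500 bp.
# 400 alone is below 500; 400 + 300 = 700 reaches it, so N50 is 300.
print(n50([100, 200, 300, 400]))  # 300
```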
CheckM_MAGs_results.xlsx
Genome quality assessment of metagenome-assembled genomes (MAGs)
Sheet: CheckM_MAGs_results
Variables
- Sample: Sample identifier.
- Species number: Numeric species or group identifier.
- Host: Host species label.
- Completeness: Estimated genome completeness. Units: percent (%).
- Contamination: Estimated genome contamination. Units: percent (%).
- Strain heterogeneity: Estimated strain heterogeneity. Units: percent (%).
- Genome size (bp): Assembly size. Units: bp.
- Contigs: Number of contigs. Units: count.
- N50 (contigs): N50 contig length. Units: bp.
- Mean contig length (bp): Mean contig length. Units: bp.
- GC: GC content. Units: percent (%).
- Marker occurrence distribution: Counts of markers occurring 0, 1, 2, 3, 4, or 5+ times. Units: count.
- Closest isolate or MAG: Identifier of the closest comparison genome.
- ANI (1->2): Average Nucleotide Identity for the reported comparison. Units: percent (%).
- AF (1->2): Alignment fraction for the reported comparison. Units: percent (%).
Missing values: Blank or NA-like entries where metrics could not be computed.
ANI_AF_results.xlsx
Pairwise genome similarity metrics
Sheet: Index
Purpose: Interpretation key for the ANI_AF matrix.
Contents: Text indicating that ANI values are reported above the diagonal and alignment fraction values are reported below the diagonal.
Sheet: ANI_AF
Purpose: Square matrix of pairwise genome similarity metrics.
Structure and variables
- Row labels: Genome IDs.
- Column labels: Genome IDs.
Matrix values:
- Above diagonal: Average Nucleotide Identity (ANI). Units: percent (%).
- Below diagonal: Alignment fraction (AF). Units: percent (%).
- Diagonal: Self-comparisons, typically 100 percent.
Missing values: Blank cells or NA-like entries where ANI or AF could not be computed.
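Because one square matrix holds two different metrics, it is usually convenient to separate them before analysis. The sketch below assumes the ANI_AF sheet has already been loaded into a list of genome IDs and a list of rows (for example with a spreadsheet library, not shown); the function name `split_ani_af` and the values are hypothetical. Keys are kept as (row ID, column ID) because the source does not state the direction of each comparison.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def split_ani_af(
    ids: Sequence[str], matrix: Sequence[Sequence[float]]
) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Split a mixed matrix into ANI (above diagonal) and AF (below diagonal).

    Both dictionaries are keyed by (row ID, column ID) of the source cell.
    """
    ani: Dict[Pair, float] = {}
    af: Dict[Pair, float] = {}
    n = len(ids)
    for i in range(n):
        for j in range(i + 1, n):
            ani[(ids[i], ids[j])] = matrix[i][j]  # cell above the diagonal
            af[(ids[j], ids[i])] = matrix[j][i]   # mirrored cell below it
    return ani, af


# Illustrative values only, not taken from the dataset.
genome_ids: List[str] = ["A", "B", "C"]
mixed = [
    [100.0, 95.1, 80.2],
    [90.0, 100.0, 79.5],
    [60.0, 58.5, 100.0],
]
ani, af = split_ani_af(genome_ids, mixed)
print(ani[("A", "B")], af[("B", "A")])  # 95.1 90.0
```

Keeping the two metrics in separate structures prevents accidental mixing, for example when averaging values across a row of the original sheet.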
Isolates_and_MAGs_fasta
FASTA genome assemblies
Description: Directory containing nucleotide genome assemblies of symbiont isolates and metagenome-assembled genomes.
Variables
- FASTA header: Assembly or contig identifier.
- Sequence: Nucleotide sequence composed of standard IUPAC DNA characters.
Missing values: Not applicable.
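The FASTA files can be read without specialized software. The following sketch uses only the Python standard library to count contigs, total length, and GC content; the function names are hypothetical, and GC is computed over unambiguous A/C/G/T bases only, which may differ slightly from the values reported by CheckM.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def assembly_stats(handle: TextIO) -> Dict[str, float]:
    """Return contig count, total length (bp), and GC percent of an assembly."""
    contigs = total_bp = gc = acgt = 0
    for _, sequence in read_fasta(handle):
        sequence = sequence.upper()
        contigs += 1
        total_bp += len(sequence)
        gc += sequence.count("G") + sequence.count("C")
        acgt += sum(sequence.count(base) for base in "ACGT")
    gc_percent = 100 * gc / acgt if acgt else 0.0
    return {"contigs": contigs, "total_bp": total_bp, "gc_percent": gc_percent}


# Tiny in-memory example; with real data, pass open("path/to/genome.fasta").
demo = io.StringIO(">contig_1\nACGTNN\nGGCC\n>contig_2\nATAT\n")
print(assembly_stats(demo))
# {'contigs': 2, 'total_bp': 14, 'gc_percent': 50.0}
```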
Experimental Workflow
A schematic overview of the experimental design and bioinformatic workflow is provided in Supplementary_Figure_S1.pdf. The figure illustrates shipworm cultivation, gill dissection, DNA extraction, metagenomic sequencing, read processing, genome assembly, binning, quality control, taxonomic assignment, and downstream comparative analyses.
Software Requirements
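Any standard text editor or bioinformatics software can open the FASTA and TXT files. The XLSX files require a spreadsheet application, and the PDF and DOCX files require a PDF viewer and a word processor, respectively.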
Changes after Dec 22, 2025: Additional data were added to the Supplementary_Table_S2_2026-01-25 PDF and TXT files in response to the manuscript review.
Changes on Feb 12, 2026:
Additional data were added to the Supplementary_Table_S2 PDF and TXT files in response to the manuscript review. The font, design, and text of Supplementary_Table_S1 were also revised in response to the manuscript review.
