Metagenomic and genomic data of symbiotic bacteria from wood-boring shipworms Lyrodus pedicellatus and Teredo bartschi
Data files
Dec 22, 2025 version; files total 120.50 MB
- README.md (9.57 KB)
- Suplementary_Table_S3_2025-09-30.pdf (86.31 KB)
- Suplementary_Table_S3_2025-09-30.txt (7.34 KB)
- Supplementary_data_description.docx (14.32 KB)
- Supplementary_data_files.zip (120.24 MB)
- Supplementary_Table_S1_2025-09-29.pdf (45.99 KB)
- Supplementary_Table_S1_2025-09-29.txt (1.39 KB)
- Supplementary_Table_S2_2025-09-29.pdf (93.04 KB)
- Supplementary_Table_S2_2025-09-29.txt (2.06 KB)
Abstract
Shipworms (Bivalvia: Teredinidae) are the most prolific wood consumers in marine environments. These wormlike marine bivalves digest wood using carbohydrate-active enzymes (CAZymes) produced by intracellular bacterial endosymbionts housed within their gills. Although several shipworm species are known to host multiple co-occurring symbiont species, the factors that influence symbiont community assembly, including the phylogenetic identity and metabolic capabilities of the symbionts, remain poorly understood. We sequenced gill symbiont metagenomes from multiple specimens of two shipworm species, Teredo bartschi (22 specimens) and Lyrodus pedicellatus (14 specimens), which were reared together in laboratory co-culture. From these metagenomes, we assembled 90 metagenome-assembled genomes (MAGs) representing seven distinct symbiont species. The metagenome of each host specimen contained between 1 and 5 symbiont species, with each including at least one nitrogen-fixing symbiont. Six of the seven identified symbiont species were found in both host species, demonstrating a lack of host species specificity in these symbioses. We identified patterns of symbiont occurrence and co-occurrence in these two hosts and used these to constrain the core set of CAZyme and nitrogen fixation gene classes necessary to support host survival. Our results indicate that, in these two host species, symbiont community composition reflects the symbionts' capabilities for carbohydrate degradation and nitrogen fixation, rather than strict species-specific mechanisms of host and symbiont sorting.
This dataset supports experimental investigations into the microbiomes of wood-boring shipworms (Lyrodus pedicellatus and Teredo bartschi). Shipworm specimens were collected between 2023 and 2025 and processed for metagenomic sequencing. The dataset includes sample metadata, mitochondrial DNA–based host identification, genome quality metrics for bacterial symbiont isolates and metagenome-assembled genomes (MAGs), pairwise genome similarity metrics, predicted carbohydrate-active enzyme (CAZyme) functions, and genome assemblies in FASTA format.
All supplementary files are intended to support interpretation, reuse, and reanalysis of the data presented in the associated publication.
File Inventory
Supplementary_data_files.zip: Compressed archive containing three Excel datasets and a directory of FASTA genome assemblies.
Contents
- CheckM_isolates_results.xlsx
- CheckM_MAGs_results.xlsx
- ANI_AF_results.xlsx
- Isolates_and_MAGs_fasta/ (directory of FASTA files)
Supplementary_Table_S1_2025-09-29.pdf: Sample metadata for all shipworm specimens.
Supplementary_Table_S2_2025-09-29.pdf: Host species identification based on mitochondrial DNA recovered from metagenomes.
Suplementary_Table_S3_2025-09-30.pdf: Core CAZyme module subfamilies with predicted substrates and enzymatic activities.
Additional machine-readable table files (TXT format)
Supplementary_Table_S1_2025-09-29.txt: Plain-text version of Supplementary Table S1 containing sample metadata.
Supplementary_Table_S2_2025-09-29.txt: Plain-text version of Supplementary Table S2 containing mitochondrial DNA–based host identification results.
Suplementary_Table_S3_2025-09-30.txt: Plain-text version of Supplementary Table S3 containing core CAZyme module subfamilies and predicted functions.
Supplementary_Figure_S1.pdf: Schematic representation of the experimental design and bioinformatic workflow.
Supplementary_data_description.docx: Narrative description of the experimental context and purpose of the dataset.
NOTE ON DUPLICATE CONTENT (PDF AND TXT FORMATS)
Supplementary Tables S1, S2, and S3 are each provided in both PDF and TXT formats. The PDF files are intended for reading and visual inspection; the TXT files are machine-readable versions for computational reuse. Both formats contain identical data, so users may choose whichever suits their intended use.
Detailed Description of Files and Variables
Supplementary_Table_S1_2025-09-29.pdf and Supplementary_Table_S1_2025-09-29.txt: Sample metadata table
Variables:
- Sample ID: Unique identifier for each shipworm specimen. Type: alphanumeric string.
- Host species: Morphological identification of the host species. Values: Lyrodus pedicellatus or Teredo bartschi.
- Collection and processing date: Date of specimen collection and processing. Format: dd/mm/yyyy.
Missing values: No missing values reported.
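Because dd/mm/yyyy dates are easily misread as mm/dd/yyyy by tools with a different default, it can help to parse them with an explicit format. The following is a minimal sketch; the function name `parse_collection_date` and the example date are illustrative and are not taken from the table.

```python
from datetime import date, datetime


def parse_collection_date(text: str) -> date:
    """Parse a dd/mm/yyyy string, as described for Supplementary Table S1."""
    # An explicit format string avoids silent day/month swaps.
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


# Illustrative value only: 3 April 2024, not 4 March 2024.
example = parse_collection_date("03/04/2024")
print(example.isoformat())
```

Converting to ISO 8601 (yyyy-mm-dd) on import makes the dates sort correctly as plain text.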
Supplementary_Table_S2_2025-09-29.pdf and Supplementary_Table_S2_2025-09-29.txt: Host species identification based on mitochondrial DNA
Variables:
- Sample ID: Unique identifier for each sample. Type: alphanumeric string.
- Morphological identification: Host species identification based on morphology. Values: Lyrodus pedicellatus or Teredo bartschi.
- Comparison to GenBank reference mitogenome: L. pedicellatus (OM910820). Pairwise nucleotide identity and recovered fraction of the mitochondrial genome relative to the L. pedicellatus reference. Format: identity/recovered portion, reported as proportions between 0 and 1.
- Comparison to GenBank reference mitogenome: T. bartschi (OM910823). Pairwise nucleotide identity and recovered fraction of the mitochondrial genome relative to the T. bartschi reference. Format: identity/recovered portion, reported as proportions between 0 and 1.
Interpretation key: Values approaching 1 indicate high sequence identity or near-complete mitochondrial genome recovery. Values reported as 0/0 indicate that no mitochondrial DNA alignment was recovered for that reference.
Missing values: Reported as 0/0 where no mitochondrial sequence was recovered.
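The identity/recovered portion cells described above can be split into two numbers, with 0/0 treated as missing. This is a hedged sketch rather than the authors' procedure: the function names are hypothetical, the example cell values are invented, and picking the reference with the higher identity is only one simple way to read the table.

```python
from typing import Optional, Tuple


def parse_identity_recovery(cell: str) -> Optional[Tuple[float, float]]:
    """Split an 'identity/recovered portion' cell into two floats.

    Returns None for the 0/0 placeholder, which the table uses when no
    mitochondrial alignment was recovered for that reference.
    """
    identity_text, recovered_text = cell.strip().split("/")
    identity, recovered = float(identity_text), float(recovered_text)
    if identity == 0.0 and recovered == 0.0:
        return None
    for value in (identity, recovered):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"expected a proportion between 0 and 1, got {value}")
    return identity, recovered


def best_reference(lped_cell: str, tbar_cell: str) -> Optional[str]:
    """Name the reference with the higher identity, or None if neither aligned."""
    candidates = {
        "Lyrodus pedicellatus": parse_identity_recovery(lped_cell),
        "Teredo bartschi": parse_identity_recovery(tbar_cell),
    }
    scored = {name: pair[0] for name, pair in candidates.items() if pair is not None}
    if not scored:
        return None
    return max(scored, key=scored.get)


# Invented cell values, for illustration only.
print(best_reference("0.99/0.97", "0/0"))
```

Returning None instead of (0.0, 0.0) keeps missing alignments from being mistaken for a measured identity of zero.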
Suplementary_Table_S3_2025-09-30.pdf and Suplementary_Table_S3_2025-09-30.txt: Core CAZyme module subfamilies and predicted functions
Variables:
- Subfamily name: Identifier for the CAZyme subfamily. Format: CAZy or dbCAN subfamily nomenclature.
- Predicted substrate: Primary carbohydrate or polymer targeted by the enzyme module.
- Lignocellulose component: Structural component of lignocellulose associated with the substrate. Values: cellulose, hemicellulose, lignin.
- Predicted activity (EC): Enzyme Commission number describing the predicted catalytic activity. Format: EC number or NA.
- Notes: Additional functional or biochemical details.
- Reference source: Database supporting the annotation. Values include CAZy, BRENDA, and dbCAN.
Missing values: NA indicates cases where no EC number is assigned, typically for binding-only modules or unclassified activities.
CheckM_isolates_results.xlsx: Genome quality assessment of symbiont isolate assemblies
Sheet: CheckM_isolates_results
Variables:
- Bin Id: Isolate genome identifier.
- Marker lineage: CheckM marker lineage assignment.
- Genomes: Number of genomes used in marker set definition. Units: count.
- Markers: Number of markers in the lineage set. Units: count.
- Marker sets: Number of marker sets. Units: count.
- Completeness: Estimated genome completeness. Units: percent (%).
- Contamination: Estimated genome contamination. Units: percent (%).
- Strain heterogeneity: Estimated strain heterogeneity. Units: percent (%).
- Genome size (bp): Assembly size. Units: base pairs (bp).
- Ambiguous bases: Number of ambiguous bases. Units: count.
- Scaffolds: Number of scaffolds. Units: count.
- Contigs: Number of contigs. Units: count.
- N50 (scaffolds): N50 scaffold length. Units: bp.
- N50 (contigs): N50 contig length. Units: bp.
- Mean scaffold length (bp): Mean scaffold length. Units: bp.
- Mean contig length (bp): Mean contig length. Units: bp.
- Longest scaffold (bp): Maximum scaffold length. Units: bp.
- Longest contig (bp): Maximum contig length. Units: bp.
- GC: GC content. Units: percent (%).
- GC std (scaffolds >1 kbp): Standard deviation of GC content among scaffolds longer than 1 kbp. Units: percent (%).
- Coding density: Coding density. Units: percent (%).
- Translation table: Genetic code translation table identifier. Units: integer.
- Predicted genes: Number of predicted genes. Units: count.
- Marker occurrence distribution: Counts of markers occurring 0, 1, 2, 3, 4, or 5+ times. Units: count.
Missing values: Blank or NA-like entries where CheckM metrics could not be computed.
CheckM_MAGs_results.xlsx: Genome quality assessment of metagenome-assembled genomes (MAGs)
Sheet: CheckM_MAGs_results
Variables:
- Sample: Sample identifier.
- Species number: Numeric species or group identifier.
- Host: Host species label.
- Completeness: Estimated genome completeness. Units: percent (%).
- Contamination: Estimated genome contamination. Units: percent (%).
- Strain heterogeneity: Estimated strain heterogeneity. Units: percent (%).
- Genome size (bp): Assembly size. Units: bp.
- Contigs: Number of contigs. Units: count.
- N50 (contigs): N50 contig length. Units: bp.
- Mean contig length (bp): Mean contig length. Units: bp.
- GC: GC content. Units: percent (%).
- Marker occurrence distribution: Counts of markers occurring 0, 1, 2, 3, 4, or 5+ times. Units: count.
- Closest isolate or MAG: Identifier of the closest comparison genome.
- ANI (1->2): Average Nucleotide Identity for the reported comparison. Units: percent (%).
- AF (1->2): Alignment fraction for the reported comparison. Units: percent (%).
Missing values: Blank or NA-like entries where metrics could not be computed.
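When reusing the Completeness and Contamination columns, blank or NA-like entries need to be handled before any numeric comparison. The sketch below assumes each row has already been loaded into a dictionary keyed by column name. The tier cut-offs are an assumption borrowed from the commonly cited MIMAG completeness and contamination thresholds, not a classification used in this dataset, and the full MIMAG high-quality category also requires rRNA and tRNA genes, which this table does not report.

```python
from typing import Mapping, Optional


def to_percent(value: object) -> Optional[float]:
    """Convert a cell to float, treating blank or NA-like entries as missing."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text.upper() in {"NA", "N/A", "NAN"}:
        return None
    return float(text)


def quality_tier(row: Mapping[str, object]) -> str:
    """Assign a rough tier from the two CheckM percentages (assumed cut-offs)."""
    completeness = to_percent(row.get("Completeness"))
    contamination = to_percent(row.get("Contamination"))
    if completeness is None or contamination is None:
        return "unknown"
    if completeness > 90.0 and contamination < 5.0:
        return "high"
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    return "low"


# Invented row, for illustration only.
print(quality_tier({"Completeness": 98.2, "Contamination": 1.1}))
```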
ANI_AF_results.xlsx: Pairwise genome similarity metrics
Sheet: Index
Purpose: Interpretation key for the ANI_AF matrix.
Contents: Text indicating that ANI values are reported above the diagonal and alignment fraction values are reported below the diagonal.
Sheet: ANI_AF
Purpose: Square matrix of pairwise genome similarity metrics.
Structure and variables
- Row labels: Genome IDs.
- Column labels: Genome IDs.
Matrix values:
- Above diagonal: Average Nucleotide Identity (ANI). Units: percent (%).
- Below diagonal: Alignment fraction (AF). Units: percent (%).
- Diagonal: Self-comparisons, typically 100 percent.
Missing values: Blank cells or NA-like entries where ANI or AF could not be computed.
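Since one square matrix holds two different metrics, it is usually easier to work with after splitting it into per-pair values. The following sketch assumes the sheet has been loaded into a list of genome IDs and a list of rows in the same order, with missing cells as None. The function name, IDs, and numbers are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[Optional[float]]]
PairMetrics = Dict[Tuple[str, str], Tuple[Optional[float], Optional[float]]]


def split_ani_af(ids: List[str], matrix: Matrix) -> PairMetrics:
    """Return {(id_a, id_b): (ani, af)} for every unordered genome pair.

    For i < j, matrix[i][j] (above the diagonal) is read as ANI and
    matrix[j][i] (below the diagonal) is read as AF.
    """
    n = len(ids)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the ID list")
    pairs: PairMetrics = {}
    for i in range(n):
        for j in range(i + 1, n):
            pairs[(ids[i], ids[j])] = (matrix[i][j], matrix[j][i])
    return pairs


# Invented three-genome matrix; None marks a cell that could not be computed.
example_ids = ["genome_A", "genome_B", "genome_C"]
example_matrix = [
    [100.0, 98.5, 80.1],
    [91.0, 100.0, None],
    [40.2, None, 100.0],
]
for pair, (ani, af) in split_ani_af(example_ids, example_matrix).items():
    print(pair, "ANI:", ani, "AF:", af)
```

Reading the upper and lower triangles explicitly guards against treating the matrix as symmetric, which it is not.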
Isolates_and_MAGs_fasta: FASTA genome assemblies
Description: Directory containing nucleotide genome assemblies of symbiont isolates and metagenome-assembled genomes.
Variables:
- FASTA header: Assembly or contig identifier.
- Sequence: Nucleotide sequence composed of standard IUPAC DNA characters.
Missing values: Not applicable.
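As a quick sanity check on the assemblies, the FASTA files can be parsed with the standard library alone, checked for characters outside the IUPAC DNA alphabet, and summarized by N50. This is a minimal sketch with hypothetical function names and an invented toy input. It assumes headers are unique within a file, and dedicated tools such as CheckM compute these statistics more thoroughly.

```python
from typing import Dict, Iterable, List, Set

IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; headers keep the text after '>'."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}


def non_iupac_characters(sequence: str) -> Set[str]:
    """Return any characters that are not standard IUPAC DNA codes."""
    return set(sequence) - IUPAC_DNA


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy input, for illustration only.
toy = [">contig_1", "ACGT", "ACGT", ">contig_2", "NNRY", ">contig_3", "AC"]
toy_records = read_fasta(toy)
print(n50([len(seq) for seq in toy_records.values()]))
```

For a real file, passing an open file handle to `read_fasta` works because the function only iterates over lines.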
Experimental Workflow: A schematic overview of the experimental design and bioinformatic workflow is provided in Supplementary_Figure_S1.pdf. The figure illustrates shipworm cultivation, gill dissection, DNA extraction, metagenomic sequencing, read processing, genome assembly, binning, quality control, taxonomic assignment, and downstream comparative analyses.
Software Requirements
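Any standard text editor or bioinformatics software for the FASTA and TXT files; a spreadsheet program that reads XLSX for the CheckM and ANI/AF tables; a PDF viewer and a DOCX-compatible word processor for the remaining files.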
