Data from: Deep learning reveals hidden diversity of Synechococcus in the coastal water of China: Novel clades and their ecological insights

Wang, Yanhui1; Xu, Jinxin1; Chen, Feng2; Sun, Yanni3; Liu, Lu1; Chen, Jiaxin1; Wang, Xiaomeng1; Hu, Yuxing1; He, Bowen1; Li, Yunxuan1; Zheng, Qiang1

Published Oct 21, 2025 on Dryad. https://doi.org/10.5061/dryad.ksn02v7jg

Data files

Oct 21, 2025 version files, 94.22 MB:

ChinaCoastalWater_ITS_Dataset.zip: 7.15 MB
README.md: 8.19 KB
Supplementary_Data.zip: 210.50 KB
Syn_Tool.zip: 86.85 MB

Abstract

Synechococcus is ubiquitous and diverse in marine environments and contributes significantly to primary productivity in the ocean. The genetic diversity of the genus Synechococcus has been extensively explored based on the 16S-23S rRNA internal transcribed spacer (ITS) region. However, accurate identification of Synechococcus ITS from large sequencing datasets is challenging due to the absence of a standardized taxonomy and ambiguous clade boundaries. To address these limitations, we developed Syn_Tool, a deep learning-based framework integrating a curated Synechococcus ITS database for sequence identification, classification, and novel clade discovery. Analyzing 1,087,323 ITS sequences from the coastal water of China—the largest Synechococcus dataset to date—Syn_Tool classified them into 42 clades, including 28 known and 14 newly defined clades. Biogeographic analyses revealed a latitudinal diversity gradient driven by temperature, with 12 newly defined clades (clades CSII-IV, CSVI-XIV) primarily found in estuarine regions where rapid diversification may promote the emergence of novel genotypes. This study demonstrates the application of deep learning in classifying Synechococcus and understanding their ecological roles in dynamic marine ecosystems.

Dataset DOI: 10.5061/dryad.ksn02v7jg

Description of the data and file structure

This dataset contains three main components generated in the study:

Supplementary_Data
ChinaCoastalWater_ITS_Dataset
Syn_Tool

Files and variables

File: Supplementary_Data.zip

Description:

Data S1–S4 for this study, including the list of references for the collected Synechococcus ITS sequences (Data S1, Supplementary Data 1.csv), the starting dataset for Synechococcus ITS (Data S2, Supplementary Data 2.csv), the Syn_ITS database (Data S3, Supplementary Data 3.csv), and the sampling station information (Data S4, Supplementary Data 4.csv).

Supplementary Data 1.csv

Title: Title of the reference from which the Synechococcus ITS sequence was collected
Authors: Reference authors
Year: Publication year
Journal: Journal name
DOI: Digital Object Identifier of the reference. “n/a” indicates that no DOI is available

Supplementary Data 2.csv

Accession_number: NCBI (National Center for Biotechnology Information) accession number for each Synechococcus Internal Transcribed Spacer (ITS) sequence
Label: Custom label for each sequence
Clade: Clade assigned in this study
Sequence: ITS DNA sequence
Strain_or_Clone: Strain or clone name of the organism from which the sequence was obtained
Region: Genomic region covered by the sequence
- 16S: The 16S ribosomal RNA gene
- ITS: Internal Transcribed Spacer
- 23S: The 23S ribosomal RNA gene
Clade_in_origin_reference: Clade assignment reported in the original reference. “--” indicates that no clade assignment was provided in the original reference
NCBI_taxonomy: NCBI taxonomic classification
Origin_reference: Source reference for the sequence

Supplementary Data 3.csv

Accession_number: NCBI (National Center for Biotechnology Information) accession number for each Synechococcus Internal Transcribed Spacer (ITS) sequence
Label: Custom label for each sequence
Clade: Clade assigned in this study
Sequence: ITS DNA sequence
Strain_or_Clone: Strain or clone name of the organism from which the sequence was obtained
Region: Genomic region covered by the sequence
- 16S: The 16S ribosomal RNA gene
- ITS: Internal Transcribed Spacer
- 23S: The 23S ribosomal RNA gene
Clade_in_Origin_Reference: Clade assignment reported in the original reference. “--” indicates that no clade assignment was provided in the original reference
NCBI_taxonomy: NCBI taxonomic classification
Origin_Reference: Source reference for the sequence

Supplementary Data 4.csv

Station: Sampling station ID
Sea Area: Name of the sea area where the station is located
Latitude: Latitude of the station in decimal degrees
Longitude: Longitude of the station in decimal degrees
Bottom Depth [m]: Water depth at the station (meters). “NA” indicates that the bottom depth was not available
Sample Depth [m]: Depth at which the sample was collected (meters)
Date [yyyymm]: Sampling date in year and month (YYYYMM)
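
The station table can be read with Python's standard csv module. The following is a minimal sketch, assuming the file is comma-separated and that its header row matches the column names listed above exactly. The function names (`parse_depth`, `parse_yyyymm`, `read_stations`) and the demo row are hypothetical and are not taken from the dataset.

```python
import csv
import io
from typing import Optional, Tuple


def parse_depth(value: str) -> Optional[float]:
    """Return a depth in meters, or None when the field is "NA"."""
    value = value.strip()
    return None if value == "NA" else float(value)


def parse_yyyymm(value: str) -> Tuple[int, int]:
    """Split a YYYYMM string into (year, month)."""
    value = value.strip()
    return int(value[:4]), int(value[4:6])


def read_stations(handle):
    """Yield one dict per station row with typed fields."""
    for row in csv.DictReader(handle):
        year, month = parse_yyyymm(row["Date [yyyymm]"])
        yield {
            "station": row["Station"],
            "sea_area": row["Sea Area"],
            "latitude": float(row["Latitude"]),
            "longitude": float(row["Longitude"]),
            "bottom_depth_m": parse_depth(row["Bottom Depth [m]"]),
            "sample_depth_m": float(row["Sample Depth [m]"]),
            "year": year,
            "month": month,
        }


# Made-up row for illustration only; not a real station.
demo = io.StringIO(
    "Station,Sea Area,Latitude,Longitude,"
    "Bottom Depth [m],Sample Depth [m],Date [yyyymm]\n"
    "X1,Demo Sea,30.5,122.25,NA,5,201907\n"
)
stations = list(read_stations(demo))
print(stations[0])
```

To read the real file, replace the `io.StringIO` demo with `open("Supplementary Data 4.csv", newline="")`.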

File: ChinaCoastalWater_ITS_Dataset.zip

Description:

This dataset contains operational taxonomic unit (OTU) representative sequences and corresponding abundance information derived from internal transcribed spacer (ITS) sequences of Synechococcus collected from the coastal water of China.

40535_OTU_repseqs.fasta

Representative sequences of all 40,535 OTUs obtained in this study, in FASTA format.
OTUs were generated using a 97% sequence similarity threshold

40535_OTU_repseqs_abundance.txt

Abundance file corresponding to the 40,535 OTUs
Column 1 is the sequence ID, matching the headers in 40535_OTU_repseqs.fasta
Column 2 is the sequence abundance, representing the number of reads for each OTU

22980_OTU_repseqs_Syn.fasta

Representative sequences of 22,980 Synechococcus ITS OTUs in FASTA format
Sequences were identified using the Syn_Tool software

646_OTU_repseqs_Novel_Syn.fasta

Representative sequences of 646 OTUs assigned to 14 newly defined Synechococcus clades in FASTA format
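
Because the first column of the abundance file is described as matching the FASTA headers, the two files can be cross-checked before analysis. The sketch below is one way to do this, under two assumptions that the description does not confirm: the abundance file is whitespace-separated (tab or space) and has no header row. All function names and the demo records are hypothetical.

```python
import io


def read_fasta(handle):
    """Return {sequence_id: sequence} from a FASTA stream."""
    records = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            seq_id = line[1:].split()[0]
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def read_abundance(handle):
    """Return {sequence_id: read_count}.

    Assumes whitespace-separated columns and no header row.
    """
    counts = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def compare_ids(sequences, counts):
    """Return (IDs only in the FASTA, IDs only in the abundance file)."""
    only_fasta = sorted(set(sequences) - set(counts))
    only_abundance = sorted(set(counts) - set(sequences))
    return only_fasta, only_abundance


# Made-up records for illustration only.
demo_fasta = io.StringIO(">otu_a\nACGT\nACGT\n>otu_b\nGGCC\n")
demo_abundance = io.StringIO("otu_a\t10\notu_b\t3\notu_c\t1\n")

sequences = read_fasta(demo_fasta)
counts = read_abundance(demo_abundance)
only_fasta, only_abundance = compare_ids(sequences, counts)
total_reads = sum(counts.values())
print(only_fasta, only_abundance, total_reads)
```

For the real files, pass `open("40535_OTU_repseqs.fasta")` and `open("40535_OTU_repseqs_abundance.txt")` in place of the demo streams. Two empty lists mean the IDs agree.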

File: Syn_Tool.zip

Description:

This zip file contains the Syn_Tool software package developed in this study for identifying, classifying, and analyzing Synechococcus ITS sequences. It includes the following files and folders:

model/: Models and related files used by Syn_Tool

CNN_model.h5: Pre-trained CNN (Convolutional Neural Network) model for sequence embedding of ITS sequences
CNN_kmer_tokenizer.pkl: Tokenizer for k-mer representation used by the CNN model
CNN_label_encoder.pkl: Label encoder for CNN class labels
Transformer_model.h5: Pre-trained Transformer model for ITS classification
Transformer_BPE_tokenizer.1024.json: BPE (Byte-Pair Encoding) tokenizer used by the Transformer model
Transformer_layers.py: Python script defining the Transformer model layers
Syn_ITS_database.txt: Synechococcus ITS reference database for Syn_Tool
- Column 1: Sequence
- Column 2: Clade assignment
- Column 3: NCBI accession number
Picocyanobacteria_ITS_database.fasta: Picocyanobacteria ITS reference sequences for Syn_Tool
Picocyanobacteria_ITS_database.tax: Taxonomy file for the Picocyanobacteria ITS reference sequences
- Column 1: NCBI accession number
- Column 2: Clade assignment. All entries are “Pico”, representing Picocyanobacteria
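
The three-column layout of Syn_ITS_database.txt (sequence, clade assignment, NCBI accession number) lends itself to a quick per-clade summary. The sketch below assumes the columns are tab-separated and that there is no header row; neither is stated in the description, so adjust `delimiter` if the file differs. The names `ItsEntry` and `read_its_database`, and the demo rows, are hypothetical.

```python
import csv
import io
from collections import Counter, namedtuple

# Field order follows the column description: sequence, clade, accession.
ItsEntry = namedtuple("ItsEntry", ["sequence", "clade", "accession"])


def read_its_database(handle, delimiter="\t"):
    """Return a list of ItsEntry records from a three-column text stream."""
    entries = []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        entries.append(ItsEntry(row[0], row[1], row[2]))
    return entries


# Made-up rows for illustration only; clade names and accessions are not real.
demo = io.StringIO(
    "ACGTACGT\tcladeA\tACC0001\n"
    "ACGTTTGT\tcladeA\tACC0002\n"
    "GGGTACGT\tcladeB\tACC0003\n"
)
entries = read_its_database(demo)
clade_sizes = Counter(entry.clade for entry in entries)
print(clade_sizes.most_common())
```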

scripts/: Scripts for running Syn_Tool

Sequence_QC.batch: Batch script for performing quality control on input ITS sequences before analysis
Syn_Tool_run.py: Main Python script for running the Syn_Tool workflow

example/: Example input data

example_abundance.fasta: Example FASTA file containing representative ITS sequences
example_abundance.txt: Example OTU abundance file corresponding to the sequences

README.md: Instructions and usage guidelines for the software

requirements.txt: List of Python packages and dependencies required to run Syn_Tool

Code/software

Syn_Tool

Syn_Tool is a deep learning framework designed for analyzing Synechococcus ITS sequences. It integrates identification, classification, and novel clade delineation into a single streamlined workflow. With minimal setup, you can process sequence data and abundance information to gain insights into Synechococcus communities.

Installation

Requirements

Python Version: Python 3.8 or later
Dependencies: Listed in requirements.txt

Steps

Clone the repository:

git clone https://github.com/Aldred-Wang/Syn_Tool.git
cd Syn_Tool

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Before running the tool, ensure that the following files are placed inside the model folder:

Syn_Tool_run.py
input_sequences.fasta
input_abundance.txt

Run the following command to start the tool:

python Syn_Tool_run.py -fa input_sequences.fasta -a input_abundance.txt

Parameters

-fa or --fasta: Path to the input FASTA file.
-a or --abundance: Path to the abundance file.
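
When several samples are processed in turn, the command can be assembled programmatically. This is a minimal sketch: the helper name `build_command` is hypothetical, and only the script name and the `-fa` and `-a` flags come from the usage text above.

```python
import sys


def build_command(fasta_path, abundance_path, script="Syn_Tool_run.py"):
    """Return the argument list for one Syn_Tool run."""
    return [
        sys.executable,  # the Python interpreter currently in use
        script,
        "-fa", str(fasta_path),
        "-a", str(abundance_path),
    ]


cmd = build_command("input_sequences.fasta", "input_abundance.txt")
print(" ".join(cmd))
# To execute the run: import subprocess; subprocess.run(cmd, check=True)
```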

Example

Here’s an example of how to use Syn_Tool:

python Syn_Tool_run.py -fa example_sequences.fasta -a example_abundance.txt

Output

After execution, the following files will be generated:

Syn.fasta: Output file from the identification module containing Synechococcus ITS sequences in FASTA format.
Syn_df.csv: Output file from the identification module containing Synechococcus ITS sequences in CSV format.
combined_df.csv: Output file from the classification module containing feature vectors of Synechococcus ITS sequences extracted by the CNN model.
result.csv: Output file from the novel clade delineation module providing information about Synechococcus novel clades.
Syn_Tool_final_result.txt: Final classification file generated by Syn_Tool.
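
A short check can confirm that a run produced all five files listed above. This sketch assumes the outputs are written to a single directory, which the description does not specify. The helper name `missing_outputs` is hypothetical; the file names are those listed above.

```python
import tempfile
from pathlib import Path

# Output file names as listed in the Output section.
EXPECTED_OUTPUTS = (
    "Syn.fasta",
    "Syn_df.csv",
    "combined_df.csv",
    "result.csv",
    "Syn_Tool_final_result.txt",
)


def missing_outputs(directory="."):
    """Return the expected output files that are absent from `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


# Demo in a throwaway directory: create only one of the five files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Syn.fasta").write_text(">demo\nACGT\n")
    still_missing = missing_outputs(tmp)
print(still_missing)
```

An empty list from `missing_outputs` means every expected file is present.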
