Data from: Deep learning reveals hidden diversity of Synechococcus in the coastal water of China: Novel clades and their ecological insights
Data files
Oct 21, 2025 version files 94.22 MB
- ChinaCoastalWater_ITS_Dataset.zip (7.15 MB)
- README.md (8.19 KB)
- Supplementary_Data.zip (210.50 KB)
- Syn_Tool.zip (86.85 MB)
Abstract
Synechococcus is ubiquitous and diverse in marine environments and contributes significantly to primary productivity in the ocean. The genetic diversity of the genus Synechococcus has been extensively explored based on the 16S-23S rRNA internal transcribed spacer (ITS) region. However, accurate identification of Synechococcus ITS from large sequencing datasets is challenging due to the absence of a standardized taxonomy and ambiguous clade boundaries. To address these limitations, we developed Syn_Tool, a deep learning-based framework integrating a curated Synechococcus ITS database for sequence identification, classification, and novel clade discovery. Analyzing 1,087,323 ITS sequences from the coastal water of China—the largest Synechococcus dataset to date—Syn_Tool classified them into 42 clades, including 28 known and 14 newly defined clades. Biogeographic analyses revealed a latitudinal diversity gradient driven by temperature, with 12 newly defined clades (clades CSII-IV, CSVI-XIV) primarily found in estuarine regions where rapid diversification may promote the emergence of novel genotypes. This study demonstrates the application of deep learning in classifying Synechococcus and understanding their ecological roles in dynamic marine ecosystems.
Dataset DOI: 10.5061/dryad.ksn02v7jg
Description of the data and file structure
This dataset contains three main components generated in the study:
- Supplementary_Data
- ChinaCoastalWater_ITS_Dataset
- Syn_Tool
Files and variables
File: Supplementary_Data.zip
Description:
Data S1–S4 for this study, including the list of references for the collected Synechococcus ITS sequences (Data S1, Supplementary Data 1.csv), the starting dataset for Synechococcus ITS (Data S2, Supplementary Data 2.csv), the Syn_ITS database (Data S3, Supplementary Data 3.csv), and the sampling station information (Data S4, Supplementary Data 4.csv).
Supplementary Data 1.csv
- Title: Title of the reference from which the Synechococcus ITS sequence was collected
- Authors: Reference authors
- Year: Publication year
- Journal: Journal name
- DOI: Digital Object Identifier of the reference. “n/a” indicates that no DOI is available
Supplementary Data 2.csv
- Accession_number: NCBI (National Center for Biotechnology Information) accession number for each Synechococcus Internal Transcribed Spacer (ITS) sequence
- Label: Custom label for each sequence
- Clade: Clade assigned in this study
- Sequence: ITS DNA sequence
- Strain_or_Clone: Strain or clone name of the organism from which the sequence was obtained
- Region: Genomic region covered by the sequence
- 16S: The 16S ribosomal RNA gene
- ITS: Internal Transcribed Spacer
- 23S: The 23S ribosomal RNA gene
- Clade_in_origin_reference: Clade assignment reported in the original reference. “--” indicates that no clade assignment was provided in the original reference
- NCBI_taxonomy: NCBI taxonomic classification
- Origin_reference: Source reference for the sequence
Supplementary Data 3.csv
- Accession_number: NCBI (National Center for Biotechnology Information) accession number for each Synechococcus Internal Transcribed Spacer (ITS) sequence
- Label: Custom label for each sequence
- Clade: Clade assigned in this study
- Sequence: ITS DNA sequence
- Strain_or_Clone: Strain or clone name of the organism from which the sequence was obtained
- Region: Genomic region covered by the sequence
- 16S: The 16S ribosomal RNA gene
- ITS: Internal Transcribed Spacer
- 23S: The 23S ribosomal RNA gene
- Clade_in_Origin_Reference: Clade assignment reported in the original reference. “--” indicates that no clade assignment was provided in the original reference
- NCBI_taxonomy: NCBI taxonomic classification
- Origin_Reference: Source reference for the sequence
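As a quick check on the database table, the sequences per clade can be tallied from the Clade column. The sketch below is a minimal illustration, assuming the file is comma-delimited with a header row that matches the column names listed above; the function name `clade_counts` and the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

def clade_counts(handle):
    """Count rows per value of the 'Clade' column in a CSV with a header row."""
    reader = csv.DictReader(handle)
    return Counter(row["Clade"] for row in reader)

# In-memory stand-in for "Supplementary Data 3.csv" (invented rows, not real data).
sample = io.StringIO(
    "Accession_number,Label,Clade,Sequence\n"
    "ACC0001,seq_a,I,ACGTACGT\n"
    "ACC0002,seq_b,I,ACGTTCGT\n"
    "ACC0003,seq_c,II,ACGAACGT\n"
)
counts = clade_counts(sample)
for clade, n in counts.most_common():
    print(clade, n)
```

To run it on the real table, pass an open file instead of the sample, for example `open("Supplementary Data 3.csv", newline="")`.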
Supplementary Data 4.csv
- Station: Sampling station ID
- Sea Area: Name of the sea area where the station is located
- Latitude: Latitude of the station in decimal degrees
- Longitude: Longitude of the station in decimal degrees
- Bottom Depth [m]: Water depth at the station (meters). “NA” indicates that the bottom depth data were not available
- Sample Depth [m]: Depth at which the sample was collected (meters)
- Date [yyyymm]: Sampling date in year and month (YYYYMM)
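Two of these columns need a little care when loaded: Bottom Depth [m] may hold “NA”, and Date [yyyymm] packs year and month into one field. The following is a minimal sketch, assuming a comma-delimited file whose header row uses exactly the column names above; `parse_station` and the sample rows are hypothetical.

```python
import csv
import io

def parse_station(row):
    """Convert one row of the station table into typed values."""
    depth = row["Bottom Depth [m]"]
    date = row["Date [yyyymm]"]
    return {
        "station": row["Station"],
        "sea_area": row["Sea Area"],
        "latitude": float(row["Latitude"]),
        "longitude": float(row["Longitude"]),
        # "NA" marks a missing bottom depth, so map it to None.
        "bottom_depth_m": None if depth == "NA" else float(depth),
        "sample_depth_m": float(row["Sample Depth [m]"]),
        # YYYYMM: first four characters are the year, last two the month.
        "year": int(date[:4]),
        "month": int(date[4:6]),
    }

# Invented rows standing in for "Supplementary Data 4.csv".
sample = io.StringIO(
    "Station,Sea Area,Latitude,Longitude,Bottom Depth [m],Sample Depth [m],Date [yyyymm]\n"
    "S01,Example Sea,30.5,122.25,45,2,201808\n"
    "S02,Example Sea,31.0,122.5,NA,5,201901\n"
)
stations = [parse_station(row) for row in csv.DictReader(sample)]
print(stations[1]["bottom_depth_m"], stations[0]["year"], stations[0]["month"])
```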
File: ChinaCoastalWater_ITS_Dataset.zip
Description:
This dataset contains operational taxonomic unit (OTU) representative sequences and corresponding abundance information derived from internal transcribed spacer (ITS) sequences of Synechococcus collected from the coastal water of China.
40535_OTU_repseqs.fasta
- Representative sequences of all 40,535 OTUs obtained in this study, in FASTA format.
- OTUs were generated using a 97% sequence similarity threshold
40535_OTU_repseqs_abundance.txt
- Abundance file corresponding to the 40,535 OTUs
- Column 1 is the sequence ID, matching the headers in 40535_OTU_repseqs.fasta
- Column 2 is the sequence abundance, representing the number of reads for each OTU
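Because the abundance file is keyed by the FASTA headers, the two files can be joined on the sequence ID. The sketch below shows one way to do that with the standard library only. It assumes the abundance columns are separated by whitespace (tab or space) and that the sequence ID is the first word of each FASTA header; neither detail is stated in the description, and the function names and sample records are invented.

```python
import io

def read_fasta(handle):
    """Return {sequence_id: sequence}; the ID is the first word after '>'."""
    records, current = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}

def read_abundance(handle):
    """Return {sequence_id: read_count}; lines without an integer count are skipped."""
    counts = {}
    for line in handle:
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts

# Invented records standing in for the FASTA and abundance files.
seqs = read_fasta(io.StringIO(">OTU_1\nACGT\nGG\n>OTU_2\nTTAA\n"))
counts = read_abundance(io.StringIO("OTU_1\t120\nOTU_2\t7\n"))

# Join on sequence ID; an OTU missing from the abundance file gets 0 reads.
joined = {seq_id: (seq, counts.get(seq_id, 0)) for seq_id, seq in seqs.items()}
total_reads = sum(n for _, n in joined.values())
print(len(joined), total_reads)
```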
22980_OTU_repseqs_Syn.fasta
- Representative sequences of 22,980 Synechococcus ITS OTUs in FASTA format
- Sequences were identified using the Syn_Tool software
646_OTU_repseqs_Novel_Syn.fasta
- Representative sequences of 646 OTUs assigned to 14 newly defined Synechococcus clades in FASTA format
File: Syn_Tool.zip
Description:
This zip file contains the Syn_Tool software package developed in this study for identifying, classifying, and analyzing Synechococcus ITS sequences. It includes the following files and folders:
model/: Models and related files used by Syn_Tool
- CNN_model.h5: Pre-trained CNN (Convolutional Neural Network) model for sequence embedding of ITS sequences
- CNN_kmer_tokenizer.pkl: Tokenizer for k-mer representation used by the CNN model
- CNN_label_encoder.pkl: Label encoder for CNN class labels
- Transformer_model.h5: Pre-trained Transformer model for ITS classification
- Transformer_BPE_tokenizer.1024.json: BPE (Byte-Pair Encoding) tokenizer used by the Transformer model
- Transformer_layers.py: Python script defining the Transformer model layers
- Syn_ITS_database.txt: Synechococcus ITS reference database for Syn_Tool
  - Column 1: Sequence
  - Column 2: Clade assignment
  - Column 3: NCBI accession number
- Picocyanobacteria_ITS_database.fasta: Picocyanobacteria ITS reference sequences for Syn_Tool
- Picocyanobacteria_ITS_database.tax: Taxonomy file for the Picocyanobacteria ITS reference sequences
  - Column 1: NCBI accession number
  - Column 2: Clade assignment. All entries are “Pico”, representing Picocyanobacteria
scripts/: Scripts for running Syn_Tool
- Sequence_QC.batch: Batch script for performing quality control on input ITS sequences before analysis
- Syn_Tool_run.py: Main Python script for running the Syn_Tool workflow
example/: Example input data
- example_abundance.fasta: Example FASTA file containing representative ITS sequences
- example_abundance.txt: Example OTU abundance file corresponding to the sequences
README.md: Instructions and usage guidelines for the software
requirements.txt: List of Python packages and dependencies required to run Syn_Tool
Code/software
Syn_Tool
Syn_Tool is a deep learning framework designed for analyzing Synechococcus ITS sequences. It integrates identification, classification, and novel clade delineation into a single streamlined workflow. With minimal setup, you can process sequence data and abundance information to gain insights into Synechococcus communities.
Installation
Requirements
- Python Version: Python 3.8 or later
- Dependencies: Listed in requirements.txt
Steps
- Clone the repository:
  git clone https://github.com/Aldred-Wang/Syn_Tool.git
  cd Syn_Tool
- Install dependencies:
  pip install -r requirements.txt
Usage
Before running the tool, ensure that the following files are placed inside the model folder:
- Syn_Tool_run.py
- input_sequences.fasta
- input_abundance.txt
Run the following command to start the tool:
python Syn_Tool_run.py -fa input_sequences.fasta -a input_abundance.txt
Parameters
- -fa or --fasta: Path to the input FASTA file.
- -a or --abundance: Path to the abundance file.
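For readers who want to see how a pair of short and long flags like these behaves, here is a hypothetical sketch using Python's standard argparse module. It is not taken from Syn_Tool_run.py, whose internals are not described here; the `build_parser` function and the choice to mark both options as required are assumptions made for illustration.

```python
import argparse

def build_parser():
    """Build a parser that accepts the two documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical CLI mirroring the documented flags"
    )
    # Assumption: both inputs are mandatory; the real script may differ.
    parser.add_argument("-fa", "--fasta", required=True,
                        help="Path to the input FASTA file")
    parser.add_argument("-a", "--abundance", required=True,
                        help="Path to the abundance file")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-fa", "input_sequences.fasta", "-a", "input_abundance.txt"]
)
print(args.fasta, args.abundance)
```

With this setup the short and long forms are interchangeable, so `--fasta input_sequences.fasta --abundance input_abundance.txt` parses to the same values.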
Example
Here’s an example of how to use Syn_Tool:
python Syn_Tool_run.py -fa example_sequences.fasta -a example_abundance.txt
Output
After execution, the following files will be generated:
- Syn.fasta: Output file from the identification module containing Synechococcus ITS sequences in FASTA format.
- Syn_df.csv: Output file from the identification module containing Synechococcus ITS sequences in CSV format.
- combined_df.csv: Output file from the classification module containing feature vectors of Synechococcus ITS sequences extracted by the CNN model.
- result.csv: Output file from the novel clade delineation module providing information about Synechococcus novel clades.
- Syn_Tool_final_result.txt: Final classification file generated by Syn_Tool.
