Taxonomic reassessment of genomes from a divergent population of Streptococcus suis by average nucleotide identity analysis
Abstract
This dataset provides genomic-level re-identification and taxonomic reassessment of Streptococcus suis out-population strains based on average nucleotide identity (ANI), 16S rRNA gene comparisons, and other phylogenetic markers. The dataset includes an Excel workbook with eight sheets (Sheets A–H), which together constitute Supplementary Table S1. This table contains detailed information on the studied strains, including their disease association, sequence alignment results (identity, coverage, and length) of 16S rRNA genes against both the S. suis type strain S735 and a database of prokaryotic type strains, ANI and average amino acid identity (AAI) values relative to S735, accession numbers for short-read and long-read data, genome assembly IDs, phylogenetic lineage within the S. suis tree, group classification within the out-population, and cluster medoid status indicating representative strains. Through comparative genomic analysis, this study established an ANI threshold of 93.17% for defining authentic S. suis, revealing that all 645 genomes from the out-population fell below this cutoff and thus did not belong to S. suis. Further pairwise ANI comparison identified 18 distinct clusters among these strains, leading to the proposal of 12 novel Streptococcus species and the reclassification of six clusters as known species. This dataset serves as a resource for researchers studying streptococcal taxonomy, zoonotic pathogens, and bacterial evolution. It enables further investigations into genomic diversity, species delineation, and the evolutionary relationships within the genus Streptococcus.
Dataset DOI: 10.5061/dryad.v41ns1s7g
Description of the data and file structure
The table contains information on the re-identification of Streptococcus suis out-population species at the genomic level. It includes disease-related information of the studied strains, as well as ANI (Average Nucleotide Identity), 16S rRNA, and other comparison data. The table consists of eight sheets, labeled A through H.
Files and variables
File: S1.xlsx
Description: Supplementary Table S1, an Excel workbook with eight sheets labeled A through H. The sheets are as follows:
- Sheet A: Basic information of Streptococcus suis outgroup strains, including 16S rRNA gene sequences and ANI (Average Nucleotide Identity) comparison results.
- Sheet B: ANI clustering results of Streptococcus suis outgroup1 strains.
- Sheet C: ANI comparison values between Streptococcus suis outgroup1 strains and the type strains of known Streptococcus species.
- Sheet D: ANI clustering results of Streptococcus suis outgroup3 strains.
- Sheet E: ANI comparison values between Streptococcus suis outgroup3 strains and the type strains of known Streptococcus species.
- Sheet F: ANI comparison values between Streptococcus suis outgroup2 strains and the type strains of known Streptococcus species.
- Sheet G: ANI clustering results of Streptococcus suis outgroup4 strains.
- Sheet H: ANI comparison values between Streptococcus suis outgroup4 strains and the type strains of known Streptococcus species.
The column names and descriptions are as follows:
Sheet A:
- Filename: The filename of the genomic data used in the analysis.
- Streptococcus suis S735: Result of extracting the complete 16S rRNA gene sequence and aligning it with Streptococcus suis S735; "unavailable" if the sequence could not be extracted; "NA" if the strain sequence is not available in NCBI.
- pIdent (column C): Sequence identity percentage of the 16S rRNA gene when compared to the Streptococcus suis type strain S735.
- qcovs (column D): Alignment coverage percentage of the 16S rRNA gene when compared to the Streptococcus suis type strain S735.
- length (column E): Length (in base pairs) of the aligned region when comparing the 16S rRNA gene to the Streptococcus suis type strain S735.
- 16S-subject: Sequence identity percentage of the 16S rRNA gene when compared to a database of prokaryotic type strains.
- pIdent (column G): Sequence identity percentage of the 16S rRNA gene when compared to the prokaryotic type strains database.
- qcovs (column H): Alignment coverage percentage of the 16S rRNA gene when compared to the prokaryotic type strains database.
- length (column I): Length (in base pairs) of the aligned region when comparing the 16S rRNA gene to the prokaryotic type strains database.
- ANI Final identification: The species identification determined by ANI comparison in this study.
- ANI with S735: Average Nucleotide Identity value when compared to the Streptococcus suis type strain S735.
- AAI with S.suis S735: Average Amino Acid Identity value when compared to the Streptococcus suis type strain S735.
- Bioproject ID (short-read data): Bioproject ID associated with the short-read sequencing data of the strain.
- Bioproject ID (long-read data): Bioproject ID associated with the long-read sequencing data of the strain.
- Genome Assembly: Accession number or identifier for the genome assembly of the strain.
- Lineage: Phylogenetic lineage of the strain within the Streptococcus suis species.
- Group: The specific outgroup cluster or group to which the strain belongs.
Sheet B:
- Species: Name of the species involved in the alignment.
- Genome: Name of the genome.
- Cluster_Medoid: Indicates whether the strain is the medoid (representative strain) of its cluster, based on computational analysis.
Sheet C:
- Query: Name of the genome involved in the alignment.
- Reference: Known type strain of the Streptococcus species that the query aligns to.
- ANI (%): The ANI value from the alignment.
The column names in Sheets D–H follow those of Sheets B and C: the ANI clustering sheets (D and G) use the Sheet B columns, and the ANI comparison sheets (E, F, and H) use the Sheet C columns.
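As an illustration of how the Query / Reference / ANI (%) layout of the comparison sheets can be used, the following minimal sketch picks the highest-ANI type strain for each query genome. It assumes the rows have already been read from the workbook into (query, reference, ani) tuples; the function name and the example rows are hypothetical and are not taken from the dataset.

```python
from typing import Dict, Iterable, Tuple


def best_reference_per_query(
    rows: Iterable[Tuple[str, str, float]]
) -> Dict[str, Tuple[str, float]]:
    """Return, for each query genome, the reference with the highest ANI.

    Each row is (query, reference, ani_percent), mirroring the
    Query / Reference / ANI (%) columns of the comparison sheets.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, reference, ani in rows:
        # Keep the row only if it beats the best ANI seen so far for this query.
        if query not in best or ani > best[query][1]:
            best[query] = (reference, ani)
    return best


# Hypothetical example rows (not real values from the dataset).
example_rows = [
    ("genome_1", "type_strain_X", 96.2),
    ("genome_1", "type_strain_Y", 88.4),
    ("genome_2", "type_strain_Y", 91.0),
]
best_hits = best_reference_per_query(example_rows)
```

The sketch only ranks the hits; whether the best hit is close enough to assign a species is a separate decision that depends on the threshold chosen.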
The reasons for unavailable or missing data ("NA") in Sheet A are as follows:
- In the columns "Streptococcus suis S735" and "16S-subject", information is missing because complete 16S rRNA gene sequences could not be extracted from the draft genomes.
- In the columns "pIdent", "qcovs", and "length", "NA" indicates that no 16S rRNA sequence was available for alignment, and therefore no corresponding alignment values could be calculated.
- The "Bioproject ID (long-read data)" column contains "NA" when no complete genome assembly or long-read data are available in the NCBI database.
- The "Genome Assembly" column contains "NA" when the corresponding genome sequence does not have a separate Assembly accession number in NCBI.
- The "Lineage" column contains "NA" for strains belonging to the Streptococcus suis out-population. Lineages were assigned only to strains within the central population in the study.
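When the numeric columns of Sheet A are processed programmatically, the "NA" and "unavailable" markers described above need to be treated as missing values rather than numbers. The following is a minimal sketch of one way to do that, assuming cell values arrive either as numbers or as strings; the helper name and the set of markers are illustrative assumptions based only on the descriptions in this README.

```python
import math
from typing import Optional

# Markers this README describes for missing data (compared case-insensitively).
MISSING_MARKERS = {"na", "unavailable", ""}


def parse_numeric_cell(value) -> Optional[float]:
    """Convert a cell to float, or None if it holds a missing-data marker."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        # Some spreadsheet readers represent empty cells as NaN.
        if isinstance(value, float) and math.isnan(value):
            return None
        return float(value)
    text = str(value).strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return float(text)
```

For example, `parse_numeric_cell("NA")` yields `None`, while a cell containing the text `" 99.5 "` is converted to the float 99.5.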
Abbreviations
ANI (Average Nucleotide Identity)
AAI (Average Amino Acid Identity)
Code/software
- Microsoft Excel (recommended) or
- LibreOffice Calc (free and open-source alternative) or
- Google Sheets (online tool).
Workflow Description:
The submitted file is an Excel workbook (S1.xlsx) containing eight sheets (labeled A through H). Each sheet includes genomic comparison data for Streptococcus suis strains, such as ANI, AAI, and 16S rRNA alignment results. No additional software, scripts, or packages are required to view the data.
Instructions for Use:
Simply download the file and open it using one of the compatible spreadsheet tools listed above. All data is self-contained within the workbook, and no further processing is necessary.
Access information
Links to Other Publicly Accessible Locations of the Data:
The data provided in this submission is original and has not been previously published or stored in any other publicly accessible location. Therefore, no additional links are available.
Source(s) of the Data:
All data were generated as part of this study and are included in full with this submission. No external sources were used in the creation of this dataset.
The analysis primarily employed ANI (Average Nucleotide Identity), 16S rRNA sequences, and AAI (Average Amino Acid Identity) comparisons to calculate relevant information for the entire Streptococcus suis out-population as well as the type strains of known species within the genus Streptococcus.
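The abstract reports an ANI threshold of 93.17% against the type strain S735 for defining authentic S. suis. The following minimal sketch shows how that cutoff could be applied to the "ANI with S735" values. It assumes that a value equal to the threshold counts as S. suis, which this README does not specify; the function name and the example values are hypothetical.

```python
# ANI cutoff (percent) reported in the abstract for authentic S. suis.
S_SUIS_ANI_THRESHOLD = 93.17


def is_authentic_s_suis(ani_with_s735: float,
                        threshold: float = S_SUIS_ANI_THRESHOLD) -> bool:
    """Return True if the ANI against S735 meets the cutoff.

    Assumption: the boundary is inclusive (ANI >= threshold).
    """
    return ani_with_s735 >= threshold


# Hypothetical ANI values (percent), not taken from the dataset.
example_ani_values = [95.0, 90.0, 86.4]
below_cutoff = [v for v in example_ani_values if not is_authentic_s_suis(v)]
```

In this example, the values 90.0 and 86.4 fall below the cutoff and would be treated as outside S. suis.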
