Re-annotated genomes for plastomes and mitogenomes
Data files
Jul 02, 2025 version files 12.09 MB
- 125_CP_Reannotation.zip (9.94 MB)
- 75_MT_Reannotation.zip (2.16 MB)
- README.md (3.34 KB)
Abstract
Diatoms are pivotal in global oxygen, carbon dioxide, and silica cycling, contributing significantly to photosynthesis and serving as fundamental components of aquatic ecosystems. Recent advances in genomic sequencing have shed light on their evolutionary dynamics, revealing evolutionarily complex genomes influenced by symbiotic relationships and horizontal gene transfer events. By analyzing 120 plastome and 70 mitogenome sequences that are publicly available, this paper aims to elucidate the evolutionary dynamics of diatoms across diverse lineages. Gene losses and pseudogenes were observed more frequently in plastomes than in mitogenomes. Overall, gene losses were abundant in the plastomes of Astrosyne radiata, Toxarium undulatum, and Proboscia sp. Frequently lost and pseudogenized genes were acpP, ilv, serC, tsf, tyrC, ycf42 and bas1. In mitogenomes, the mttB, secY and tatA genes were lost repeatedly across several diatom taxa. Analysis of nucleotide substitution rates indicated that, in general, mitogenomes were evolving more rapidly than plastomes. This contrasts with the synteny analyses, in which plastomes exhibited greater structural rearrangement than mitogenomes, with the exception of the genus Coscinodiscus and one group of species within Thalassiosira.
https://doi.org/10.5061/dryad.70rxwdc7b
Description of the data and file structure
Genome Annotation and Validation Process
Background
Genomes deposited in public databases such as NCBI GenBank often contain annotation errors or missing gene information. These inaccuracies can lead to incorrect conclusions about gene presence, loss, or gain in the studied organisms. In fact, during our analysis, every published organellar genome we examined showed some level of annotation inconsistency or incompleteness.
Problem with Existing Annotations
A common issue, particularly with diatom genomes in NCBI, is incomplete or poor annotation. This often happens because the reference genomes used during the original annotation process themselves lacked certain genes. Consequently, errors propagate and result in genomes that do not fully represent the true gene content.
Our Approach: Confirmatory Re-Annotation
To address these issues and ensure reliable genomic data, we implemented a confirmatory step:
- Re-annotated all genomes analyzed in this study.
- Used multiple reference genomes known to have complete gene sets to guide the re-annotation.
- Distinguished genes that are truly absent from those simply missing due to annotation errors.
- Identified and added previously unannotated genes, improving the completeness and accuracy of the genome annotations.
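The comparison described in the list above can be illustrated with a minimal Python sketch. It is not the authors' workflow, which relied on GeSeq and its reference genomes. The function names, the regex-based parsing of `/gene` qualifiers, and the `expected` gene list are all assumptions made for illustration. A gene reported as `still_absent` is only a candidate loss; confirming a true loss or pseudogene still requires inspecting the sequence.

```python
from __future__ import annotations

import re

# Matches /gene="name" qualifiers in a GenBank flat-file feature table.
GENE_QUALIFIER = re.compile(r'/gene="([^"]+)"')


def gene_names(genbank_text: str) -> set[str]:
    """Return the set of gene names found in /gene qualifiers."""
    return set(GENE_QUALIFIER.findall(genbank_text))


def compare_annotations(
    original: str, reannotated: str, expected: set[str]
) -> dict[str, set[str]]:
    """Compare gene content before and after re-annotation.

    `expected` is a hypothetical list of genes the user expects in a
    complete genome; it is not provided by the dataset.
    """
    before = gene_names(original)
    after = gene_names(reannotated)
    return {
        # Genes missing from the original record but found on re-annotation.
        "recovered": (after - before) & expected,
        # Genes still not annotated: candidates for true loss.
        "still_absent": expected - after,
    }
```

For example, if the original record annotates only acpP, the re-annotated record annotates acpP and tsf, and the expected set is {acpP, tsf, serC}, the sketch reports tsf as recovered and serC as still absent.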
Methodology
Data Acquisition: Downloaded FASTA files of the organellar genome sequences from NCBI for all strains included in this study.
Annotation: Performed re-annotation using the GeSeq tool (Tillich et al., 2017), which supports annotation based on multiple reference genomes and custom settings.
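The README does not state how the FASTA files were downloaded. One possible way to script the data acquisition step is through the NCBI E-utilities `efetch` endpoint, sketched below with the Python standard library. The helper names and the placeholder accession `NC_000000` are hypothetical. The GeSeq step itself runs on the Chlorobox web server and is not scripted here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for retrieving sequence records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession: str) -> str:
    """Build an efetch URL that returns one nucleotide record as FASTA."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def download_fasta(accession: str, out_path: str) -> None:
    """Download the FASTA record for `accession` and write it to `out_path`."""
    with urlopen(fasta_url(accession)) as response, open(out_path, "wb") as handle:
        handle.write(response.read())


# Usage with a placeholder accession (replace with a real one):
# download_fasta("NC_000000", "NC_000000.fasta")
```

When fetching many records, NCBI's usage guidelines on request rates apply, so a short pause between downloads is advisable.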
Summary
By incorporating multiple complete reference genomes during re-annotation and confirming gene presence, our workflow improves the reliability of gene annotations in diatom organellar genomes. This careful validation helps prevent misinterpretation caused by missing or erroneous gene annotations in public databases.
File: 75_MT_Reannotation.zip
Description: Re-annotated mitogenomes using GeSeq in Chlorobox
File: 125_CP_Reannotation.zip
Description: Re-annotated plastomes using GeSeq in Chlorobox
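As a quick check on the archive contents, the GenBank files inside either zip can be listed without extracting them. This is a small sketch using Python's standard `zipfile` module; the function name is hypothetical, and it assumes only that the re-annotated records carry the `.gb` extension described in the naming scheme of this README.

```python
import zipfile


def list_genbank_members(zip_path):
    """Return the sorted names of all .gb files inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".gb")
        )


# Usage, assuming the archive is in the working directory:
# for name in list_genbank_members("75_MT_Reannotation.zip"):
#     print(name)
```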
Code/software
GeSeq in Chlorobox (web server) https://chlorobox.mpimp-golm.mpg.de/geseq.html
Access information
Other publicly accessible locations of the data:
- N/A
Data was derived from the following sources:
- GeSeq in Chlorobox (web server) https://chlorobox.mpimp-golm.mpg.de/geseq.html
File naming convention/scheme:
Mitogenomes (ingroups):
[total no. of ingroup taxa] MTs Reannotation-2_[Genus]**[species][NCBI accession no.]_[source: GenBank].gb
Mitogenomes (outgroups):
Outgroups MT-[Genus]**[species]-MT[Genus][species][NCBI accession no.]_[source: GenBank].gb
Plastomes (ingroups):
[total no. of ingroup taxa] CP Re-annotation_[Genus]**[species][NCBI accession no.]_[source: GenBank].gb
Plastomes (outgroups):
Outgroups CP-Reannotation_[Genus]**[species][NCBI accession no.]_[source: GenBank].gb
Genomes deposited in NCBI GenBank may contain annotation errors or missing gene data, which can lead to misinterpretations regarding gene losses and gains. In fact, this was the case for every one of the published organellar genomes we analyzed. To address this, we included a confirmatory step to verify whether genes were actually absent or simply missing due to annotation issues. Thus, all genomes were re-annotated using multiple reference genomes to ensure annotation completeness.
