Primary data of thiotrophic free-living and ectosymbiotic genomes

Cite this dataset

Espada-Hinojosa, Salvador (2023). Primary data of thiotrophic free-living and ectosymbiotic genomes [Dataset]. Dryad. https://doi.org/10.5061/dryad.wh70rxwrq

Abstract

Thiotrophic symbioses between sulfur-oxidizing bacteria and various unicellular and metazoan eukaryotes are widespread in reducing marine environments. The giant colonial ciliate Zoothamnium niveum and its vertically transmitted ectosymbiont Candidatus Thiobius zoothamnicola (Thiobius for short), however, form the only thiotrophic mutualism that has been cultivated so far. Because theoretical predictions posit a smaller genome in vertically transmitted endosymbionts compared to free-living relatives, we investigated whether this also holds for an ectosymbiont. We used metagenomics to recover the high-quality draft genome of this bacterial symbiont. For comparison, we also sequenced a closely related, free-living, cultured but not formally described strain, Milos ODIII6 (ODIII6 for short). We then performed comparative genomics to assess the functional capabilities at the gene, metabolic pathway, and trait levels. 16S rRNA gene trees and average amino acid identity confirmed the close phylogenetic relationship of both bacteria. Indeed, Thiobius had a genome about a third smaller than that of its free-living relative ODIII6, with reduced metabolic capabilities and fewer functional traits. The functional capabilities of Thiobius were mostly a subset of those of the more versatile ODIII6, which possessed additional genes related to oxygen, sulfur, and hydrogen utilization and to the acquisition of phosphorus, illustrating its adaptations to unstable environmental conditions at hydrothermal vents. In contrast, Thiobius possessed genes for heterotrophy, potentially enabling it to utilize lactate and acetate, which may be provided as byproducts by the host. The present study illustrates the effect of strict host dependence of a bacterial ectosymbiont on genome evolution and host adaptation.

README: Primary Data of Thiotrophic free-living and ectosymbiotic genomes


01.ML16S.redu.nwk md5=403c2de8a2ac17fdd55559414657b454
This file includes the maximum likelihood (ML) 16S rRNA gene tree in newick format after pruning all branches of organisms that do not have an available genome.

02.ML.16S.306seqs.100BS.nwk md5=5df2c88afae8c5ec8643cc430ef22cfd
This file includes the maximum likelihood 16S rRNA gene tree as obtained from the R workflow (see code in the R-Markdown .Rmd file 18).

03.MP16S.redu.nwk md5=d809ed2d245816d80fb8e14fc685ca9f
This file includes the maximum parsimony (MP) 16S rRNA gene tree in newick format after pruning all branches of organisms that do not have an available genome.

04.MP.16S.306seqs.100BS.nwk md5=32fdefd95b5d0ebee1631b43d7a87ede
This file includes the maximum parsimony 16S rRNA gene tree as obtained from the R workflow (see code in the R-Markdown .Rmd file 18).

05.BI16S.redu.nwk md5=d9ac087e04306794b329fefb64722aaa
This file includes the Bayesian inference (BI) 16S rRNA gene tree in newick format after pruning all branches of organisms that do not have an available genome.

06.BI16S.306seqs.nwk md5=5b7eeb57af90de2d989f6189dbfb2bb7
This file includes the Bayesian inference 16S rRNA gene tree as obtained from MrBayes.
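As a quick sanity check on the newick tree files 01 to 06, the tip labels can be listed without any phylogenetics library. The following is only a minimal sketch: the function name `tip_labels` and the example tree are illustrative, and the pattern assumes unquoted labels without parentheses, commas, colons or semicolons. For real analyses a dedicated tree parser is the safer choice.

```python
import re


def tip_labels(newick: str) -> list[str]:
    """Return the tip labels of a Newick string (unquoted labels only)."""
    # A tip label starts right after "(" or "," and runs up to the next
    # ":", ",", ")" or ";". Internal node labels follow ")" and are skipped.
    pattern = re.compile(r"[(,]\s*([^(),:;\s][^(),:;]*)")
    return [match.group(1).strip() for match in pattern.finditer(newick)]


# Illustrative tree with branch lengths and internal support values.
example = "((A:0.1,B:0.2)90:0.05,(C:0.3,D:0.4)75:0.02,E:0.5);"
print(tip_labels(example))
```

Comparing the number of tips in a pruned tree with the number in the corresponding full tree shows how many organisms were dropped for lack of an available genome.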

07.BI_ML_MP_supports.16Sredu.csv md5=628ab33dced304b5ceb3a893dfa3a5e9
Column "Node" refers to the internal phylogenetic node identifier. Columns "BI support", "ML support" and "MP support" give the support of each internal node (bootstrap for ML and MP, and posterior probability for BI), all in percent. Several "NA" (not available) values occur at the root nodes due to the techniques used. A support value of "--" means that the tree topology obtained with that column's technique did not include this internal node. The ML tree topology is used as the reference.
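When reading the support table programmatically, the two special markers need separate handling because they mean different things. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the tolerance for a decimal comma is a precaution rather than a documented property of file 07.

```python
from typing import Optional


def parse_support(cell: str) -> Optional[float]:
    """Convert one support cell to a percentage, or None if there is no value."""
    value = cell.strip()
    if value in ("NA", "--", ""):
        return None
    # Tolerate a decimal comma in case the file uses one (assumption).
    return float(value.replace(",", "."))


def node_in_topology(cell: str) -> bool:
    """False only when the column's technique did not recover the node ("--")."""
    return cell.strip() != "--"


print(parse_support("98"), parse_support("NA"), node_in_topology("--"))
```

Keeping `node_in_topology` separate from `parse_support` preserves the distinction between a node that is absent from a topology and a node whose support is simply not available.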

08.Phylogeny.16S.mafft.trimal.fna md5=05119c491f84ac6f56377f2c4ae56dfe
Alignment used to generate the phylogenies, in fasta nucleotide file format.

09.TaxaSelection16S.csv md5=3615b834dfa2a0589105d0b3d61392f8
Details of the sequences, including the following columns: SequenceID, SpeciesID, YearOfTree, Accession, PublishedTree, TipLabel, Database, Length, Start16S, End16S, Contig16S, Notes. "SequenceID" is a running number for sequences. "SpeciesID" is a running number for species. "YearOfTree" indicates the year of publication of the phylogenetic tree from which the sequence accession number was taken. "Accession" is the accession number of the sequence. "PublishedTree" gives the reference of the peer-reviewed publication in which the tree that includes the sequence was released. "TipLabel" states the organism name. "Database" is the name of the database that hosts the accession. "Length" states the sequence length in nucleotides. "Start16S" and "End16S" give the start and end coordinates of the sequence. "Contig16S" states the number of the contig from which the 16S rRNA gene was taken. "Notes" contains empty cells wherever there was nothing to note.

10.Metagenomes.csv md5=8230850cbd1f1b3e03d8460ec294d18b
Details of the metagenomes and their Illumina paired-end read pairs. "GenomeID" is the identifier of the genome, which in the first three rows corresponds to Candidatus Thiobius zoothamnicola. "MD5" is the MD5 hash of the BAM file from which the FASTQ file pairs originate. "FileSize" is the size in bytes of that BAM file. "ReadPairs" is the number of read pairs included in the files before any post-processing or filtering.

11.SampleCollection.csv md5=aff4657109165b398aead88b5d1f7b73
Sample collection details, including geographical location, year, and collector. "SalmplingSite" is the identifier of the place where samples were collected. "SovereignState" is the country of sampling. "LatitudeDMS" and "LongitudeDMS" are the geographical coordinates of the sampling site in degrees (d), minutes (m) and seconds (s), where "N" means Northern hemisphere and "W" means Western hemisphere. "LongitudeNumerical" and "LatitudeNumerical" are the same coordinates in decimal degrees, with negative longitude values lying west of the Greenwich meridian. "SamplingTime" shows the year of collection. "SamplingPerson" refers to the person who collected the sample.
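The relation between the DMS columns and the numerical columns can be sketched as follows. The exact text layout of the DMS cells is not documented here, so the pattern below assumes a form such as `16d48m09sN` (it also accepts degree and quote symbols as separators), and the coordinates used are made-up examples, not values from the file.

```python
import re

# Degrees, minutes, seconds separated by any non-digit characters,
# followed by an upper-case hemisphere letter (assumed layout).
_DMS = re.compile(r"\s*(\d+)\D+?(\d+)\D+?(\d+(?:\.\d+)?)\D*?([NSEW])\s*")


def dms_to_decimal(text: str) -> float:
    """Convert a DMS coordinate string to signed decimal degrees."""
    match = _DMS.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised DMS coordinate: {text!r}")
    degrees = int(match.group(1))
    minutes = int(match.group(2))
    seconds = float(match.group(3))
    value = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative.
    return -value if match.group(4) in "SW" else value


print(dms_to_decimal("16d48m09sN"))  # hypothetical latitude
print(dms_to_decimal("88d04m55sW"))  # hypothetical longitude, negative
```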

12.DNAextraction.csv md5=d80e13a9889bf5151d1e561c5bbd0fcb
Details of the genome extractions, including the columns: GenomeID, VialID, Wood, Fixation, NanoConcEx, Na260to280, Na260to230, PicoGreenEx, PicoGreenLi. "GenomeID" is a running number over a broader set of metagenomes that were tried, of which the best ones in terms of assembly metrics are used in this study. "VialID" is an internal reference from our lab that links to our sampling documentation. "Wood" is a running number for the sunken wood sampled through the years by our research group. "Fixation" is either "Cryo" (cryofixation by flash-freezing in liquid nitrogen) or "EtOH" (fixation in absolute ethanol). "NanoConcEx" is the Nanodrop measurement of DNA concentration in ng/µL. "Na260to280" and "Na260to230" are the absorbance ratios between the two stated wavelengths, used to assess sample purity. "PicoGreenEx" is the concentration of the DNA extract according to the PicoGreen assay, in ng/µL. "PicoGreenLi" is the concentration of the DNA library according to the PicoGreen assay, in ng/µL.

13.ThiobiusG43.curated.tsv md5=7d8c1750732f8ec6af0abac20dfb99f1
Curated outcome of gene prediction and annotation, including the columns: Local_ID, Contig_id, Feature_id, Type, Start, Stop, TraitNumber, TraitShortName, TraitEvidence, Inferred_function, Notes, Orthogroup, GeneSymbol, GeneEvidence, GeneClass, COGcat. Some columns (e.g. GeneSymbol, GeneEvidence, ...) contain empty cells, meaning that no information is available from the analysis for that row. "Local_ID" is a running number. "Contig_id" states the number of the contig on which the gene lies. "Feature_id" is the RASTtk annotation label. "Type" is the type of feature, in this case "peg" for all rows. "Start" and "Stop" are the coordinates of the feature in the indicated contig. "TraitNumber" is a running number of the trait. "TraitShortName" is a short name for the trait. "TraitEvidence" is the evidence backing the trait call. "Inferred_function" is the functional annotation. "Notes" holds free-text comments where needed. "Orthogroup" is the label from the orthology analysis. "GeneSymbol" is a short name for the gene. "GeneEvidence" is the evidence backing the gene call. "GeneClass" assigns the gene, whenever possible, to the core, shell, or cloud genome. "COGcat" is the assigned functional COG category.

14.ODIII6.curated.tsv md5=25f4a9fa34e36eb1c16d5dbafd0bd73e
Curated outcome of gene prediction and annotation, including the columns: Local_ID, Contig_id, Feature_id, Type, Start, Stop, TraitNumber, TraitShortName, TraitEvidence, Inferred_function, Notes, Orthogroup, GeneSymbol, GeneEvidence, GeneClass, COGcat. Some columns (e.g. GeneSymbol, GeneEvidence, ...) contain empty cells, meaning that no information is available from the analysis for that row. "Local_ID" is a running number. "Contig_id" states the number of the contig on which the gene lies. "Feature_id" is the RASTtk annotation label. "Type" is the type of feature, in this case "peg" for all rows. "Start" and "Stop" are the coordinates of the feature in the indicated contig. "TraitNumber" is a running number of the trait. "TraitShortName" is a short name for the trait. "TraitEvidence" is the evidence backing the trait call. "Inferred_function" is the functional annotation. "Notes" holds free-text comments where needed. "Orthogroup" is the label from the orthology analysis. "GeneSymbol" is a short name for the gene. "GeneEvidence" is the evidence backing the gene call. "GeneClass" assigns the gene, whenever possible, to the core, shell, or cloud genome. "COGcat" is the assigned functional COG category.

15.ThiobiusMAGandODIII6.csv md5=d28becd25f63777c699c8b61040f36ca
Summary of genome metrics, including the following columns: GenomeID, MD5, FileSize, ContigsNumber, Nucleotides, N50, BUSCO_Completeness, CheckM_Completeness, CheckM_Contamination, CheckM_Heterogeneity, GCcontentPerCent, tRNAs, AAtRNAs, MissingAAtRNA. "GenomeID" is a running number over a broader set of metagenomes that were tried, of which the best ones in terms of assembly metrics are used in this study. "MD5" is the MD5 hash of the BAM file from which the FASTQ file pairs originate. "FileSize" is the size in bytes of that BAM file. "ContigsNumber" is the number of contigs in the assembly. "Nucleotides" is the assembly size in nucleotides. "N50" is the contiguity metric, in nucleotides. "BUSCO_Completeness", "CheckM_Completeness", "CheckM_Contamination" and "CheckM_Heterogeneity" are metrics of assembly quality. "GCcontentPerCent" is the GC content in percent. "tRNAs" is the number of encoded tRNAs. "AAtRNAs" is the number of amino acids covered by the encoded tRNAs. "MissingAAtRNA" specifies which amino acid lacks a tRNA, as an indication of incompleteness in the assembly. This file is semicolon-delimited and uses a comma instead of a period as the decimal separator.
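Because this file combines a semicolon delimiter with a decimal comma, a default CSV reader will misparse the numeric columns. The following is a minimal sketch using only the Python standard library; the function name `read_genome_metrics` is hypothetical, and the sample rows are invented placeholders that only borrow column names from the list above.

```python
import csv
import io


def read_genome_metrics(handle, numeric_columns):
    """Read a semicolon-delimited table and convert decimal-comma numbers."""
    rows = []
    for row in csv.DictReader(handle, delimiter=";"):
        for name in numeric_columns:
            raw = (row.get(name) or "").strip()
            # "45,5" becomes 45.5; empty cells become None.
            row[name] = float(raw.replace(",", ".")) if raw else None
        rows.append(row)
    return rows


# Invented placeholder values, not taken from the dataset.
sample = io.StringIO(
    "GenomeID;Nucleotides;N50;GCcontentPerCent\n"
    "X1;2000000;50000;45,5\n"
    "X2;3000000;80000;50,25\n"
)
rows = read_genome_metrics(sample, ["Nucleotides", "N50", "GCcontentPerCent"])
size_ratio = rows[0]["Nucleotides"] / rows[1]["Nucleotides"]
print(rows[0]["GCcontentPerCent"], round(size_ratio, 3))
```

For the real file, the same function can be called on `open("15.ThiobiusMAGandODIII6.csv", newline="")`; the file encoding is not stated in this README and may need to be set explicitly.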

16.MetaCycOUTCOME.csv md5=7a557d8c45186e0decaae663268eba90
Outcome of the MetaCyc pathways assessment including the following columns: LocalID, Pathway, MetaCycID, css, mso, tzb, tta, tbm, svs, tes, tea, epe, sti, avm, tdi, mpm, lpa, tms, toi, ts0, hhs, hcs, hki, hms, tcs, tca, tae, bss, bas, sup, taa, rma, voi, tss, TraitShortName, CoreTouched, KEGGm. "LocalID" is a running number. "Pathway" states the name of the metabolic pathway. "MetaCycID" is the label from the MetaCyc database. "TraitShortName" is a short name for the trait. "CoreTouched" indicates whether the genes employed for the core genome phylogeny are touched by the pathway, where blank cells mean "no". "KEGGm" is the corresponding KEGG module. Organisms: avm,Allochromatium vinosum; bss,Bathymodiolus sp. SMAR symbiont; bas,Bathymodiolus azoricus symbiont; epe,Candidatus Endoriftia persephone; rma,Candidatus Ruthia magnifica; tzb,Candidatus Thiobios zoothamnicoli str. BelizeG43; tes,Candidatus Thiodiazotropha endoloripes; tea,Candidatus Thiodiazotropha endolucinida; taa,Candidatus Thioglobus autotrophica; tss,Candidatus Thioglobus singularis; toi,Candidatus Thiosymbion oneisti; voi,Candidatus Vesicomyosocius okutanii; css,Chrysomallon squamiferum symbiont; hcs,Hydrogenovibrio crunogenus; hhs,Hydrogenovibrio halophilus; hki,Hydrogenovibrio kuenenii; hms,Hydrogenovibrio marinus; lpa,Lamprocystis purpurea; mpm,Marichromatium purpuratum; mso,Milos strain ODIII6; sti,Sedimenticola thiotaurini; svs,Solemya velum symbiont; tms,Thioflavicoccus mobilis; tbm,Thiolapillus brandeum; tcs,Thiomicrorhabdus chilensis; tae,Thiomicrospira aerophila; tca,Thiomicrospira cyclica; tdi,Thiorhodococcus drewsii; ts0,Thiorhodovibrio sp. 970; tta,Thiosocius teredinicola; sup,uncultured SUP05 cluster bacterium. "1" stands for presence, and "0" for absence. Some columns (TraitShortName, KEGGm) contain empty cells, meaning that no information is available from the analysis for that row. This file is semicolon-delimited.
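The presence/absence coding makes it straightforward to check how far the pathways of one organism are a subset of another's, for example the symbiont (`tzb`) against the free-living strain (`mso`). The sketch below is illustrative only: `compare_presence` is a hypothetical helper and the four sample rows are invented, not taken from the file.

```python
import csv
import io


def compare_presence(handle, first="tzb", second="mso"):
    """Split pathway names into shared, first-only and second-only lists."""
    result = {"shared": [], "only_" + first: [], "only_" + second: []}
    for row in csv.DictReader(handle, delimiter=";"):
        in_first = row[first].strip() == "1"
        in_second = row[second].strip() == "1"
        if in_first and in_second:
            result["shared"].append(row["Pathway"])
        elif in_first:
            result["only_" + first].append(row["Pathway"])
        elif in_second:
            result["only_" + second].append(row["Pathway"])
    return result


# Invented rows that only mimic the documented layout.
sample = io.StringIO(
    "LocalID;Pathway;MetaCycID;mso;tzb\n"
    "1;hypothetical pathway A;ID-A;1;1\n"
    "2;hypothetical pathway B;ID-B;1;0\n"
    "3;hypothetical pathway C;ID-C;0;1\n"
    "4;hypothetical pathway D;ID-D;0;0\n"
)
presence = compare_presence(sample)
print({key: len(names) for key, names in presence.items()})
```

Pathways absent from both organisms are deliberately left out of all three lists.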

17.KEGGmOUTCOME.csv md5=00162551d1e98e5e7ba128cce9ecd4ad
Outcome of the KEGG modules assessment including the following columns: LocalID, ModuleName, ModuleM, css, mso, tzb, tta, tbm, svs, tes, tea, epe, sti, avm, tdi, mpm, lpa, tms, toi, ts0, hhs, hcs, hki, hms, tcs, tca, tae, bss, bas, sup, taa, rma, voi, tss, TraitNumber, TraitShortName, CoreTouched. "LocalID" is a running number. "ModuleName" states the name of the KEGG module. "ModuleM" is the short KEGG identifier. "TraitShortName" is a short name for the trait. "CoreTouched" indicates whether the genes employed for the core genome phylogeny are touched by the module, where blank cells mean "no". Organisms: avm,Allochromatium vinosum; bss,Bathymodiolus sp. SMAR symbiont; bas,Bathymodiolus azoricus symbiont; epe,Candidatus Endoriftia persephone; rma,Candidatus Ruthia magnifica; tzb,Candidatus Thiobios zoothamnicoli str. BelizeG43; tes,Candidatus Thiodiazotropha endoloripes; tea,Candidatus Thiodiazotropha endolucinida; taa,Candidatus Thioglobus autotrophica; tss,Candidatus Thioglobus singularis; toi,Candidatus Thiosymbion oneisti; voi,Candidatus Vesicomyosocius okutanii; css,Chrysomallon squamiferum symbiont; hcs,Hydrogenovibrio crunogenus; hhs,Hydrogenovibrio halophilus; hki,Hydrogenovibrio kuenenii; hms,Hydrogenovibrio marinus; lpa,Lamprocystis purpurea; mpm,Marichromatium purpuratum; mso,Milos strain ODIII6; sti,Sedimenticola thiotaurini; svs,Solemya velum symbiont; tms,Thioflavicoccus mobilis; tbm,Thiolapillus brandeum; tcs,Thiomicrorhabdus chilensis; tae,Thiomicrospira aerophila; tca,Thiomicrospira cyclica; tdi,Thiorhodococcus drewsii; ts0,Thiorhodovibrio sp. 970; tta,Thiosocius teredinicola; sup,uncultured SUP05 cluster bacterium. "co", complete; "1m", one module missing; "2m", two modules missing; "in", incomplete; and "0" stands for absence. Some columns contain empty cells, meaning that no information is available from the analysis for that row. This file is semicolon-delimited.
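Unlike the MetaCyc table, the organism columns here hold five status codes rather than 0/1, so a per-organism summary needs a small lookup table. This sketch simply restates the code legend given above as a dictionary; the function name and the example column values are hypothetical.

```python
from collections import Counter

# Status codes as described in the column legend.
MODULE_STATUS = {
    "co": "complete",
    "1m": "one module missing",
    "2m": "two modules missing",
    "in": "incomplete",
    "0": "absent",
}


def tally_module_status(cells):
    """Count how many modules of one organism fall into each status."""
    counts = Counter()
    for cell in cells:
        code = cell.strip()
        if code not in MODULE_STATUS:
            raise ValueError(f"unexpected module status code: {code!r}")
        counts[MODULE_STATUS[code]] += 1
    return counts


# Invented values standing in for one organism column.
example_column = ["co", "1m", "0", "co", "in", "2m", "0"]
summary = tally_module_status(example_column)
print(dict(summary))
```

Raising on unknown codes, rather than silently counting them, makes it obvious if a cell deviates from the documented legend.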

18.Espada-Hinojosa_et_al_2022_PrimaryData.Rmd md5=b7297234dc067b1dd587f91482ce7f62
R-Markdown Primary Data report code. This file is made available under the CC0 license waiver.

19.Espada-Hinojosa_et_al_2022_PrimaryData.html md5=88fb0354828aae82d00eee5c25a96eb2
R-Markdown Primary Data html report. This file is made available under the CC0 license waiver.
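The md5 value listed next to each file name above can be used to confirm that a download is intact. The following is a minimal sketch with Python's `hashlib`; the helper names `md5_of_file` and `matches_readme` are hypothetical, and the self-check at the end runs on a throwaway temporary file rather than on a dataset file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_readme(path, expected_md5):
    """Compare a file's digest with the md5 value listed in the README."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Self-check on a temporary file with known content.
content = b"(A,B);\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name
demo_md5 = md5_of_file(demo_path)
demo_ok = matches_readme(demo_path, hashlib.md5(content).hexdigest())
os.remove(demo_path)
print(demo_md5, demo_ok)
```

For a downloaded file the call would look like `matches_readme("01.ML16S.redu.nwk", "403c2de8a2ac17fdd55559414657b454")`, using the file name and checksum exactly as listed in this README.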

Methods

Files 1 to 6 are phylogenetic trees in newick format obtained with R and MrBayes. File 7 compiles the support metrics of the internal phylogenetic nodes wherever these are shared between trees. File 8 provides the nucleotide alignment of the 16S rRNA gene sequences in fasta format. File 9 reports the information on the 310 16S rRNA gene sequences employed, and file 10 describes the metagenomes. Sample collection details are reported in file 11, and DNA extraction in file 12. Files 13 and 14 report the outcome of the manually curated genome functional annotations. File 15 summarizes genome features. Metabolic pathway inference outcomes are reported in files 16 and 17. Finally, files 18 and 19 make up the R-Markdown Primary Data report.

Funding

FWF Austrian Science Fund, Award: P 32197

FWF Austrian Science Fund, Award: P 24565