Skip to main content

Genomics of new ciliate lineages provides insight into the evolution of obligate anaerobiosis - single gene datasets for phylogenomic analysis of anaerobic ciliates (SAL, Ciliophora), protein datasets for mitochondrial pathways prediction, and mitochondrial genomes

Cite this dataset

Rotterova, Johana et al. (2020). Genomics of new ciliate lineages provides insight into the evolution of obligate anaerobiosis - single gene datasets for phylogenomic analysis of anaerobic ciliates (SAL, Ciliophora), protein datasets for mitochondrial pathways prediction, and mitochondrial genomes [Dataset]. Dryad.


Oxygen plays a crucial role in energetic metabolism of most eukaryotes. Yet, adaptations to low oxygen concentrations leading to anaerobiosis have independently arisen in many eukaryotic lineages, resulting in a broad spectrum of reduced and modified mitochondrial organelles (MROs). In this study, we present the discovery of two new class-level lineages of free-living marine anaerobic ciliates, Muranotrichea, cl. nov. and Parablepharismea, cl. nov., that, together with the class Armophorea, form a major clade of obligate anaerobes (APM ciliates) within the SAL (Spirotrichea, Armophorea, Litostomatea) group. To deepen our understanding of the evolution of anaerobiosis in ciliates, we predicted the mitochondrial metabolism of cultured representatives from all three classes in the APM clade, using transcriptomic and metagenomic data, and performed phylogenomic analyses to assess their evolutionary relationships. The predicted mitochondrial metabolism of representatives from the APM ciliates reveals functional adaptations of metabolic pathways that were present in their last common ancestor and likely led to the successful colonization and diversification of the group in various anoxic environments. Furthermore, we discuss the possible relationship of Parablepharismea to the uncultured deep-sea class Cariacotrichea based on single gene analyses. Like most anaerobic ciliates, all studied species of the APM clade host symbionts, which we propose to be a significant accelerating factor in the transitions to an obligately anaerobic lifestyle. Our results provide an insight into the evolutionary mechanisms of early transitions to anaerobiosis and shed light on fine-scale adaptations in MROs over a relatively short evolutionary timeframe.


Illumina NextSeq Sequencing

Metagenomes were sequenced from the picked cells and a culture using Illumina NextSeq (Illumina Inc.) (150-bp paired-end reads, 300-bp insert size). Transcriptome was sequenced by EMBL Genomics Core Facility (GeneCore), Heidelberg, Germany.

Metagenomic and Transcriptomic Assemblies

For metagenomes, Trimmomatic v0.36 was used to filter and trim paired reads from all samples. Filtered and trimmed reads were then combined and co-assembled with IDBA-UD v.1.1.1. Metagenomic contigs > 1 kb were binned using CONCOCT as implemented in Anvi’o v2.0.3, which bins contigs based on nucleotide composition and differential coverage data from mapping reads to the co-assembly. The reads mapping to an affiliated genomic bin of each target ciliate from the initial binning results were re-assembled with IDBA-UD v.1.1.1 to improve bin quality. CheckM was used to evaluate the completeness, contamination and taxonomy of genomic bins, as well as to identify rRNA gene sequences. Additional metagenomic reads from Muranothrix gubernata SUMMARTIN, sequenced subsequently in order to obtain mitochondrial genome from Muranotrichea, were quality trimmed using Trimmomatic v0.39, corrected by Rcorrector and assembled using Metaspades (SPAdes genome assembler v3.13.1) with default settings for paired-end reads in "only-assembler" mode. Contigs representing mitogenomes were identified in metagenome assemblies by BLAST using available protein sequences from ciliate mitogenomes as a query (GU057832.1, JN383843.1, NC_014262.1). Subsequently, we used manual approach based on iterative searches in Illumina reads (BLASTN) followed by read alignment to the corresponding contig (Geneious Prime; “map to reference” option) to unambiguously extend the longest Parablepharisma sp. mitogenome fragment. This resulted in 41,931 bp long sequence (opposed to original contig length 35,877 bp). Initial annotations of mitogenome sequences were obtained using MFannot (online version from October 23, 2019). Predicted genes recovered by MFannot (including hypothetical proteins) were individually checked to confirm/reveal their identity and to exclude possible bacterial contamination using BLAST against the nr GenBank database. Graphical genome maps were produced by OrganellarGenomeDRAW (OGDRAW) v1.3.1. For transcriptomic data, quality trimming and Illumina adapter and sequence contamination removal was done using the program Trimmomatic. Transcriptomes were assembled using the software package Trinity.

Datasets preparation and rRNA gene phylogenetic analyses

Two data sets, containing 18S rRNA gene sequences/18S and 28S rRNA gene sequences concatenated using Mesquite [86], consisted of 17/2 newly determined sequences of Muranotrichea and Parablepharismea, 1/2 newly determined and 24/0 GenBank sequences of Metopida (Armophorea), 12/1 GenBank sequences of Clevelandellida (Armophorea), 3/0 GenBank sequences of Armophorida (Armophorea), 8/6 GenBank sequences of Litostomatea, 28/31 GenBank sequences of Spirotrichea sensu lato, 2/1 GenBank sequences of Odontostomatea, 1/0 GenBank sequence of Cariacotrichea, and 41/41 GenBank sequences of other ciliates (CONThreeP, Protocruziiea, Heterotrichea, Karyorelictea) used as outgroup. Environmental GenBank sequences KT346287 and KT346288, affiliated as related to the newly described taxa by BLAST, were excluded from the analyses due to chimeric origin. The sequences were aligned using MAFFT on the MAFFT 7 server ( with L-INS-i algorithm and default settings. The alignments were manually checked for chimeras and edited using BioEdit 18S rRNA, and concatenated 18S and 28S rRNA phylogenetic trees were constructed by maximum likelihood (ML) and Bayesian analyses. ML analyses were performed in RAxML 8.0.0 under the GTRGAMMAI model with 1000 rapid bootstraps. Node support was assessed by ML analysis of 1000 bootstrap data sets. Bayesian analyses were performed using Phylobayes with GTR CAT. For 18S rRNA gene, four independent chains were run for 34,720 generations (maxdiff = 0.0531, 20% burnin). For concatenated 18S and 28S rRNA genes, four independent chains were run for 150,000 generations (maxdiff = 0.14, 20% burnin).

Phylogenomic analyses

Extensive phylogenomic datasets of ciliates, containing ~160 and 124 genes, respectively, were used as reference datasets. Up to five best hits were recovered (E-value < 1E-5) for each gene from the Muranothrix transcriptome using BLAST and translated into protein sequences using the Barrel-O-Monkeys package ( In the case of metagenomes, genes of interest were recovered using in-house script ( Each recovered sequence was then reciprocally blasted against reference datasets enriched with well-defined known paralogs (e.g., EF-1α and EFL) to assist in the identification and removal of deep-paralogs. BLAST searches against the Swissprot database was used to trim away non-homologous sequence data at the end of predicted genes. Subsequently, all sequences were added to single-gene alignments. These were then aligned using the MAFFT-LINSI algorithm (with default parameters), trimmed in BMGE (gap threshold set to 0.3) and single gene trees were reconstructed using RAxML (model setting PROTGAMMALGF) with 100 rapid bootstraps. Single gene trees were then investigated by eye and proper orthologs were selected from targeted taxa, while putative paralogs were removed, creating final version of the single gene datasets. Each single gene dataset was then aligned by MAFFT LINSI algorithm (with default parameters), followed by trimming using BMGE v 1.2 (gap threshold set to 0.3). Trimmed alignments were then concatenated into one super-matrix using the program alvert from the Barrel-O-Monkeys package. The maximum likelihood phylogenomic tree was constructed in IqTree using the C40+LG+F+G model with 1000 PMSF bootstraps. Bayesian analysis was performed using Phylobayes with CAT-GTR+G with constant sites removed. Four independent chains were run for 16,000 generations (maxdiff = 0.002, 20% burnin).

In-silico predictions of mitochondrial metabolism

Genes of interests were collected from previously published studies of anaerobic mitochondria and from the published mitochondrial proteome of Tetrahymena thermophila. Candidate sequences from ciliates were recovered by blast and hits with e-value < 1e-10 were extracted and added to the initial datasets for each gene. Published datasets were used as starting datasets, where available. When a published dataset was not available, starting datasets were constructed by blast searches of a seed sequence against nr (max target seqs = 100). Additionally, sequences from taxonomically diverse eukaryotic and prokaryotic representatives identified through searches of the gene name in NCBI protein and RefSeq databases. The resulting sequences were combined into a single dataset and clustered in CD-Hit at 75%. Each gene dataset was aligned using MAFFT-LINSI (default settings), trimmed using trimal (gappyout) and trees were constructed using RAxML (PROTGAMMALG with 100 rapid bootstraps). Each alignment/tree has been repeatedly inspected by eye and sequences were added/removed as necessary. Each tree was then manually evaluated for presence absence of genes of interest in studied lineages.

Usage notes

Single gene datasets for phylogenomic analysis:

AAC1.faa ACO.faa ACS_ab1.faa ACS_ab2.faa ALT.faa AOX.faa ArcA.faa

ASCT_1A.faa ASCT_1C.faa AST.faa ATM1.faa Atp1.faa Atp12.faa Atp2.faa

Atp3.faa Atp5.faa Atp9.faa BCKDHa.faa BCKDHb.faa CBS.faa cox19.faa

cox5b.faa cyt_c1.faa DHFS.faa Fd.faa FKBP.faa FtsH.faa FUM.faa G3PD.faa

GCSH.faa GCSL.faa GCSP1.faa GCSP2.faa GCST.faa GDH.faa GR.faa hAAC.faa

Hcp.faa hmp31.faa HscB.faa hsp10.faa hsp40.faa hsp60.faa hsp70.faa

hsp90.faa HydA.faa IND1.faa IscA2.faa IscS.faa IscU.faa Isd11.faa KBL.faa

MAS5.faa MC.faa MCF.faa MDH.faa ME.faa MMM.faa MOC.faa MPPa.faa NAD11.faa

NuoE.faa NuoF.faa OGDH.faa PCCa.faa PCCb.faa PDH-E1a.faa PDH-E1b.faa

PDHe2.faa PDI.faa Pet8.faa PrA.faa PRX.faa PSAT.faa Rieske.faa SCSa.faa

SCSb.faa SDHa.faa SDHb.faa SHMT.faa SOD_Fe.faa SOD_Mn.faa TOM34.faa TPN.faa



Protein datasets for prediction of mitochondrial pathways:

ar21.fas arc20.fas capz.fas cct-A.fas cct-B.fas cct-D.fas cct-E.fas

cct-G.fas cct-N.fas cct-T.fas cct-Z.fas cpn60.fas crfg.fas fh.fas gdi2.fas

glcn.fas grc5.fas Gtub.fas hmt1.fas HSP70C.fas HSP70-E.fas hsp70mt.fas

HSP90.fas if2b.fas if2g.fas if2p.fas if6.fas ino1.fas l12e-A.fas mcm-A.fas

mcm-B.fas mcm-C.fas mcm-D.fas mcm-E.fas mcm-F.fas metap2.fas mra1.fas

nsf1-E.fas nsf1-G.fas nsf1-I.fas nsf1-J.fas nsf1-K.fas nsf1-L.fas

nsf1-M.fas nsf2-B.fas nsf2-F.fas oplah.fas orf2.fas pace2-A.fas pace2B.fas

Pace2C.fas pace5.fas psma-A.fas psma-B.fas psma-C.fas psma-D.fas psma-E.fas

psma-G.fas psma-H.fas psma-I.fas psma-J.fas psmb-K.fas psmb-L.fas

psmb-M.fas psmb-N.fas psmd.fas rad23.fas Rad51A.fas rf1.fas rla2a.fas

rla2b.fas rpl11.fas rpl12.fas Rpl13A.fas Rpl13e.fas Rpl15.fas Rpl18.fas

rpl20.fas rpl21.fas rpl26.fas Rpl3.fas rpl30.fas rpl32.fas rpl33.fas

rpl35.fas rpl43.fas rpl44.fas Rpl4b.fas Rpl5.fas rpl6.fas rpl9.fas

rpo-A.fas rpo-C.fas rppO.fas rps10.fas rps11.fas rps14.fas rps15.fas

rps16.fas rps17.fas rps18.fas rps20.fas rps23.fas rps26.fas rps27.fas

rps3.fas rps4.fas rps5.fas rps6.fas rps8.fas s15a.fas s15p.fas sap40.fas

sra.fas srp54.fas srs.fas suca.fas tfiid.fas topo1.fas trs.fas vata.fas

vatb.fas vatc.fas wrs.fas


Mitochondrial genomes:


National Science Foundation, Award: 1322928