Recent origin of iron oxidation in extant microbial groups and low clade fidelity of iron metabolisms

Published Jul 15, 2025 on Dryad. https://doi.org/10.5061/dryad.905qfttvf

Abstract

Reduced iron was abundant in Earth’s surface environments before their oxygenation, so iron oxidation could have been a common metabolism on the early Earth. Consequently, modern microbial iron oxidation is sometimes seen as a holdover from an earlier biosphere, but the continuity of involved lineages or the metabolic process itself has not been verified. Modern neutrophilic iron oxidizers use cytochrome-porin Cyc2 as the initial electron acceptor in iron oxidation. With the protein as a proxy for the metabolism, we performed a phylogenetic analysis of Cyc2 to understand the evolutionary history of this microbial iron oxidation pathway.

In addition to known iron oxidizers, we identified Cyc2 orthologs in gammaproteobacterial endosymbionts of lucinid bivalves. These bivalves have a robust fossil record and rely on sea-grass meadows that only appear in the Cretaceous, providing a valuable time calibration in the evolutionary history of Cyc2. Our molecular clock analysis shows that extant sampled Cyc2 diversity has surprisingly recent common ancestry, and iron oxidation metabolisms in Gallionellaceae, Zetaproteobacteria, and photoferrotrophic Chlorobi originated in the Neoproterozoic or the Phanerozoic via multiple non-vertical inheritance events.

The groups responsible for microbial iron oxidation have thus changed over Earth history, possibly reflecting the instability of niches with sufficient reduced iron. We note that frequent transfer and changing taxonomic distribution may be a general pattern for traits which are selected for sporadically across space and time. Based on iron metabolism and other processes, we explore this concept of a trait’s “clade fidelity” (or lack thereof) and establish its evolutionary importance.

https://doi.org/10.5061/dryad.905qfttvf

Description of the data and file structure

This dataset contains multiple sequence alignments and phylogenetic trees inferred from these alignments as referenced in the paper "Recent origin of modern clades of iron oxidizers and low clade fidelity of iron metabolisms".

Multiple sequence alignments are uploaded in the FASTA format (.afa). Phylogenetic trees are in the Newick format. For the maximum-likelihood phylogenetic trees (.figTree) and the Bayesian posterior samples of molecular clock runs (half of the .chronogram files), the Newick-format tree is embedded in a NEXUS file that stores further information about the taxa on the tree for clarity. The NEXUS files can be opened as text files or using a tree visualization program such as FigTree. The Bayesian prior samples of molecular clock runs account for the remaining half of the .chronogram files, and .datedist files include every tree sampled by the chain for each molecular clock run under the posterior.

Files including the label “alignclade” use the narrow sample, while files including “extended” use the broad sample (see Methods). In addition to the Newick file, I have also included a .png image of the phylogenetic tree under the broad sample: I would have included it among the Supplemental Figures of the manuscript, but the tree is simply too large to be legible after printing on a page.

File labelling for different clock models: “ugam” denotes an uncorrelated model with rates on each branch drawn from a gamma distribution, “ln” denotes an autocorrelated model with rates drawn from a log-normal distribution, and “cir” denotes an autocorrelated model with rates based on the CIR process. Addition of “bd” signifies that the clock was run under the birth-death tree process prior; otherwise the uniform prior was used.

Taxa in alignments and Newick trees have been labeled using protein accession numbers. For each accession number, the corresponding organism and its taxonomic information can be found in the header of the NEXUS tree files.

Tree nodes are annotated with the ultrafast bootstrap support value for the corresponding bipartition (maximum-likelihood phylogenetic trees) or the 95% posterior credible intervals for node age (Bayesian chronograms).

Files and variables

Description: Multiple sequence alignment of Cyc2 using the narrow sample

File: cyc2_alignclade.afa

Description: Multiple sequence alignment of Cyc2 using the broad sample

File: cyc2_extended.afa

Description: Multiple sequence alignment of the 16S sequences from taxa that have Cyc2

File: 16S.afa

Description: PNG image of the phylogenetic tree under the broad sample. We would have included it among the Supplemental Figures of the manuscript, but the tree is simply too large to be legible after printing on a page.

File: cyc2_extended.treefile.figTree_formatted.png

Description: Cyc2 Phylogenetic trees (in NEXUS files) built using each sample

File: cyc2_alignclade.treefile.figTree

File: cyc2_extended.treefile.figTree

Description: 16S phylogenetic tree (NEXUS file) including taxa that have Cyc2

File: iron_16S_GTR.treefile.figTree

Description: Samples of Cyc2 Bayesian molecular clock runs under the posterior

File: cyc2_ugam_alignclade_1chain_sample.chronogram

File: cyc2_ugam_bd_alignclade_1chain_sample.chronogram

File: cyc2_ln_alignclade_1chain_sample.chronogram

File: cyc2_ln_bd_alignclade_1chain_sample.chronogram

File: cyc2_cir_alignclade_1chain_sample.chronogram

File: cyc2_cir_bd_alignclade_1chain_sample.chronogram

Description: Samples of Bayesian molecular clock runs under the prior

File: cyc2_cir_bd_prior_alignclade_1chain_sample.chronogram

File: cyc2_cir_prior_alignclade_1chain_sample.chronogram

File: cyc2_ln_bd_prior_alignclade_1chain_sample.chronogram

File: cyc2_ln_prior_alignclade_1chain_sample.chronogram

File: cyc2_ugam_bd_prior_alignclade_1chain_sample.chronogram

File: cyc2_ugam_prior_alignclade_1chain_sample.chronogram

Description: Lists of every tree sampled during each Bayesian molecular clock run by the chain that was used to build the chronogram

File: cyc2_cir_bd_alignclade_1chain_sample.datedist

File: cyc2_cir_bd_prior_alignclade_1chain_sample.datedist

File: cyc2_cir_alignclade_1chain_sample.datedist

File: cyc2_ln_bd_alignclade_1chain_sample.datedist

File: cyc2_cir_prior_alignclade_1chain_sample.datedist

File: cyc2_ugam_bd_alignclade_1chain_sample.datedist

File: cyc2_ugam_bd_prior_alignclade_1chain_sample.datedist

File: cyc2_ugam_alignclade_1chain_sample.datedist

File: cyc2_ln_alignclade_1chain_sample.datedist

File: cyc2_ugam_prior_alignclade_1chain_sample.datedist

File: cyc2_ln_bd_prior_alignclade_1chain_sample.datedist

File: cyc2_ln_prior_alignclade_1chain_sample.datedist

Code/software

Multiple sequence alignment files can be opened as text files or using alignment software such as Jalview. Tree files can be viewed as text files or using tree visualization software such as FigTree.

Cyc2 sequence collection and alignment

The Cyc2 sequence in Chlorobium phaeoferrooxidans str. KB01 (protein accession number WP_076792910.1) as identified by Tsuji et al. (2020) was used to query the NCBI non-redundant protein database for homologous Cyc2 sequences using BLASTp.

Two different datasets of Cyc2 orthologs were assembled, differing in the breadth of the sampling strategy. A broad sample includes the first 500 BLAST hits returned by the search, exhaustively covering the sequence space in and around the main clades of neutrophilic iron oxidizers as well as including any clades containing possible calibration points in the broader protein family. This large sequence set showed multiple suspected misalignments on visual inspection, especially within very distantly related sequences (alignment files in Supplementary Data linked at the end of the manuscript), calling for the cross-validation of results against a smaller dataset including only the descendants of the last common ancestor of Cyc2 sequences in the clades of interest. That narrow sample included a group of 109 sequences from the broad sample, and a visual inspection of the resulting alignment showed no obvious misalignments. The alignments were made using MAFFT 7.245 with the automatic choice of alignment algorithm (“mafft --auto”).

Cyc2 phylogenetic tree search

Based on the alignments, maximum-likelihood phylogenetic trees were built using IQTree 1.6.3. The ModelFinder function was used to select the best-fitting model, and LG+R7+F (LG substitution model with seven free rate categories and equilibrium frequencies estimated from the data) was selected based on the Bayesian Information Criterion score. The support for each bipartition was estimated using ultrafast bootstraps with 1000 replicates. The relationships between taxa involved in the narrow sampling largely reflected their relationships in the tree based on the broad sampling. Nevertheless, given the potential for misalignments in the broad sample to drive spurious substitution rate inferences, only the best tree obtained for the narrow sampling scheme was used as a basis for molecular clock runs. The tree based on broad sampling was still used as a means of outgroup rooting for the narrow sample.

16S species tree construction

To explicitly compare the relationships between Cyc2 sequences to species tree relationships, 16S ribosomal RNA sequences from taxa with Cyc2 were recruited from the SILVA ribosomal RNA database. Not all taxa with Cyc2 sequences had a 16S sequence available (especially where Cyc2 sequences came from metagenomic studies); nevertheless, in most clades of iron-oxidizers the available 16S sequences covered both sides of the clade’s basal split on the Cyc2 tree.

The 16S sequences were aligned using MAFFT 7.245 [11] with the automatic choice of alignment algorithm (“mafft --auto”). Based on this alignment, a maximum-likelihood phylogenetic tree was built using IQTree 1.6.3 with GTR+G4+F+I (GTR substitution model with four gamma-distributed rate categories, equilibrium frequencies estimated from the data and the presence of invariant sites). The support for each bipartition was estimated using ultrafast bootstraps with 1000 replicates.

The tree was rooted according to bacterial species tree work in prior literature (for example, Hug et al., 2016, and Waite et al., 2017 and [15]), separating Chlorobi from rest of the diversity in this tree. The root is also consistent with that inferred from minimum ancestor deviation (MAD) analysis [16]. Note that while the best root inferred by MAD narrowly falls on a neighboring branch compared to the one that the root has been placed on, MAD only offers a very weak preference between these options: in fact, the ancestor deviation values at MAD’s best root position and at the adjacent node connecting the branches in question are identical to the third decimal point (see Supplemental Figure 2).

Bayesian molecular clock inferences

PhyloBayes 4.1 was used for the relaxed molecular clock runs. Each clock includes a uniform root prior constrained by 2.4 Ga as an older bound, reflective of the dominating presence of aerobic groups on both sides of the root – for example, Gallionella and Zetaproteobacteria use oxygen as the electron acceptor for the very iron oxidation process mediated by Cyc2. As a younger bound, we used 75 Ma, reflecting that all known lucinid bivalves host endosymbionts as adults (Lim et al., 2019) and the symbiotic relationship with the chemoautotrophic bacteria would have therefore have been established by the lucinid radiation in the late Cretaceous. Thus, as lucinid endosymbiont sequences of Cyc2 represent a subtree of the Cyc2 phylogeny to be dated, the root necessarily predates their divergence. The extreme breadth of the root prior is intended to reflect that lack of prior information on the age of Cyc2-based iron oxidation: that is the question the clock is intended to answer. Nevertheless, note that we reject a priori an Archean root, given the topology of the phylogenetic tree.

Both the birth-death prior process and a uniform prior on node ages were tested as tree priors, and each relaxed clock model was run with each of the prior processes in PhyloBayes 4.1. We included three clock models: 1) an uncorrelated model with rates on each branch drawn from a gamma distribution (“-ugam”), 2) an autocorrelated model with rates drawn from a log-normal distribution (“-ln”), and 3) an autocorrelated model with rates based on the CIR process (“-cir”). Thus, a total of six clocks was run under the posterior, and at least two chains were required to converge for each clock before the posterior distribution would be sampled. In accordance with recommendations in the PhyloBayes manual, convergence was assessed by requiring TRACECOMP values for all estimated variables to be <0.3 and the minimum effective size of the sample to be >50.

Each clock was also run under the prior without the inclusion of sequence information, in order to check the parameters of the joint prior: for example, the impact of the internal node calibrations and the tree process prior means that the joint prior on the root age is considerably narrower than the root prior alone. That also allowed us to distinguish the impact of sequence data from the constraints imposed by the prior. The prior-only runs did not differ between clocks using the same tree process prior, but a different clock model: this is expected, since the only function of the clock model is to convert sequence substitutions into time and sequence data does not impact runs under the prior.

Custom Python scripts were used to read the .datedist files from PhyloBayes, and the ggplot2 package in R was used to visualize the full variability of divergence date estimates in the sample.