Skip to main content

Data from: The Archean origin of oxygenic photosynthesis and extant cyanobacterial lineages

Cite this dataset

Fournier, Gregory et al. (2021). Data from: The Archean origin of oxygenic photosynthesis and extant cyanobacterial lineages [Dataset]. Dryad.


The record of the co-evolution of oxygenic phototrophs and the environment is preserved in three forms: genomes of modern organisms, diverse geochemical signals of surface oxidation, and diagnostic Proterozoic microfossils. When calibrated by fossils, genomic data form the basis of molecular clock analyses. However, different interpretations of the geochemical record, fossil calibrations, and evolutionary models produce a wide range of age estimates that are often conflicting. Here, we show that multiple interpretations of the cyanobacterial fossil record are consistent with an Archean origin of crown-group Cyanobacteria. We further show that incorporating relative dating information from horizontal gene transfers greatly improves the precision of these age estimates, by both providing a novel empirical criterion for selecting evolutionary models, and increasing the stringency of sampling of posterior age estimates. Independent of any geochemical evidence or hypotheses, these results support oxygenic photosynthesis evolving at least several hundred million years before the Great Oxygenation Event (GOE), a rapid diversification of major cyanobacterial lineages around the time of the GOE, and a post-Cryogenian origin of extant marine picocyanobacterial diversity.        


Species tree reconstruction

The species tree used in this study was based upon the tree published in[16] using identical taxon sampling, with the addition of the newly sequenced strains described above (SI Table 5). Homologous ribosomal protein sequences identified from these newly sequenced genomes were added to each of the 30 ribosomal protein alignments, and re-aligned with MUSCLE v3.8.31[63] and concatenated using FASconCAT v1.0[64]. Best-fitting substitution model analyses were carried out using ProtTest v. 3.4.2[65][66]. Maximum likelihood trees both including and excluding plastid sequences were made with RAxML v8.1.9[67] using the best fitting model parameters (four gamma distributed site rate categories with estimated shape parameter (a = 0.827794) and an LG substitution model). Bayesian inference trees were made with PhyloBayes 3.3[68] (C20 site specific substitution models, with default convergence criteria). The resulting trees did not substantially differ from one another or previous reconstructions using similar taxon samplings and datasets[16]. The maximum-likelihood tree was used for subsequent molecular clock analysis.  Concatenated alignments and species trees are provided in SI Dataset 4[69].

Molecular Clocks and Fossil Calibrations

      All primary fossil calibrations were applied as hard younger bound minimum age constraints (SI Table 2). Secondary calibration for the last common ancestor of plastids was applied as a range between 1050 Ma for the younger bound (the age of Bangiomorpha[30]) and 1800 Ma for the older bound, as estimated in [62]. Additional primary calibrations for plastid groups were also used, consisting of green algal and dinoflagellate fossils (SI Table 2). Other calibrations applied to outgroups are the same as in [19]. The root prior on the bacterial ancestor was normally distributed with a mean age of 3900 Ma and standard deviation of 200 Ma, in order to approximate a plausible range of hypotheses for the age of the bacterial root between 4300 Ma and 3500 Ma[19]. Hard bounds were not included on the root prior with the goal of identifying model parameters and calibrations that push the root age estimate beyond plausible constraint (i.e, the age of the Earth) as a further means of model evaluation. Hard bounds at 4.4 Ga and 3.5 Ga were included on the root prior when estimating prior and posterior age estimates in the absence of any other calibration.

Divergence time estimates were calculated with Phylobayes 4.1c, using C20 substitution models, birth-death and uniform tree priors, and uncorrelated gamma distributed, autocorrelated lognormal, and autocorrelated CIR process branch rate distribution models for both prior and posterior estimates[68]. Two chains were run for each analysis using the default settings for the automatic stopping rule. For each analysis, at least the first 20% of chains were excluded as burn-in before the sampling of prior and posterior trees and calculation of age distributions in the chronogram (1k trees for uncorrelated models, and 10k trees for autocorrelated models). Chronograms and dates output files are available for each evolutionary model and set of calibrations in SI Dataset 5[69].

Detailed HGT detection methodology and gene tree reconstructions

Bacterial COGs with protein to genome ratios less than 3:1 were used as the starting protein family dataset, to avoid spurious inferences from gene families with histories of extensive gene duplication.  For each of the 10,453 protein families examined, patristic distances between protein sequences were transformed to network weights (e-Dij, where Dij is the sum of branch lengths between homolog i and homolog j).  The Louvain method for community detection[40] was applied to this network, dividing each set of homologs into closely related sub-groups for further analysis.  For each phylum, distances between its members and the five closest sequences from each of the other phyla were calculated. Alphaproteobacteria were not included in HGT detection, as the large number of representatives in this dataset proved computationally intractable. Phyla represented by fewer than 5 taxa within a given sub-group were not examined. Inter-phyla distances were deemed to be potentially indicative of HGT events if the median distance to the closest phylum was significantly lower than that of the second-closest phylum (p=0.01), resulting in 3,949 sub-groups containing potential HGT candidates. For each sub-group identified as potentially containing an inter-phylum HGT, sequences were re-aligned using MAFFT[41] with automatic parameter selection. Trees were subsequently reconstructed using IQTree[42], under the LG+G model, and supports provided by rapid bootstrapping (n=1000) and an approximate likelihood ratio test (n=1000).  Reconstructed sub-group trees were automatically rooted using Minimal Ancestor Deviation (MAD)[43] (Tria et al., 2017).  249 gene families with nested (paraphyletic) pairs of phyla indicative of index HGTs were then automatically detected using in-house scripts based on the ETE3 python library.  Manual vetting of these index HGT candidates was then performed, identifying 38 index HGT candidates.  These candidate HGTs were further evaluated using taxon samplings that only included the local topology containing the candidate HGT (donor, recipient, and outgroup sequences), which further restricted the set of inferred index HGTs to 34 events.  This local reconstruction ensures that potentially divergent sequences elsewhere in the often-large COG subcluster trees do not have an inordinate influence on the alignment of sequences relevant to the HGT hypothesis being tested, and that the best-fitting model reflects the likelihood of substitutions directly associated with the bipiartitions relevant to the HGT inference. For these taxa, sequences were realigned and trees reconstructed in IQtree using the best-fitting evolutionary model.   Supports for inferred HGTs were estimated based on the “degenerate” bootstrap support for the nestedness of donor-recipient pairs, as multiple bipartitions representing specific donor-recipient hypotheses may each be consistent with the relative temporal ordering of the donor and recipient nodes.  These values were extracted from bipartition tables (SI Dataset 6) and reported in SI Table 4. Specific narratives for each HGT inference are included in SI Text 5.

Usage notes

SI Dataset 1: PDFs of gene trees showing index HGTs. Trees generated under LG and best-fitting models are included, with HGT donor groups highlighted in red, and HGT recipient groups highlighted in blue.  Support values show ultrafast bootstraps. All trees were generated as described in SI Text 4: Detailed HGT detection methodology.  Filenames are formatted to identify the clades constrained by the HGT (as described in SI Table 4 and shown in SI Figure 2), the COG and sequence cluster containing the HGT, the protein name, and the evolutionary model used for tree reconstruction. Alignments and treefiles for inferred index HGTs are in SI Dataset 2 for cluster trees under the LG model, and SI Dataset 3 for subcluster trees under best-fitting models.

SI Dataset 2: Data supporting index HGTs, LG model. EggNOG cluster alignments (.aln) and tree files (.figTree) for index HGTs generated under the LG substitution model. Taxonomic names, rankings and sequence identifiers are contained within the FigTree formatted treefiles.

SI Dataset 3: Data supporting index HGTs, best-fitting evolutionary models.  EggNOG subcluster alignments (.aln) and tree files (.figTree) for index HGTs generated under best-fitting substitution models. Taxonomic names, rankings and sequence identifiers are contained within the FigTree formatted treefiles.

SI Dataset 4: Data for species trees used in this study. Alignments (phylip format) and trees (newick format) for species trees used in this study, with plastids (model B) and without plastids (model D).  Sequence names in alignments and trees are decoded in SI Table 5.

SI Dataset 5: Data for divergence time analyses in this study. Chronogram tree files (.chronogram) and date outfiles (.dates) for calibrated molecular clock analyses listed in SI Table 3, and uncalibrated analyses under the root prior. Prior chronogram and date outfiles are included for BE and BH calibration schemas, and uncalibrated analyses. Sample trees with nodes labeled to decode the date outfiles are included for trees using both B and D taxonomic samplings (TreeB_node_labels_sample; TreeD_node_labels_sample).  A spreadsheet of HGT-constrained divergence times ages are provided for the BE schema under the CIR_nobd evolutionary model (as shown in Figure 2), with nodes numbered as date outfiles and TreeB_node_labels_sample files (HGT_Constrained_NodeAges_BE_CIR_nobd.xlsx).

SI Dataset 6:  Bipartition tables for index HGT gene trees. Bipartition tables showing ultrafast bootstrap supports for bipartitions within index HGT gene tree analyses (LG model cluster trees and best-fittign model subcluster trees), used to calculate degenerate support values shown in SI Table 4.  Taxa labels are decoded in SI Table 5.


National Science Foundation, Award: EAR-1615426

Simons Foundation, Award: 327126