Deep-level relationships within Bacteria, Archaea, and Eukarya as well as the relationships of these three domains to each other require resolution. The ribosomal machinery, universal to all cellular life, represents a protein repertoire resistant to horizontal gene transfer (HGT), which provides a largely congruent signal necessary for reconstructing a tree suitable as a backbone for life’s reticulate history. Here, we generate a ribosomal tree of life from a robust taxonomic sampling of Bacteria, Archaea, and Eukarya to elucidate deep-level intra-domain and inter-domain relationships. Lack of phylogenetic information and systematic errors caused by inadequate models (that cannot account for substitution rate or compositional heterogeneities) or improper model selection compound conflicting phylogenetic signals from HGT and/or paralogy. Thus, we tested several models of varying sophistication on three different datasets, performed removal of fast-evolving or long-branched Archaea and Eukarya, and employed three different strategies to remove compositional heterogeneity to examine their effects on the topological outcome. Our results support a two-domain topology for the tree of life, where Eukarya emerges from within Archaea as sister to a Korarchaeota/Thaumarchaeota (KT) or Crenarchaeota/KT clade for all models under all or at least one of the strategies employed. Taxonomic manipulation allows single-matrix and certain mixture models to vacillate between two-domain and three-domain phylogenies. We find that models vary in their ability to resolve different areas of the tree of life, which does not necessarily correlate with model complexity. For example, both single-matrix and some mixture models recover monophyletic Crenarchaeota and Euryarchaeota archaeal phyla. In contrast, the most sophisticated model recovers a paraphyletic Euryarchaeota but detects two large clades that comprise the Bacteria, which were recovered separately but never together in the other models.
Overall, models recovered consistent topologies despite dataset modifications due to the removal of compositional bias, which reflects either ineffective bias reduction or robust datasets that allow models to overcome reconstruction artifacts. We recommend a comparative approach for evolutionary models to identify model weaknesses as well as consensus relationships.
aeb1 phylogenies
The Aeb1 phylogenies include representatives of all major phyla/supergroups for the three domains of life (when available). Archaea, Eukarya, and Bacteria protein sequences were aligned together in a single step (no profile alignment), and the protein alignments were trimmed individually and then concatenated into a supermatrix (resulting in 13432 positions). Aeb1 contains no “Archezoan” sequences (a polyphyletic group of long-branched eukaryotes such as Giardia, Trichomonas, Microsporidia, and Entamoeba) but has some fast-evolving eukaryotes, including Euglenozoa. Methods of inference included maximum likelihood as implemented in RAxML, PhyML-4X (Le et al., 2012), and PhyML-Structure (Le and Gascuel, 2010) or Bayesian inference as implemented in PhyloBayes v. 3.3 or 3.3e (Lartillot et al., 2009).
aeb1.alltrees.tree
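The aeb1 description mentions trimming each protein alignment individually and then concatenating the results into a supermatrix. The following is a minimal sketch of that concatenation step only, not the authors' actual pipeline. It assumes each alignment is held as a dictionary mapping taxon name to aligned sequence, and it pads taxa that lack a protein with gap characters; the padding rule, the function name `concatenate`, and the toy protein and taxon names are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

Alignment = Dict[str, str]  # taxon name -> aligned sequence (equal lengths)


def concatenate(
    alignments: List[Alignment], missing: str = "-"
) -> Tuple[Alignment, List[Tuple[int, int]]]:
    """Concatenate per-protein alignments into one supermatrix.

    Returns the supermatrix and the (start, end) column range of each
    protein, 0-based and end-exclusive.
    """
    # Union of all taxa seen in any alignment, in a stable order.
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Tuple[int, int]] = []
    offset = 0
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon without this protein is padded with gaps.
            pieces[taxon].append(aln.get(taxon, missing * width))
        partitions.append((offset, offset + width))
        offset += width
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy example with two hypothetical proteins and three hypothetical taxa.
protein_1 = {"Bacterium_A": "MKV-L", "Archaeon_B": "MRVAL", "Eukaryote_C": "MKIAL"}
protein_2 = {"Bacterium_A": "GQK", "Eukaryote_C": "GHK"}  # Archaeon_B lacks it
matrix, parts = concatenate([protein_1, protein_2])
```

Returning the column ranges alongside the supermatrix keeps the protein boundaries available for later per-protein operations.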
aeb2 phylogenies
Aeb2 includes members of Bacteria, Archaea, and Eukarya. Methods of tree inference included maximum likelihood as implemented in RAxML, PhyML-4X (Le et al., 2012), and PhyML-Structure (Le and Gascuel, 2010) or Bayesian inference as implemented in PhyloBayes v. 3.3 or 3.3e (Lartillot et al., 2009). Aeb2 includes “Archezoan” sequences (a polyphyletic group of long-branched eukaryotes such as Giardia, Trichomonas, Microsporidia, and Entamoeba); the subset “aeb2 Arch+” contains Archezoa but not Korarchaeota and Thaumarchaeota, and the subset “aeb2 KT+” contains Korarchaeota and Thaumarchaeota but not Archezoa. The Arch+ and KT+ subsets are intended to mitigate long-branch attraction (LBA) artifacts due to substitution rate saturation and allow us to assess the influence of taxonomic sampling on phylogenetic outcome. Gap removal occurred after the creation of the supermatrix (resulting in 13929 positions for aeb2 Arch+ and 13941 positions for aeb2 KT+).
aeb2.alltrees.tree
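For aeb2, gap removal is described as happening after the supermatrix was built. The record does not state which criterion or tool was used, so the sketch below only illustrates one common approach: dropping every column whose gap fraction exceeds a threshold. The function name `drop_gappy_columns`, the default threshold of 0.5, and the set of characters treated as gaps are assumptions made for illustration.

```python
from typing import Dict


def drop_gappy_columns(
    matrix: Dict[str, str],
    max_gap_fraction: float = 0.5,  # hypothetical threshold, not from the source
    gap_chars: str = "-?",
) -> Dict[str, str]:
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction."""
    taxa = list(matrix)
    if not taxa:
        return {}
    n_cols = len(matrix[taxa[0]])
    if any(len(matrix[taxon]) != n_cols for taxon in taxa):
        raise ValueError("all sequences must have the same length")
    keep = []
    for col in range(n_cols):
        gaps = sum(matrix[taxon][col] in gap_chars for taxon in taxa)
        if gaps / len(taxa) <= max_gap_fraction:
            keep.append(col)
    return {taxon: "".join(matrix[taxon][c] for c in keep) for taxon in taxa}


# Toy supermatrix: the third column is all gaps, the second has one gap.
toy = {"A": "MK-VL", "B": "M--AL", "C": "MR-AL"}
trimmed = drop_gappy_columns(toy)
strict = drop_gappy_columns(toy, max_gap_fraction=0.0)
```

Because columns are filtered on the concatenated matrix, the final position count depends on which taxa are present, which is consistent with the Arch+ and KT+ subsets reporting different lengths.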
aeb3 phylogenies
Methods of inference included maximum likelihood as implemented in RAxML, PhyML-4X (Le et al., 2012), and PhyML-Structure (Le and Gascuel, 2010) or Bayesian inference as implemented in PhyloBayes v. 3.3 or 3.3e (Lartillot et al., 2009). Aeb3 comprises representatives from each phylum/supergroup with no sequences from Archezoa or Euglenozoa and trimming performed after the construction of the supermatrix (resulting in 17876 positions). The removal of the fast-evolving Euglenozoa represents an attempt to further mitigate LBA artifacts. To examine the effects of missing data and specifically test how missing data influences the relationship between Archaea and Eukarya, ribosomal proteins shared by Archaea and Eukarya were deleted from some archaeal taxa.
aeb3.alltrees.tree
aeb1 alignment
aeb1 alignment in FASTA format
aeb1.fas
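A reader who wants to check an alignment file against the position counts reported in the descriptions could start from a sketch like the one below. It assumes the `.fas` files are plain multi-line FASTA with one record per taxon; the helper names `parse_fasta` and `alignment_length` are illustrative and not part of the deposited data.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a dict of header -> sequence."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # full header line is used as the key
            if name in chunks:
                raise ValueError(f"duplicate header: {name}")
            chunks[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


def alignment_length(records: Dict[str, str]) -> int:
    """Return the shared sequence length, or raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("records do not form a rectangular alignment")
    return lengths.pop()
```

Passing an open file handle for `aeb1.fas` to `parse_fasta` and the result to `alignment_length` would give a number to compare with the position count stated in the aeb1 description.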
aeb2 Arch+ alignment
aeb2_Arch.fas
aeb2 KT+ alignment
aeb2_KT.fas
aeb3 alignment
Aeb3 comprises representatives from each phylum/supergroup with no sequences from Archezoa or Euglenozoa and trimming performed after the construction of the supermatrix (resulting in 17876 positions). The removal of the fast-evolving Euglenozoa represents an attempt to further mitigate LBA artifacts. To examine the effects of missing data and specifically test how missing data influences the relationship between Archaea and Eukarya, ribosomal proteins shared by Archaea and Eukarya were deleted from some archaeal taxa.
aeb3.fas
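The aeb3 description states that ribosomal proteins shared by Archaea and Eukarya were deleted from some archaeal taxa to test the effect of missing data. The sketch below shows one way such a deletion could be represented in a supermatrix, by overwriting the columns of selected proteins with a missing-data character for selected taxa. The partition table, the taxon and protein names, and the choice of "-" as the missing-data symbol are all hypothetical; the record does not say how the deletion was carried out.

```python
from typing import Dict, Iterable, Tuple


def mask_partitions(
    matrix: Dict[str, str],
    partitions: Dict[str, Tuple[int, int]],  # protein -> (start, end), end-exclusive
    taxa: Iterable[str],
    proteins: Iterable[str],
    missing: str = "-",
) -> Dict[str, str]:
    """Return a copy of matrix with the given proteins blanked for the given taxa."""
    protein_list = list(proteins)
    masked = dict(matrix)  # the input matrix is left unchanged
    for taxon in taxa:
        seq = list(masked[taxon])
        for protein in protein_list:
            start, end = partitions[protein]
            if not 0 <= start <= end <= len(seq):
                raise ValueError(f"partition for {protein} is out of range")
            seq[start:end] = missing * (end - start)
        masked[taxon] = "".join(seq)
    return masked


# Toy example: blank the hypothetical protein "S3" in one archaeal taxon.
toy_matrix = {"Archaeon_1": "MRVALGHK", "Eukaryote_1": "MKIALGHK"}
toy_partitions = {"L2": (0, 5), "S3": (5, 8)}
masked = mask_partitions(toy_matrix, toy_partitions, ["Archaeon_1"], ["S3"])
```

Keeping the original matrix untouched makes it straightforward to infer trees from both versions and compare the placement of Eukarya relative to Archaea.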