Phylogenomic analyses of 2,786 genes in 158 lineages support a root of the eukaryotic tree of life between opisthokonts and all other lineages
Cerón-Romero, Mario et al. (2022), Phylogenomic analyses of 2,786 genes in 158 lineages support a root of the eukaryotic tree of life between opisthokonts and all other lineages, Dryad, Dataset, https://doi.org/10.5061/dryad.vq83bk3q8
Advances in phylogenetic methods and high-throughput sequencing have allowed the reconstruction of deep phylogenetic relationships in the evolutionary history of eukaryotes. Yet, the root of the eukaryotic tree of life remains elusive. The most ‘popular’ (i.e. in textbooks and reviews) hypothesis for the root is between Unikonta (Opisthokonta + Amoebozoa) and Bikonta (all other eukaryotes), which emerged from analyses of a single gene fusion and a limited sampling of eukaryotic lineages. Subsequent highly-cited studies based on concatenation of genes supported this hypothesis with some variations or proposed a root within the Excavata. However, concatenation of genes neither considers phylogenetically-informative events (i.e. gene duplications and losses) nor provides an estimate of the root. A more recent study using gene tree-species tree reconciliation methods suggested the root lies between Opisthokonta and all other eukaryotes, but only including 59 taxa and 20 genes. Here we apply a gene tree – species tree reconciliation approach to a gene-rich and taxon-rich dataset (i.e. 2,786 gene families from two sets of ~158 diverse eukaryotic lineages) to assess the root, and we iterate each analysis 100 times to quantify tree space uncertainty. Our results estimate a root between Fungi and all other eukaryotes, or between Opisthokonta and all other eukaryotes, and reject alternative popular roots from the literature. Based on further analysis of genome size, we propose Opisthokonta + others as the most likely root. Finding the root of the eukaryotic tree of life is critical for the field of comparative biology as it allows us to understand the timing and mode of evolution of characters across the evolutionary history of eukaryotes.
Here we provide the alignments, gene trees, inputs, and outputs from our project entitled "Phylogenomic Analyses of 2,786 Genes in 158 Lineages Support a Root of The Eukaryotic Tree of Life Between Opisthokonts and All Other Lineages". Sequences and alignments were produced using the phylogenomic pipeline PhyloToL, which contains a taxon- and gene-rich database (including eukaryotes, archaea, and bacteria). These data were then used for 1) assessing the root of the eukaryotes and 2) comparing that root with previously published hypotheses. In both cases, we used the species tree - gene tree reconciliation tool iGTP. We also compared the hypotheses using the likelihood-based tool SpeciesRax.
These data are divided into four datasets based on taxon selection. For dataset SEL+, taxa were selected based on their taxonomy; for RAN+, taxa were selected randomly among the major eukaryotic clades Opisthokonta, Amoebozoa, Archaeplastida, Excavata, SAR, and some orphan lineages. Datasets SEL- and RAN- are the same as SEL+ and RAN+, but exclude microsporidians to avoid long-branch attraction caused by their fast evolutionary rates. We chose the gene families that contain at least 25 taxa representing at least four of the five major eukaryotic clades. Additionally, at least 2 of the major clades had to contain at least 2 minor clades (e.g. Glaucophytes and Rhodophyta are minor clades in the major clade Archaeplastida). In a pilot analysis, we produced an alignment and a phylogenetic tree for each gene family using the default settings of a previous version of PhyloToL (GUIDANCE V1.3.1 sequence cutoff = 0.3 and column cutoff = 0.4; RAxML quick tree with model PROTGAMMALG and no bootstraps). Then, we kept the gene families that are exclusive to eukaryotes or in which eukaryotes were monophyletic. From a total of 3,002 gene families that met our criteria, 2,786 passed the initial steps of PhyloToL when including only the data from the dataset SEL+. These 2,786 gene families were used for further analyses with all datasets.
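The taxon-sampling thresholds described above can be sketched as a small filter. This is an illustrative sketch only, not the PhyloToL implementation: the function name, the dictionary layout (taxon name mapped to a major clade and a minor clade), and the choice to count orphan lineages toward the 25-taxon minimum are all assumptions made for the example.

```python
from typing import Dict, Tuple

# The five major clades named in the text; orphan lineages fall outside this set.
MAJOR_CLADES = {"Opisthokonta", "Amoebozoa", "Archaeplastida", "Excavata", "SAR"}


def passes_taxon_criteria(
    taxa: Dict[str, Tuple[str, str]],
    min_taxa: int = 25,
    min_major: int = 4,
    min_major_with_two_minor: int = 2,
) -> bool:
    """Return True if a gene family meets the taxon-sampling thresholds.

    `taxa` maps a taxon name to (major_clade, minor_clade); this layout is
    hypothetical and chosen only for the example.
    """
    # Assumption: every taxon in the family counts toward the minimum.
    if len(taxa) < min_taxa:
        return False

    # Collect the minor clades observed within each major clade.
    minors_by_major: Dict[str, set] = {}
    for major, minor in taxa.values():
        if major in MAJOR_CLADES:
            minors_by_major.setdefault(major, set()).add(minor)

    # At least four of the five major clades must be represented.
    if len(minors_by_major) < min_major:
        return False

    # At least two major clades must each contain two or more minor clades.
    rich = sum(1 for minors in minors_by_major.values() if len(minors) >= 2)
    return rich >= min_major_with_two_minor
```

The later steps (the pilot alignment and tree, and the check that eukaryotes are monophyletic) are not covered by this sketch.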
MSAs were produced with PhyloToL (GUIDANCE V2.02 sequence cutoff = 0.3, column cutoff = 0.4, number of iterations = 5; Sela et al. 2015). The default parameters of PhyloToL include up to five iterations of GUIDANCE V2.02 with 10 bootstraps and MAFFT V7 with algorithm E-INS-i for fewer than 200 sequences or the “auto” option for more than 200 sequences, and maxiterate = 1000. Instead, here we ran up to five iterations of GUIDANCE with 20 bootstraps and the simple MAFFT algorithm FFT-NS-2. Then, we performed an additional GUIDANCE run with 100 bootstraps and the default MAFFT parameters for PhyloToL.
Gene trees were inferred with RAxML v.8.2.4 with 10 ML searches for the best ML tree (option "-# 10"), using the rapid hill-climbing algorithm (option "-f d") and no bootstrap replicates. The protein evolution model was selected during gene tree inference (option "-m PROTCATAUTO") by testing all models available in RAxML (e.g. JTT, LG, WAG), with optimization of substitution rates and of site-specific evolutionary rates, which were categorized into four distinct rate categories for greater computational efficiency.
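As a rough sketch, the RAxML options quoted above could be assembled into a command line as follows. Only "-# 10", "-f d", and "-m PROTCATAUTO" come from the description; the binary name, the file names, and the seed value are placeholders, and "-s" (alignment), "-n" (run name), and "-p" (random seed) are standard RAxML 8 options added here so that the command is complete.

```python
def raxml_command(alignment, run_name, seed=12345, binary="raxmlHPC"):
    """Build an argument list for one gene-tree search.

    `binary`, `alignment`, `run_name`, and `seed` are hypothetical values
    chosen for illustration; they are not taken from the dataset.
    """
    return [
        binary,
        "-s", alignment,        # input alignment file
        "-n", run_name,         # suffix for RAxML output files
        "-m", "PROTCATAUTO",    # automatic protein model selection
        "-f", "d",              # rapid hill-climbing ML search
        "-#", "10",             # 10 ML searches, no bootstraps
        "-p", str(seed),        # random seed for starting trees
    ]


# Example: print the command for a hypothetical gene family alignment.
print(" ".join(raxml_command("OG_example.fasta", "OG_example")))
```

The list form can be passed directly to `subprocess.run` once the paths point at real files.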
We ran 100 repetitions of iGTP analyses per dataset. However, given the complexity of the datasets and the heuristic nature of some key steps of the iGTP algorithm (e.g. gene tree rooting and initial starting species tree generation), a preliminary analysis revealed two systematic challenges with iGTP. The inferred species tree was affected by: 1) the order of the leaves in the input unrooted gene tree Newick strings (i.e. the input trees were treated as rooted even though we specified that they were not); and 2) the order of the input gene trees in the 100 replicates. Therefore, we randomly shuffled the order of the leaves in the unrooted gene trees (keeping the same topology), and randomly shuffled the order of the input gene trees in each of the 100 replicates per dataset. Here we provide the 100 input files generated for those iGTP analyses.
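The two shuffling steps can be illustrated with a minimal sketch. It is one possible reading of the procedure, not the script used to build the shared input files: it permutes the order of children at every node of a Newick string (which leaves the topology unchanged) and then permutes the order of the gene trees. The function names are hypothetical, and the parser assumes unquoted labels with no embedded commas, parentheses, or whitespace.

```python
import random


def split_top_level(s):
    """Split a string on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts


def shuffle_subtree(node, rng):
    """Recursively permute child order; labels and branch lengths stay attached."""
    if not node.startswith("("):
        return node  # a leaf, e.g. "TaxonA:0.1"
    close = node.rindex(")")
    suffix = node[close + 1:]  # internal label and/or branch length
    children = [shuffle_subtree(c, rng) for c in split_top_level(node[1:close])]
    rng.shuffle(children)
    return "(" + ",".join(children) + ")" + suffix


def shuffle_newick(newick, rng):
    """Return the same tree with the leaf order randomized."""
    body = newick.strip().rstrip(";")
    return shuffle_subtree(body, rng) + ";"


def make_replicate(gene_trees, seed):
    """Shuffle leaf order within each tree, then the order of the trees."""
    rng = random.Random(seed)
    shuffled = [shuffle_newick(t, rng) for t in gene_trees]
    rng.shuffle(shuffled)
    return shuffled
```

Calling `make_replicate(trees, seed=i)` for `i` in `range(100)` would give 100 reproducible replicates under these assumptions; using the replicate index as the seed is a choice made for this example.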
Here we also share the data generated by two analyses: 1) root assessment and 2) hypothesis testing. For the former, we allowed iGTP to calculate the most parsimonious root given our input files. For the latter, we allowed iGTP to calculate the reconciliation cost of the gene trees given the input files and species trees constrained to reflect previously published root hypotheses. The constraints are explained in the README file.
Since we removed lateral gene transfer (LGT) and contamination from our dataset using a series of filters, we applied the model UndatedDL instead of UndatedDTL, meaning that we considered only duplications and losses and ignored transfers. The command used for SpeciesRax was:
./generax --families forGeneRax/famFile --species-tree forGeneRax/spsTreeAn.newick --strategy SKIP --rec-model UndatedDL --per-family-rates --prefix An_DS1 --si-strategy EVAL
Here we share all the files needed to run SpeciesRax, including the mapping file (famFile), the species trees (spsTrees; the best iGTP constrained species trees per hypothesis), and their underlying gene trees (trees_r).
We also provide the output folder from SpeciesRax, which includes log files, event counts (duplications and losses), and statistics (i.e., reconciliation likelihood values).
* As in the iGTP analyses, in the SpeciesRax files the labels Op, Fu, Di, Un, and An refer to the five root hypotheses compared: Opisthokonta-others, Fungi-others, Discoba-others, Unikonta-Bikonta, and (Ancyromonadida + Metamonada)-others, respectively.
* For the SpeciesRax analyses, the label Un refers to Ut (Cavalier-Smith 2003).
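For readers who want to repeat the SpeciesRax evaluation for each root hypothesis, the command given above can be parameterized by hypothesis label. This is a sketch under stated assumptions: only the "An" command appears in this description, so the file pattern "spsTree<label>.newick", the prefix pattern "<label>_DS1", and the "DS1" dataset tag for the other labels are extrapolations that should be checked against the actual file names in the archive.

```python
# Labels for the five root hypotheses listed in the notes above.
HYPOTHESES = ["Op", "Fu", "Di", "Un", "An"]


def generax_command(label, dataset="DS1", input_dir="forGeneRax"):
    """Build the SpeciesRax (GeneRax binary) argument list for one hypothesis.

    The naming pattern for the species tree file and the output prefix is
    assumed from the single example command (label "An", prefix "An_DS1").
    """
    return [
        "./generax",
        "--families", f"{input_dir}/famFile",
        "--species-tree", f"{input_dir}/spsTree{label}.newick",
        "--strategy", "SKIP",          # do not re-optimize the gene trees
        "--rec-model", "UndatedDL",    # duplications and losses only
        "--per-family-rates",
        "--prefix", f"{label}_{dataset}",
        "--si-strategy", "EVAL",       # evaluate the given species tree
    ]


# Print one command per hypothesis; nothing is executed here.
for label in HYPOTHESES:
    print(" ".join(generax_command(label)))
```

With the label "An" and the defaults, the function reproduces the command shown above.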
National Human Genome Research Institute, Award: R15HG010409
National Science Foundation, Award: OCE-1924570
National Science Foundation, Award: DEB-1651908
National Science Foundation, Award: DEB-1541511