Dating the bacterial tree of life based on ancient symbiosis
Data files
Dec 06, 2024 version files 43.81 GB
- a_more_detailed_readme.zip: 510.94 KB
- DATING.zip: 36.46 GB
- hessian_in_BV.zip: 3.24 GB
- mcmc3r.zip: 740.51 MB
- phylo.zip: 78.87 MB
- PTL.zip: 3.29 GB
- README.md: 5.54 KB
Dec 10, 2024 version files 43.81 GB
- a_more_detailed_readme.zip: 510.82 KB
- DATING.zip: 36.46 GB
- hessian_in_BV.zip: 3.24 GB
- mcmc3r.zip: 740.51 MB
- phylo.zip: 78.87 MB
- PTL.zip: 3.29 GB
- README.md: 5.98 KB
Abstract
Obtaining a timescale for bacterial evolution is crucial to understanding early life evolution but is difficult owing to the scarcity of bacterial fossils and the absence of maximum age constraints of the available fossils. Here, we introduce multiple new time constraints to calibrate bacterial evolution based on ancient symbiosis. This idea is implemented using a bacterial tree constructed with mitochondria-originated genes where the mitochondrial lineage representing eukaryotes is embedded within Proteobacteria, such that the date constraints of eukaryotes established by their abundant fossils are propagated to ancient co-evolving bacterial symbionts and across the bacterial tree of life. Importantly, we formulate a new probabilistic framework that considers uncertainty in inference of the ancestral lifestyle of modern symbionts to apply 19 relative time constraints (RTC) each informed by host-symbiont association to constrain bacterial symbionts no older than their eukaryotic host. Moreover, we develop an approach to incorporating substitution mixture models that better accommodate substitutional saturation and compositional heterogeneity for dating deep phylogenies. Our analysis estimates that the last bacterial common ancestor (LBCA) occurred approximately 4.0-3.5 billion years ago (Ga), followed by rapid divergence of major bacterial clades. It is generally robust to alternative root ages, root positions, tree topologies, fossil ages, ancestral lifestyle reconstruction, and gene sets, among other factors. The obtained time tree serves as a foundation for testing hypotheses regarding bacterial diversification and its correlation with geobiological events across different timescales.
README: Dating the bacterial tree of life based on ancient symbiosis
https://doi.org/10.5061/dryad.1c59zw42s
More detailed introductions for each folder and data set are given in a_more_detailed_readme.zip.
Description of the data and file structure
Original data and results for the study "Dating the bacterial tree of life based on ancient symbiosis". Supplemental data are available at https://zenodo.org/records/14348151 (see Related works).
The data are organized into the following archives:
DATING
Relevant figures and tables: Figs. 2-3, Figs. S15-S19.
Included in this folder are the results of the MCMCtree analysis.
- bootstrap/: using the focal strategy (some alternatives included)
- bootstrap.root2/: with different root placements
- date-alt/: analyses under various alternative dating schemes to explore the uncertainty associated with divergence time estimation
- date-alt.dup/: run2 of the MCMC analysis included in date-alt/
- date-fg/: the full gene set instead of 19 “high-quality” genes
- deltaLL/: using the 5, 10, or 15 top-ranking genes out of the 19 mitochondrial genes, ranked either by the relative rate difference between mitochondrial and bacterial lineages or by ΔLL, which measures the degree of species-tree and gene-tree incongruence (see Data S3)
The MCMCtree results in these folders share a similar structure, which is detailed in the “Readme” file in the directory “bootstrap/”.
The result of the focal analysis can be found in “bootstrap\1-pf\dating\C60\mcmctree-rate\rrtc\official-mcmc987\marginal”.
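The focal-result path above is written with Windows-style backslashes. As a minimal sketch, the following shows one way to turn it into a path that works on any operating system. It assumes that DATING.zip has been extracted into a local folder named `DATING`; that folder name and the helper `focal_result_dir` are illustrative and not part of the dataset.

```python
from pathlib import Path, PureWindowsPath

# Relative path of the focal analysis, copied verbatim from the README.
FOCAL_RESULT = r"bootstrap\1-pf\dating\C60\mcmctree-rate\rrtc\official-mcmc987\marginal"


def focal_result_dir(dating_root):
    """Join the extraction folder with the focal-result path, one component at a time."""
    # PureWindowsPath splits on backslashes regardless of the host OS.
    parts = PureWindowsPath(FOCAL_RESULT).parts
    return Path(dating_root).joinpath(*parts)


# "DATING" is an assumed extraction folder, not a name guaranteed by the archive.
print(focal_result_dir("DATING").as_posix())
```

Splitting the path into components first avoids treating the backslashes as part of a single file name on Linux or macOS.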
PTL
Relevant figures and tables: Figs. S2, S10, S11; Note S3.
Ancestral state reconstruction (ASR) of the hosts/lifestyles of seven selected bacterial symbiont lineages.
- full/: main analysis
Coleman60
Relevant figures and tables: Fig. S6, Table S2.
This folder contains the IQ-Tree phylogenomic tree construction. All 265 bacterial genomes used in the prior study (Coleman et al., 2021) are included. Individual alignments of the 60 genes used in that study can be found in ./pep/. See Table S2 for the details of each scheme.
Files used and generated in IQ-Tree analysis are given in tree-CxxPMSF-RML/ where Cxx represents C20, C40, or C60.
- combined.aln: concatenated alignment (after trimming)
- aln/: original sequence alignments
- trimal/: trimmed sequence alignments
- phylo/iqtree/: IQ-Tree analysis. The command used can be found in the ".log" file, and the ML tree is given in the "iqtree.treefile" file. For the NONREV model analysis, you may also see the file "REV.treefile", which is the guide tree constructed using a reversible model.
hessian_in_BV
Relevant figures and tables: Fig. 1B, Figs. S12-S13, Note S4.
This folder contains the simulation analysis of the bootstrap-based Hessian approximation in MCMCtree analysis using bs_inBV (https://github.com/evolbeginner/bs_inBV) or CODEML (see the GitHub repository for more details). Briefly, a timetree is simulated first; a phylogeny whose branch lengths are the expected number of substitutions per site is then simulated under a relaxed clock model; finally, a sequence alignment is simulated under a given model. For more details, please see Note S4 in the original paper.
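The second simulation step (timetree to phylogeny) can be illustrated with a minimal sketch. It assumes an independent-rates relaxed clock in which every branch draws its own lognormal rate; the clock model and parameter values actually used are described in Note S4 and may differ. The function name, the toy branch names, and all numbers below are hypothetical.

```python
import math
import random


def relaxed_clock_branch_lengths(durations, mean_rate, sigma, rng):
    """Convert branch durations (time units) into expected substitutions per site.

    Assumption: each branch draws an independent lognormal rate whose
    expectation is `mean_rate` and whose log-scale standard deviation is `sigma`.
    """
    # Shift the log-scale mean so that E[rate] equals mean_rate.
    mu = math.log(mean_rate) - 0.5 * sigma ** 2
    lengths = {}
    for branch, duration in durations.items():
        rate = rng.lognormvariate(mu, sigma)
        lengths[branch] = rate * duration  # substitutions per site
    return lengths


# Hypothetical toy timetree: branch name -> duration.
durations = {"root-A": 1.5, "root-B": 1.0, "B-C": 0.5}
rng = random.Random(1)  # fixed seed so the sketch is reproducible
print(relaxed_clock_branch_lengths(durations, mean_rate=0.2, sigma=0.3, rng=rng))
```

With `sigma=0` every branch evolves at exactly `mean_rate`, which recovers a strict clock and is a convenient sanity check.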
The three folders included correspond to different models used in the sequence evolution simulation.
i) LG+G{1.0}: LG substitution model and across-site rate variation under a four-category discrete gamma distribution with the shape parameter α=1.0, i.e., Gamma(1.0, 1.0), meaning that the mean and variance of the relative rates across sites are 1/1 = 1 and 1/1^2 = 1, respectively (Yang 1994).
ii) LG+G{1.0}+C40: LG substitution model plus a mixture of amino acid site frequency profiles C40 and across-site rate variation under a four-category discrete gamma distribution with α=1.0.
iii) LG+G{0.5}+C40: LG substitution model plus a mixture of amino acid site frequency profiles C40 and across-site rate variation under a four-category discrete gamma distribution with α=0.5, which reflects more among-site rate variation than using α=1.0.
For each of the above models, you will see 30 independent simulations/comparisons based on which many further analyses are performed.
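To make the effect of the shape parameter concrete, the sketch below computes the four category rates of a discrete Gamma(α, α) distribution for α=1.0 and α=0.5 using only the Python standard library. It assumes equal-probability categories represented by their mean rate, one of the options described by Yang (1994); whether the simulation software used the mean or the median is not stated here. The function names are illustrative and do not come from bs_inBV or PAML.

```python
import math


def reg_lower_gamma(a, x, max_terms=500):
    """Regularized lower incomplete gamma function P(a, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_quantile(p, shape, rate):
    """Quantile of Gamma(shape, rate), found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, rate * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, rate * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k=4):
    """Mean rate of each of k equal-probability categories of Gamma(alpha, alpha)."""
    cuts = [0.0] + [gamma_quantile(i / k, alpha, alpha) for i in range(1, k)]
    # Share of the overall mean (which is 1) lying below each cut point.
    cum = [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (cum[i + 1] - cum[i]) for i in range(k)]


for alpha in (1.0, 0.5):
    print(alpha, [round(r, 4) for r in discrete_gamma_rates(alpha)])
```

The category rates always average to 1, but with α=0.5 the slowest category becomes slower and the fastest becomes faster than with α=1.0, which is the sense in which model iii reflects more among-site rate variation.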
mcmc3r
Relevant figures and tables: Fig. S5.
Data for clock model selection using mcmc3r.
MCMC3R.tar.xz: used for eukaryote divergence time estimation as the 1st step of the Bayesian sequential molecular clock analysis.
bac_20genome/: used for the bacterial timetree analysis (2nd step of the Bayesian sequential molecular clock analysis), with 20 genomes.
bac_40genome/: used for the bacterial timetree analysis (2nd step of the Bayesian sequential molecular clock analysis), with 40 genomes.
In the folder of each gene, there are four folders:
- date/: MCMCtree analysis results
- nucl/: recoded nucleotide sequences
- pep/: original amino acid sequences
- tree.sp/: original and trimmed alignment
Sharing/Access information
The sources of the genomes and their accessions are listed in Table S1.
Code/Software
We have made the following two scripts freely available on GitHub.
bs_inBV: A tool to help generate the file in.BV when using "complex" substitution models for MCMCtree. https://github.com/evolbeginner/bs_inBV
pRTC: A script for dating based on probability-based relative time constraints (pRTC). https://github.com/evolbeginner/rrtc
Change log
10-Dec-2024 - A description of the state information contained in the file traits.txt for PTL/ was added to the file "PTL_readme.docx" in a_more_detailed_readme.zip. The Zenodo link to all Supplemental Data was added in the Related Works section.