Expectation-Maximization enables phylogenetic dating under a Categorical Rate Model
Mai, Uyen; Charvel, Eduardo; Mirarab, Siavash (2022), Expectation-Maximization enables phylogenetic dating under a Categorical Rate Model, Dryad, Dataset, https://doi.org/10.5061/dryad.pk0p2ngs0
Dating phylogenetic trees to obtain branch lengths in units of time is essential for many downstream applications but has remained challenging. Dating requires inferring mutation rates that can change across the tree. While we can assume that information is available for a small subset of nodes, from the fossil record or from sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a clock model that defines a distribution over rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades under model misspecification, where the assumed parametric clock model differs vastly from the true rate distribution. Notably, existing methods tend to assume rigid, often unimodal rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to co-estimate the rate categories and the branch lengths in units of time. Our model makes fewer assumptions about the true clock model than parametric models such as the Gamma or LogNormal distributions. Our results on two simulated and real datasets of Angiosperms and HIV and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with nonmodal or multimodal clock models.
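To make the idea of a categorical rate model concrete, the following is a minimal sketch of an EM loop over k rate categories. It is not the MD-Cat implementation and makes several simplifying assumptions that are not taken from the abstract: the branch durations `tau` are held fixed (MD-Cat co-estimates them), each observed branch length `b[i]` is modeled as Gaussian around rate times duration with a known standard deviation `sigma`, and the k categories carry equal prior weight. The function name `em_categorical_rates` and all parameter names are hypothetical.

```python
import math


def em_categorical_rates(b, tau, k=3, sigma=1e-3, n_iter=100):
    """Toy EM for k rate categories with branch durations held fixed.

    b     : observed branch lengths (substitutions per site)
    tau   : branch durations in time units (assumed known in this sketch)
    sigma : assumed Gaussian noise level on b (an illustrative choice)
    Returns (omega, post): the k category rates and, for every branch,
    the posterior probability of each category.
    """
    n = len(b)
    # Start the categories at evenly spaced quantiles of the naive rates b/tau.
    naive = sorted(bi / ti for bi, ti in zip(b, tau))
    omega = [naive[min(n - 1, int((j + 0.5) * n / k))] for j in range(k)]
    post = []
    for _ in range(n_iter):
        # E-step: posterior over categories for each branch, assuming
        # equal prior weights 1/k and Gaussian noise on b.
        post = []
        for bi, ti in zip(b, tau):
            logw = [-0.5 * ((bi - w * ti) / sigma) ** 2 for w in omega]
            top = max(logw)  # subtract the max for numerical stability
            unnorm = [math.exp(v - top) for v in logw]
            total = sum(unnorm)
            post.append([v / total for v in unnorm])
        # M-step: each rate is a posterior-weighted least-squares fit
        # of b against tau.
        for j in range(k):
            num = sum(q[j] * bi * ti for q, bi, ti in zip(post, b, tau))
            den = sum(q[j] * ti * ti for q, ti in zip(post, tau))
            if den > 0:
                omega[j] = num / den
    return omega, post
```

Calling `em_categorical_rates(b, tau, k=2)` on branches generated from two distinct rates should return two category rates close to those values, together with a per-branch posterior over the categories. Extending the M-step to also update `tau` under the calibration constraints is the harder part of the problem and is deliberately left out of this sketch.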
National Institutes of Health, Award: 1R35GM142725