Phylogenetic trees are central to many areas of biology, ranging from population genetics and epidemiology to microbiology, ecology, and macroevolution. The ability to summarize properties of trees, compare different trees, and identify distinct modes of division within trees is essential to all these research areas. But despite these wide-ranging applications, there is currently no common, comprehensive framework for such analyses. Here we present a graph-theoretical approach that provides such a framework. We show how to construct the spectral density profile of a phylogenetic tree from its graph Laplacian. Using ultrametric simulated trees as well as non-ultrametric empirical trees, we demonstrate that the spectral density successfully identifies various properties of the trees and clusters them into meaningful groups. Finally, we illustrate how the eigengap can identify modes of division within a given tree. As phylogenetic data continue to accumulate and to be integrated into various areas of the life sciences, we expect that this spectral graph-theoretical framework for phylogenetics will have powerful and long-lasting applications.
Figure_S1
Toy example illustrating the construction of the spectral density for a hypothetical phylogeny. Top: example phylogeny. Middle: computation of the modified graph Laplacian (MGL). Each non-diagonal element (i, j) in the MGL Λ is equal to the negative of the branch length between nodes i and j. Each diagonal element (i, i) is computed as the sum of branch lengths between node i and all other nodes j. Bottom: the spectral density is obtained by convolving the eigenvalues λ calculated from Λ with a smoothing function.
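The construction described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise node distances are assumed to be available already as a symmetric matrix, the smoothing function is assumed to be a Gaussian kernel, and the function names, the three-node toy tree, the grid, and the bandwidth are all hypothetical choices made for the example.

```python
import numpy as np

def modified_graph_laplacian(dist):
    """Build the MGL from a symmetric matrix of pairwise node distances.

    Off-diagonal entries are the negated distances; each diagonal entry
    is the sum of distances from that node to all other nodes.
    """
    off_diag = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(off_diag, 0.0)
    degree = np.diag(off_diag.sum(axis=1))
    return degree - off_diag

def spectral_density(eigenvalues, grid, bandwidth):
    """Smooth the eigenvalues with a Gaussian kernel (an assumed choice).

    The result is normalized to integrate to 1 on a uniformly spaced grid.
    """
    z = (grid[:, None] - eigenvalues[None, :]) / bandwidth
    density = np.exp(-0.5 * z**2).sum(axis=1)
    step = grid[1] - grid[0]
    return density / (density.sum() * step)

# Hypothetical toy tree: a root with two tips on branches of length 1 and 2.
# Node order: root, tip A, tip B; the tip-to-tip distance is 1 + 2 = 3.
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 3.0],
                 [2.0, 3.0, 0.0]])

mgl = modified_graph_laplacian(dist)
eigenvalues = np.linalg.eigvalsh(mgl)   # ascending order, real for symmetric input
grid = np.linspace(-5.0, 15.0, 2001)    # assumed evaluation grid
density = spectral_density(eigenvalues, grid, bandwidth=0.5)
```

Because every row of the MGL sums to zero, its smallest eigenvalue is zero; the remaining eigenvalues carry the information that the density profile summarizes.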
Figure_S2
Speciation values for diversification models. Birth-death trees were constructed according to one of six diversification models: increasing speciation, decreasing speciation, decreasing speciation below extinction; and constant speciation-extinction with (i) ancient mass-extinction (0.1 survival probability), (ii) recent mass-extinction (0.1 survival probability), or (iii) no mass-extinction. For all models, μ = 0.05.
Figure_S3
Interpreting spectral density profiles. For trees simulated under a constant birth-death model (open circles), the principal λ for the MGL is a good predictor of species richness (a) and phylogenetic diversity (b); there is a significant positive relationship between skewness and the γ statistic (c) and no significant relationship between kurtosis and the Colless index (d). The principal λ for the normalized MGL (nMGL) shows no significant relationship with species richness (a, inset) and a significantly negative relationship with phylogenetic diversity (b, inset). Only significant slopes are shown.
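As a rough sketch of the three summary statistics named in this caption, the snippet below computes the principal eigenvalue, skewness, and kurtosis from a set of eigenvalues. It assumes that skewness and kurtosis are taken as standardized sample moments of the eigenvalues themselves; the source may instead compute them on the smoothed density, so treat this as illustrative only. The function name `spectral_summary` and the input values are hypothetical.

```python
import numpy as np

def spectral_summary(eigenvalues):
    """Return the principal eigenvalue, skewness, and kurtosis.

    Assumption: skewness and kurtosis are the third and fourth
    standardized sample moments of the eigenvalues (not excess kurtosis).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    centered = lam - lam.mean()
    sd = lam.std()
    return {
        "principal": float(lam.max()),
        "skewness": float(np.mean(centered**3) / sd**3),
        "kurtosis": float(np.mean(centered**4) / sd**4),
    }

# Hypothetical eigenvalues with a long right tail.
summary = spectral_summary([1.0, 2.0, 3.0, 4.0, 10.0])

# A symmetric set of eigenvalues has zero skewness.
symmetric = spectral_summary([1.0, 2.0, 3.0])
```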
Figure_S4
Principal components analysis of simulated trees using traditional summary statistics. (a) K-medoids and (b) hierarchical clustering on principal components derived from mean branch length, branch length standard deviation, ln species richness, ln phylogenetic diversity, the Colless index, and γ calculated for 600 trees simulated under different diversification models. Both hierarchical clustering (bootstrap probability > 0.95) and k-medoids clustering (P < 0.05) extract three clusters of trees. Shape and color correspond to cluster assignment.
Figure_S5
Clustering on principal components and spectral density profiles from the nMGL. (a) Hierarchical clustering on spectral density profiles and (b) k-medoids clustering on principal components are computed as in Figure 3, except based on the nMGL. In (a), hierarchical clustering on the spectral density profile identified six clusters of trees (bootstrap probability > 0.95), each corresponding to a distinct underlying diversification model, whose pattern of speciation-extinction rate variation is summarized in the left column. In (b), only four significant clusters were identified (P < 0.05).
Figure_S6
Undersampling affects spectral density profiles. The spectral densities of 3 (out of 100) trees (solid line) simulated under (a) constant birth-death, (b) increasing speciation-rate, and (c) recent mass-extinction models and their jackknifed trees (dashed lines) at 90%, 80%, 70%, 60%, 50%, and 40% are plotted. As the tree moves further from complete, the density plot shifts left, as a result of a declining principal λ, and the shape of the spectral density becomes increasingly different from the original, notably by decreasing skewness. The mean and standard deviation of the Jensen-Shannon distance between each tree and its 100 jackknifed trees are shown in a barplot. The distance between trees increases linearly with incompleteness.
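The comparison between a tree and its jackknifed versions relies on a distance between density profiles. A minimal sketch of the Jensen-Shannon distance follows, assuming both profiles have been evaluated on the same grid and that the distance is the square root of the base-2 Jensen-Shannon divergence (the usual definition, bounded between 0 and 1). The function name and the example vectors are hypothetical.

```python
import numpy as np

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discretized densities.

    Both inputs are renormalized to sum to 1. Using log base 2 keeps
    the result in [0, 1].
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the divergence.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(divergence, 0.0)))

# Identical profiles are at distance 0; non-overlapping ones at distance 1.
same = jensen_shannon_distance([0.2, 0.5, 0.3], [0.2, 0.5, 0.3])
disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
```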
Figure_S7
The reliability of the eigengap heuristic versus MEDUSA in recovering shifts in speciation rate and diversification pattern in simulated trees. The absolute deviation of the number of shifts recovered by the eigengap (red) and MEDUSA (blue) from the known number of shifts for trees with 0–10 shifts in (a) speciation rate and (b) diversification pattern. In (a), MEDUSA performed significantly better than the eigengap (T > 2, P < 0.05) only when five or ten shifts were simulated. (a, inset) The average deviation across all trees is slightly lower for MEDUSA (2.89) than for the eigengap heuristic (3.10), although this difference is not significant (T = 1.88, P > 0.05). In (b), the eigengap heuristic outperformed MEDUSA for all trees with > 2 shifts and (b, inset) overall (T = 10.44, P < 0.01). In (a) and (b), only eigengaps supported by BIC post-hoc analysis were included in the means. Asterisks indicate a significantly lower deviation for MEDUSA (blue) or the eigengap heuristic (red).
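For readers unfamiliar with the eigengap heuristic, the following sketch shows one common way to apply it. The convention used here is an assumption, since the text does not spell it out: eigenvalues are ranked in descending order, and the number of modes is taken as the count of eigenvalues preceding the largest gap between consecutive values. The BIC post-hoc check mentioned in the caption is not reproduced. The function name and the example eigenvalues are hypothetical.

```python
import numpy as np

def eigengap(eigenvalues):
    """Return the number of eigenvalues before the largest consecutive gap.

    Assumption: eigenvalues are ranked in descending order before the
    gaps are measured.
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    gaps = lam[:-1] - lam[1:]
    return int(np.argmax(gaps)) + 1

# Hypothetical spectrum: three large eigenvalues, then a clear drop.
n_modes = eigengap([10.0, 9.5, 9.0, 2.0, 1.5, 1.0])
```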
Figure_S8
Spectral density profile summary statistics across viral strains and hosts. Boxplots for avian (red) and human (gold) strains calculated from standard (a) and normalized (b) MGLs. Grey bars indicate across-host means; asterisks denote significant differences at P < 0.05. (c) Mean differences between hosts across all strains calculated from standard MGLs.
Figure_S9
The cluster assignments for strains by country of origin. Six clusters were found based on k-medoids clustering on the standard spectral density profiles of all strains (P < 0.05). The distributions across those clusters for strains sampled from 25 countries are shown for avian (a) and human (b) hosts. Four clusters were found using the normalized profiles (P < 0.05), and the distributions are shown for each country for avian (c) and human (d) hosts. Only countries with all seven strains sampled are represented.