Robust analysis of phylogenetic tree space
Data files
Aug 29, 2022 version; files total 16.65 GB
ClusterMapping.zip (9.90 MB)
CombinedTrees.zip (33.31 MB)
InherentDims.zip (55.77 KB)
Manuscript_data.zip (338.74 KB)
NEXUS.zip (152.44 KB)
README.md (12.80 KB)
StratCongruenceResults.zip (59.45 MB)
TreeSpaceBurnin_A.zip (254.99 MB)
TreeSpaceBurnin_B.zip (2.17 GB)
TreeSpaceBurnin_C.zip (967.59 MB)
TreeSpaceBurnin_D.zip (997.08 MB)
TreeSpaceBurnin_E.zip (243.37 MB)
TreeSpaceBurnin_F.zip (1.45 GB)
TreeSpaceBurnin_G.zip (935.65 MB)
TreeSpaceBurnin_H.zip (673.83 MB)
TreeSpaceBurnin_K.zip (339.23 MB)
TreeSpaceBurnin_L.zip (685.74 MB)
TreeSpaceBurnin_M.zip (1.74 GB)
TreeSpaceBurnin_N.zip (300.23 MB)
TreeSpaceBurnin_P.zip (1.18 GB)
TreeSpaceBurnin_R.zip (826.01 MB)
TreeSpaceBurnin_S.zip (1.64 GB)
TreeSpaceBurnin_T.zip (638.60 MB)
TreeSpaceBurnin_V.zip (375.85 MB)
TreeSpaceBurnin_W.zip (245.55 MB)
TreeSpaceBurnin_X.zip (301.61 MB)
TreeSpaceBurnin_Y.zip (255.25 MB)
TreeSpaceBurnin_Z.zip (133.24 MB)
WhichProjection.zip (202.77 MB)
Abstract
Phylogenetic analyses often produce large numbers of trees. Mapping trees’ distribution in “tree space” can illuminate the behavior and performance of search strategies, reveal distinct clusters of optimal trees, and expose differences between different data sources or phylogenetic methods—but the high-dimensional spaces defined by metric distances are necessarily distorted when represented in fewer dimensions. Here, I explore the consequences of this transformation in phylogenetic search results from 128 morphological data sets, using stratigraphic congruence—a complementary aspect of tree similarity—to evaluate the utility of low-dimensional mappings. I find that phylogenetic similarities between cladograms are most accurately depicted in tree spaces derived from information-theoretic tree distances or the quartet distance. Robinson–Foulds tree spaces exhibit prominent distortions and often fail to group trees according to phylogenetic similarity, whereas the strong influence of tree shape on the Kendall–Colijn distance makes its tree space unsuitable for many purposes. Distances mapped into two or even three dimensions often display little correspondence with true distances, which can lead to profound misrepresentation of clustering structure. Without explicit testing, one cannot be confident that a tree space mapping faithfully represents the true distribution of trees, nor that visually evident structure is valid. My recommendations for tree space validation and visualization are implemented in a new graphical user interface in the “TreeDist” R package.
Projections of the 128 datasets from Wright & Lloyd (2020, doi:10.1111/pala.12500) into six dimensions, with evaluations of projection quality, clusterings and correlation with stratigraphic fit.
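As a rough illustration of what evaluating projection quality can involve, the sketch below correlates the original pairwise distances with the Euclidean distances between mapped points. It is a stand-in under stated assumptions, not the quality measure used to produce these files (that is documented in the README.md); the function names and the toy four-point example are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mapping_fidelity(original, coords):
    """Correlate original distances with distances between mapped points.

    original: square matrix (list of lists) of pairwise distances.
    coords:   one coordinate tuple per item, in the same order.
    """
    pairs = list(combinations(range(len(coords)), 2))
    true_d = [original[i][j] for i, j in pairs]
    mapped_d = [euclidean(coords[i], coords[j]) for i, j in pairs]
    return pearson(true_d, mapped_d)


# Toy data (hypothetical): four items whose true distances are those of
# the corners of a unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
original = [[euclidean(p, q) for q in corners] for p in corners]

# A mapping that keeps both dimensions preserves every distance.
faithful = mapping_fidelity(original, corners)

# Dropping one dimension collapses some pairs onto the same point.
flattened = mapping_fidelity(original, [(x,) for x, _ in corners])

print(f"two dimensions: {faithful:.3f}; one dimension: {flattened:.3f}")
```

The correlation is close to 1 when the mapping preserves the original distances and falls as dimensions are discarded, which is the kind of distortion the abstract warns about.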
File names detail the dataset, tree distance method, and projection method.
Abbreviations:
cid[R], clustering information distance, treating trees as unrooted (cid) / rooted (cidR);
es, split size vector metric;
kc, Kendall–Colijn distance;
msid, matching split information distance;
path, path distance;
pid, phylogenetic information distance;
qd[R], quartet distance, treating trees as unrooted (qd) / rooted (qdR);
rf, Robinson–Foulds distance;
cca, curvilinear component analysis;
dif, diffusion mapping;
ks1, Kruskal-1 MDS;
leim, Laplacian eigenmapping;
nls, Sammon MDS;
pco, principal coordinates analysis (classical MDS);
tsne, t-distributed stochastic neighbour embedding.
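When working through many files, it can help to keep the abbreviations above in a lookup table. The following is a minimal sketch: the dictionary and function names are hypothetical, and because the exact file-name pattern is not stated here, it decodes individual codes rather than whole file names.

```python
# Tree distance codes, as listed above.
DISTANCES = {
    "cid": "clustering information distance (unrooted)",
    "cidR": "clustering information distance (rooted)",
    "es": "split size vector metric",
    "kc": "Kendall–Colijn distance",
    "msid": "matching split information distance",
    "path": "path distance",
    "pid": "phylogenetic information distance",
    "qd": "quartet distance (unrooted)",
    "qdR": "quartet distance (rooted)",
    "rf": "Robinson–Foulds distance",
}

# Projection method codes, as listed above.
PROJECTIONS = {
    "cca": "curvilinear component analysis",
    "dif": "diffusion mapping",
    "ks1": "Kruskal-1 MDS",
    "leim": "Laplacian eigenmapping",
    "nls": "Sammon MDS",
    "pco": "principal coordinates analysis (classical MDS)",
    "tsne": "t-distributed stochastic neighbour embedding",
}


def describe(code):
    """Return (kind, description) for a single abbreviation."""
    if code in DISTANCES:
        return ("tree distance", DISTANCES[code])
    if code in PROJECTIONS:
        return ("projection method", PROJECTIONS[code])
    raise KeyError(f"unknown abbreviation: {code!r}")


for code in ("cid", "pco"):
    kind, text = describe(code)
    print(f"{code}: {kind}, {text}")
```

Codes are case-sensitive in this sketch, so `qdR` and `qd` resolve to different entries.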
Trees were sourced from https://github.com/graemetlloyd/ProjectWhalehead/tree/master/Data/CombinedTrees
References to the original studies are listed at https://github.com/graemetlloyd/ProjectWhalehead/tree/master/Data/XML.
Underlying data and scripts necessary for reproduction are included as described in the README.md file.