Paleontological systematics relies heavily on morphological data that have undergone decay and fossilization. Here, we apply a heuristic means to assess how a fossil's incompleteness detracts from inferring its phylogenetic relationships. We compiled a phylogenetic matrix for primates and simulated the extinction of living species by deleting an extant taxon's molecular data and keeping only those morphological characters present in actual fossils. The choice of characters present in a given living taxon (the subject) was defined by those present in a given fossil (the template). By measuring congruence between a well-corroborated phylogeny and those incorporating artificial fossils, and by comparing real vs. random character distributions and states, we tested the information content of paleontological datasets and determined whether extinction of a living species leads to bias in phylogeny reconstruction. We found a positive correlation between fossil completeness and topological congruence. Real fossil templates sampled for 36 or more of the 360 available morphological characters (including dental) performed significantly better than similarly complete templates with random states. Templates dominated by only one partition performed worse than templates with randomly sampled characters across partitions. The template based on the Eocene primate Darwinius masillae performs better than most other templates with a similar number of sampled characters, likely due to preservation of data across multiple partitions. Our results support the interpretation that Darwinius is strepsirhine, not haplorhine, and suggest that paleontological datasets are reliable in primate phylogeny reconstruction.
Dataset S1. Combined morphology-DNA data matrix in nexus format
We compiled a dataset of morphological and molecular characters, sampled at genus level for 24 extant primates and two outgroups. Genera represented in previous studies (Seiffert et al. 2009; Springer et al. 2012) by multiple species were condensed into single terminals in order to minimize missing data. In cases where different species within a genus exhibited different character states, we coded the genus as polymorphic for that character. Condensing taxa to genus-level also has the advantage of improving the tractability of both MP and Bayesian phylogenetic analysis by slightly reducing the number of terminals.
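The genus-level condensation described above can be illustrated with a minimal sketch. It is not the procedure actually used to build the matrix; the function name `condense_to_genus` and the `(01)` notation for polymorphism are assumptions made for illustration, as is the choice to let an observed state in one species override missing data in another.

```python
def condense_to_genus(species_codings):
    """Merge several species' codings of one character into one genus coding.

    Each coding is a string: a single state ("0", "1", ...), a missing-data
    symbol ("?" or "-"), or a polymorphic coding such as "(01)".
    Assumption: missing entries are ignored whenever any species in the
    genus has an observed state, which minimizes missing data.
    """
    observed = set()
    for coding in species_codings:
        for symbol in coding:
            # Keep only state symbols; skip "?", "-", brackets, and spaces.
            if symbol.isalnum():
                observed.add(symbol)
    if not observed:
        return "?"  # no species in the genus has an observed state
    if len(observed) == 1:
        return next(iter(observed))
    # Species disagree: code the genus as polymorphic.
    return "(" + "".join(sorted(observed)) + ")"


# Hypothetical codings for one character in two species of the same genus.
print(condense_to_genus(["0", "1"]))  # species disagree
print(condense_to_genus(["?", "1"]))  # one species is missing data
```

Under these assumptions, a genus is left as "?" only when every constituent species lacks data for that character.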
We used the alignment of Springer et al. (2012; Treebase accession #S13451) as our molecular dataset, consisting of 61199 nucleotide characters distributed across 69 nuclear and 10 mitochondrial genes. This alignment comprises part of our morphology-DNA data matrix in our supplementary data file S1.
S1_combined.nex
Dataset S2. Morphological data matrix in nexus format
Our morphological dataset was derived primarily from Seiffert et al. (2009), updated by Gladman et al. (2013) and Boyer and Seiffert (2013), and enabled us to sample 85 fossil taxa. In order to improve overlap with the available DNA sequences, several extant taxa were added: Callicebus sp., Cebus sp., Chlorocebus aethiops, Colobus sp., Daubentonia madagascariensis, Hylobates sp., Macaca sp., and a composite Dermoptera consisting of Cynocephalus volans and Galeopithecus variegatus, treated as a single taxon. Characters were coded from direct observations of museum specimens housed at the University Museum of Zoology, Cambridge, based on the descriptions in the matrix of Seiffert et al. (2009), supplemented with images available from www.digimorph.org, and data from Luckett (1976), Wible and Covert (1987), Beard et al. (1988), Dagosto (1990), Yoder (1994), Ross et al. (1998), Gebo (2001), and Pilbeam (2004). Our morphology matrix is available in nexus format as supplementary data (S2) and from http://www.datadryad.org. Postcranial data for Callicebus moloch and Chlorocebus aethiops were derived primarily from Ross et al. (1998) and Yoder (1994), respectively. Alouatta seniculus and Pan troglodytes were added to the matrix using data from Boyer and Seiffert (2013) and Seiffert (pers. comm.). Facial vibrissae (Yoder 1994: character 61) were coded for new taxa according to the presence of vibrissae musculature as reported in Muchlinski et al. (2013). Not all characters have been treated consistently by previous investigators, nor were all of them described and illustrated in enough detail to enable coding of new taxa. Where the anatomical basis for a particular coding decision was not clear to us, we left previous codings as-is and coded our new taxa as "?".
S2_morph.nex
artEx
A script for making artificial fossils for use in artificial extinction analyses. artEx is available on GitHub at https://github.com/davipatti/artEx
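The idea of an artificial fossil, as described in the abstract, can be sketched in a few lines. This is an illustrative sketch only and does not reproduce the artEx code or its interface; the function name `make_artificial_fossil`, the list-of-codings representation, and the decision to treat "-" as missing alongside "?" are all assumptions.

```python
MISSING = {"?", "-"}  # assumption: gaps are treated as missing data


def make_artificial_fossil(subject_morph, template_morph, n_molecular):
    """Degrade a living taxon (subject) to the completeness of a fossil (template).

    subject_morph, template_morph: lists of character codings of equal
    length, one entry per morphological character.
    n_molecular: number of molecular characters scored for the subject.
    Returns (morphology, molecular) for the artificial fossil.
    """
    if len(subject_morph) != len(template_morph):
        raise ValueError("subject and template must share the same characters")
    # Keep the subject's own state only where the template fossil has data.
    morphology = [
        state if template_state not in MISSING else "?"
        for state, template_state in zip(subject_morph, template_morph)
    ]
    # Simulated extinction: all molecular data are deleted.
    molecular = ["?"] * n_molecular
    return morphology, molecular


# Hypothetical five-character example.
subject = ["0", "1", "(01)", "2", "1"]
template = ["1", "?", "0", "-", "?"]
morph, mol = make_artificial_fossil(subject, template, n_molecular=4)
print(morph)
print(mol)
```

Note that the template contributes only the pattern of missing entries; the states themselves always come from the subject.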
Table S1.
Table S1: Key linking X-axis of Fig. 8 to taxon names.
tableS1_taxonNumbers.xls
Figure S1.
Fig. S1: Asymmetric means of calculating topological similarity. AFT-ECT represents the number of splits in the extant combined topology shared with artificial fossil topologies (also given in Fig. 1). ECT-AFT represents the number of splits in artificial fossil topologies shared with the extant combined topology. Number of morphological characters (X-axis) corresponds to 0-100% complete (out of 360 total) in 10% intervals.
figS1.ai
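Counting the splits shared by two topologies, as in Fig. S1, can be illustrated with a small sketch. It assumes trees are given as nested tuples of taxon names, which is a representation chosen here for brevity and not the format used in the study. For a single pair of trees the raw number of shared splits is the same in either direction, so the sketch also returns how many splits the reference tree has in total; dividing by that total gives a proportion that differs between directions when the two trees are not equally resolved. Whether the figure normalizes in this way is an assumption.

```python
def _collect_leaf_sets(tree, collected):
    """Return the leaf set of `tree`, appending every internal node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_collect_leaf_sets(c, collected) for c in tree))
    collected.append(leaves)
    return leaves


def splits(tree):
    """Return the non-trivial splits (unrooted clades) of a nested-tuple tree."""
    collected = []
    all_leaves = _collect_leaf_sets(tree, collected)
    reference_leaf = min(all_leaves)
    result = set()
    for side in collected:
        other = all_leaves - side
        if len(side) < 2 or len(other) < 2:
            continue  # trivial split: present in every tree
        # Store each split by the side that lacks the reference leaf,
        # so the same bipartition always has the same representation.
        result.add(other if reference_leaf in side else side)
    return result


def shared_splits(reference, other):
    """Return (splits of `reference` also found in `other`, total in `reference`)."""
    ref_splits = splits(reference)
    return len(ref_splits & splits(other)), len(ref_splits)


# Hypothetical six-taxon example: `aft` is less resolved than `ect`.
ect = ((("A", "B"), "C"), (("D", "E"), "F"))
aft = (("A", "B"), "C", ("D", "E", "F"))
print(shared_splits(ect, aft))  # ECT splits recovered in the AFT
print(shared_splits(aft, ect))  # AFT splits present in the ECT
```

In this toy case two of the three ECT splits are recovered, while both AFT splits are present in the ECT, which shows how the two directions can diverge once counts are expressed relative to each tree's own splits.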
Figure S2. Q accuracy
Fig. S2: Relationship between Q and topological accuracy. Q quantifies the extent to which a dataset is evenly sampled across partitions; as Q approaches 1 all partitions are evenly sampled; as Q approaches 0 only one partition contains data. Topological accuracy is quantified as the number of splits (i.e., unrooted clades) in the well-corroborated, extant combined topology (or ECT, Fig. 1) present in the artificial fossil topology (or AFT). For the 85 fossil templates, there is a statistically significant correlation (Pearson’s R = 0.694, p << 0.01, Rohlf & Sokal 1995: table R) between Q values (X-axis) and ECT splits present in AFTs (Y-axis). “All fossils” (black circles) represent our 85 real fossil templates, with a select few identified with polygons. “Random templates” (open circles) indicate artificial fossils generated using real character state data with missing entries inserted at random across partitions (see Fig. 2a and Methods).
FigS2-Qaccuracy.ai