A new Lower Permian ray-finned fish (Actinopterygii) from South Dakota and the use of tree space to find rogue taxa in phylogenetic analysis of morphological data
Data files
Jul 14, 2025 version; files total 98.68 MB
- README.md (25.50 KB)
- StacketalSupplementData.zip (98.66 MB)
Abstract
The divergence of extant lineages from the “palaeoniscoids”, a grade of Paleozoic and early Mesozoic Era species, remains unresolved in analyses of morphological data despite more than four decades of phylogenetic research. We describe a new ray-finned fish, Tenupiscis dakotaensis gen. et sp. nov., from the Lower Permian (Kungurian) of South Dakota to strengthen the phylogenetic framework of Mississippian–Triassic actinopterygians. Our initial parsimony and Bayesian phylogenetic analyses were unable to resolve the relationships of Mississippian–Triassic “palaeoniscoids”. We analyzed the topological variation among the trees sampled in each phylogenetic search (tree space) to determine if uncertainty was concentrated in a small subset of species with highly uncertain phylogenetic relationships relative to other terminal taxa (rogue taxa) or distributed evenly amongst early actinopterygians. The relationships of fourteen species were unresolved in the parsimony strict consensus due to a single rogue taxon (“Kalops monophyrum”). Parsimony and Bayesian analyses with the rogue pruned or recoded find the initially unresolved Mississippian–Triassic “palaeoniscoids” (including Tenupiscis) branching from the actinopterygian stem or from the base of Pan-Neopterygii. Our work therefore supports the emerging consensus that Paleozoic Era ray-finned fishes include clades of stem actinopterygians and the earliest members of the actinopterygian crown group. We also demonstrate that tree space methods can effectively identify and mitigate rogue taxon effects in phylogenetic analysis of morphological data from new fossil taxa.
GENERAL INFORMATION
Principal Investigator: Jack Stack (Virginia Tech, Blacksburg, VA, USA). Email: Jackrs@vt.edu
Co-authors: Michael D. Gottfried (Michigan State, East Lansing, MI, USA); Michelle R. Stocker (Virginia Tech, Blacksburg, VA, USA).
Date of project: 2021-2023.
DATA & FILE OVERVIEW
- StacketalSupplementData.zip
1. File List:
File Name: Stacketal_SupplementCode.r
File Description: R script file that provides the full code needed to run the tree space visualizations, rogue taxon search, and annotation of the Bayesian consensus tree in R (R Core Team 2021). The script lists every command-line instruction we used, each with an explanation of its purpose. We aimed to make every step in our search for rogue taxa clear and repeatable, so Stacketal_SupplementCode.r is heavily annotated. The code is written for a beginner R user, so advanced users may find some of it obvious, but our goal is for the work to be accessible beyond R experts.
Folder Name: Initial_Bayesian_Analysis
->Folder Name: Initial_Bayesian_Consensus_Tree
File Name: Stacketal_MrBayesCon_Initial.tre
File Description: Annotated majority rule consensus tree from MrBayes, which is the input for creating an annotated consensus in R. This tree serves as the result for the initial Bayesian search. This tree shows the arrangements with the highest estimated posterior probability that surpass a value of 0.5. If no arrangement for a taxon has an estimated posterior probability above 0.5, it is represented as uncertain (a polytomy).
File Name: BayesianConsensus_FigTre
File Description: Plain text file of the majority rule consensus tree from the initial Bayesian search that can be opened in Figtree v1.4.4 (Rambaut, 2018).
->Folder Name: MrBayes_Script
File Name: Initial_MrBayes_Script.rtf
File Description: Raw text formatted script to run the initial Bayesian search in MrBayes. These are step-by-step command-line instructions to replicate the steps of our analysis in MrBayes.
Relationship to other files: Requires "Stacketal_InitialMatrix.nex".
->Folder Name: Initial_Bayesian_Tree_Search
File Name: Stacketal_InitialMatrix.nex
File Description: Nexus formatted matrix for a Bayesian phylogenetic analysis in MrBayes 3.2.7a. In this context, a matrix describes the phylogenetic character annotations for the species of interest.
File Name: Stacketal_InitialBayes_Runonetrees.t
File Description: Nexus formatted file containing the trees sampled from our initial Bayesian search.
File Name: InitialBayes_TxCofPCOA.pdf
File Description: PDF file of a trustworthiness x continuity plot showing the reliability (the product of trustworthiness x continuity) of 1-12 dimensions in the PCoA space of the subsampled Bayesian trees.
File Name: InitialBayes_PCoA_Visualization.pdf
File Description: PDF formatted plot of the PCoA space from 1000 phylogenetic trees sampled from the initial Bayesian search. This plot shows the dissimilarity of the phylogenetic trees in a low-dimensional (7 in this instance) Euclidean space (Gower, 1966).
File Name: InitialBayes_ClusterSearch.pdf
File Description: PDF of a comparison of the reliability (silhouette coefficient; Kaufman & Rousseeuw, 1990) of the clusters identified via Partitioning Around Medoids (PAM; Kaufman & Rousseeuw, 1990; with algorithmic improvements from Schubert & Rousseeuw, 2021) and hierarchical clustering with minimax linkage (Hierarchical; Ao et al., 2005; Bien & Tibshirani, 2011) algorithms. The silhouette coefficient is a dimensionless measure of the degree to which objects in a cluster are close to other objects in their cluster relative to objects in the closest neighboring cluster (Kaufman & Rousseeuw, 1990).
File Name: InitialBayes_DensityPlot.pdf
File Description: PDF of a plot of the density of the subsampled initial Bayesian trees about the median tree. The density refers to the distance of each tree from the median tree (Smith, 2022).
Folder Name: Kalops_Recoded_BayesianAnalysis
->Folder Name: MrBayesScript
File Name: Kalops_Recoded_MrBayes_Script.rtf
File Description: Raw text formatted script for the Bayesian analysis with "Kalops monophyrum" recoded as Kalops monophrys in MrBayes. These are step-by-step command-line instructions to replicate each step of our analysis. These instructions include how to read "Stacketal_KalopsRecodedMatrix_Bayes.nex" into MrBayes, which is described below.
->Folder Name: Recoded_Bayesian_Tree_Search
File Name: Stacketal_KalopsRecodedMatrix_Bayes.nex
File Description: Nexus formatted matrix with "Kalops monophyrum" recoded as Kalops monophrys for Bayesian analysis in MrBayes 3.2.7a. In this context, a matrix describes the phylogenetic character annotations for the species of interest.
File Name: Stacketal_KalopsRecoded_Runonetrees.t
File Description: Nexus formatted tree file containing the trees sampled in the Bayesian search in MrBayes 3.2.7a with "Kalops monophyrum" recoded as Kalops monophrys.
File Name: RecodedBayes_TxCofPCOA.pdf
File Description: PDF trustworthiness x continuity plot (Kaski et al., 2003; Venna & Kaski, 2001) showing the reliability (the product of trustworthiness x continuity) of 1-12 dimensions in the Principal Coordinates Analysis space of the subsampled Bayesian trees.
File Name: RecodedBayes_PCoA_Visualization.pdf
File Description: PDF file of a plot of the PCoA space from the subsampled Bayesian trees. This plot shows the dissimilarity of the phylogenetic trees in a low-dimensional Euclidean space (Gower, 1966).
File Name: InitialBayes_ClusterSearch.pdf
File Description: PDF of a comparison of the reliability (silhouette coefficient; Kaufman & Rousseeuw, 1990) of the clusters identified via Partitioning Around Medoids (PAM; Kaufman & Rousseeuw, 1990; with algorithmic improvements from Schubert & Rousseeuw, 2021) and hierarchical clustering with minimax linkage (Hierarchical; Ao et al., 2005; Bien & Tibshirani, 2011) algorithms. The silhouette coefficient is a dimensionless measure of the degree to which objects in a cluster are close to other objects in their cluster relative to objects in the closest neighboring cluster (Kaufman & Rousseeuw, 1990).
File Name: RecodedBayes_DensityPlot.pdf
File Description: PDF of a plot of the density of the subsampled initial Bayesian trees about the median tree. The density refers to the distance of each tree from the median tree (Smith, 2022).
Folder Name: Initial_ParsimonyAnalysis
->Folder Name: Matrix
File Name: Initial_Parsimony_TNTMatrix.tnt
File Description: A .tnt formatted text file of a phylogenetic matrix formatted for analysis in TNT by Morphobank (O'Leary and Kaufman 2011). In this context, a matrix is a text file that describes the phylogenetic character annotations for the species of interest.
->Folder Name: Initial_Parsimony_Tree_Search
File Name: Initial_Parsimony_TNTReader.tnt
File Description: A text file needed for R to read the most parsimonious trees as output by TNT. Because TNT annotates species in tree files as numbers rather than their names, R needs a file to translate the numbers back to species names. This is a text file that is identical to "Initial_Parsimony_TNTMatrix.tnt".
File Name: Initial_Parsimony_Synapomorphies.emf
File Description: A .emf formatted image file showing the strict consensus phylogenetic tree of the initial parsimony search annotated with the unambiguous synapomorphies for each branching pattern. In this context, an unambiguous synapomorphy refers to shared characteristics for the phylogenetic relationship represented by the tree pattern.
File Name: Initial_Parsimony_MPT.tre
File Description: A .tre file of the most parsimonious trees from the initial maximum parsimony search output by TNT 1.5. In this context, a .tre file is a text file that describes phylogenetic tree structure.
File Name: Initial_MPT_Readable
File Description: A .tre text file of the most parsimonious trees from TNT that are written with the species annotated as their names, rather than numbers (as is the TNT standard). Unlike the TNT output, this file can be read more easily by R and other programs for examining phylogenetic trees.
File Name: InitialParsimony_TxCofPCOA.pdf
File Description: A PDF formatted trustworthiness x continuity plot (Kaski et al., 2003; Venna & Kaski, 2001) showing the reliability (the product of trustworthiness x continuity) of 1-12 dimensions in the Principal Coordinates Analysis space of the most parsimonious trees from the initial search in TNT 1.5.
File Name: InitialParsimony_3DPCOA.pdf
File Description: A PDF-formatted plot of the PCoA space of the most parsimonious trees. This plot shows the dissimilarity of the phylogenetic trees in a low-dimensional (3 in this instance) Euclidean space (Gower, 1966).
File Name: InitialParsimony_ClusterSearch.pdf
File Description: A PDF formatted plot of a comparison of the reliability (silhouette coefficient; Kaufman & Rousseeuw, 1990) of the clusters identified via Partitioning Around Medoids (PAM; Kaufman & Rousseeuw, 1990; with algorithmic improvements from Schubert & Rousseeuw, 2021) and hierarchical clustering with minimax linkage (Hierarchical; Ao et al., 2005; Bien & Tibshirani, 2011) algorithms. The silhouette coefficient is a dimensionless measure of the degree to which objects in a cluster are close to other objects in their cluster relative to objects in the closest neighboring cluster (Kaufman & Rousseeuw, 1990).
File Name: InitialParsimony_ClusterOneConsensus.pdf
File Description: PDF formatted file describing the strict consensus of the trees in the first cluster of the most parsimonious trees. This strict consensus tree shows the species arrangements that are present in 100% of the trees in the first cluster.
File Name: InitialParsimony_ClusterTwoConsensus.pdf
File Description: PDF formatted file describing the strict consensus of the trees in the second cluster of the most parsimonious trees. This strict consensus tree shows the species arrangements that are present in 100% of the trees in the second cluster.
File Name: InitialParsimony_DispersalDensityPlot.pdf
File Description: PDF formatted plot showing the dispersal of the most parsimonious trees about their median. The dispersal refers to the distance of each tree from the median tree (Smith, 2022).
File Name: Initial_Parsimony_StrictCon.tre
File Description: Text file describing the strict consensus of the most parsimonious trees from the initial maximum parsimony search. This strict consensus tree shows the species arrangements that are present in 100% of the most parsimonious trees.
Folder Name: Kalops_Removed_ParsimonyAnalysis
->Folder Name: Matrix
File Name: KalopsRemoved_TNTMAtrix.tnt
File Description: Text file of a phylogenetic matrix formatted for analysis in TNT by Morphobank (O'Leary and Kaufman 2011). In this context, a matrix describes the phylogenetic character annotations for the species of interest.
->Folder Name: Kalops_Removed_Tree_Search
File Name: Kalops_Removed_MPT.tre
File Description: A .tre text file of the 190 most parsimonious trees from the maximum parsimony search of the initial matrix with "Kalops monophyrum" removed in TNT 1.5.
File Name: Kalops_RemovedReaderFile.tnt
File Description: A text file needed for R to read the most parsimonious trees as output by TNT. Because TNT annotates species in tree files as numbers rather than their names, R needs a file to translate the numbers back to species names. This file is a text-formatted matrix identical in content to "KalopsRemoved_TNTMAtrix.tnt".
File Name: KalopsRemoved_MPT_Readable
File Description: A .tre text file of the most parsimonious trees from TNT that are written with the species annotated as their names, rather than numbers (as is the TNT standard). Unlike the TNT output, this file can be read more easily by R and other programs for examining phylogenetic trees.
File Name: KalopsRemoved_Parsimony_TxCofPCOA.pdf
File Description: A PDF formatted file of a trustworthiness x continuity plot showing the reliability (the product of trustworthiness x continuity) of the dimensions in the PCoA space of the most parsimonious trees from the maximum parsimony analysis with "Kalops monophyrum" removed.
File Name: KalopsRemoved_Synapomorphies.emf
File Description: A .emf formatted image file showing the strict consensus phylogenetic tree of the parsimony search with "Kalops monophyrum" removed from the matrix, annotated with the unambiguous synapomorphies for each branching pattern. In this context, an unambiguous synapomorphy refers to shared characteristics for the phylogenetic relationship represented by the tree pattern.
File Name: KalopsRemoved_Parsimony_PCoA_Visualization.pdf
File Description: A PDF formatted plot of a 4-dimensional visualization of the PCoA space of the most parsimonious trees from the maximum parsimony analysis with "Kalops monophyrum" removed. This plot shows the dissimilarity of the phylogenetic trees in a low-dimensional (4 in this instance) Euclidean space (Gower, 1966).
File Name: KalopsRemoved_Parsimony_ClusterSearch.pdf
File Description: A PDF formatted plot of a comparison of the reliability (silhouette coefficient) of the clusters identified via Partitioning Around Medoids (PAM; Kaufman & Rousseeuw, 1990; with algorithmic improvements from Schubert & Rousseeuw, 2021) and hierarchical clustering with minimax linkage (Hierarchical; Ao et al., 2005; Bien & Tibshirani, 2011) algorithms. The silhouette coefficient is a dimensionless measure of the degree to which objects in a cluster are close to other objects in their cluster relative to objects in the closest neighboring cluster (Kaufman & Rousseeuw, 1990).
File Name: KalopsRemoved_Parsimony_DensityPlot.pdf
File Description: PDF formatted plot of the density of the most parsimonious trees about their median, from the maximum parsimony analysis with "Kalops monophyrum" removed from the matrix. The density refers to the distance of each tree from the median tree (Smith, 2022).
File Name: KalopsRemoved_Strictconsensus.tre
File Description: A .tre text file of the strict consensus of the most parsimonious trees from the maximum parsimony search of the initial matrix with "Kalops monophyrum" removed. This strict consensus tree shows the species arrangements that are present in 100% of the most parsimonious trees.
Folder Name: Combined_Bayesian_Parsimony_TreeSpace
File Name: Initial_Parsimony_TNTReader.tnt
File Description: A .tnt formatted text file needed for R to read the most parsimonious trees as output by TNT. Because TNT annotates species in tree files as numbers rather than their names, R needs a file to translate the numbers back to species names. This is a text file that is identical to "Initial_Parsimony_TNTMatrix.tnt".
File Name: Initial_Parsimony_MPT.tre
File Description: A .tre formatted text file of the most parsimonious trees from the initial maximum parsimony search in TNT.
File Name: Stacketal_InitialBayes_Runonetrees.t
File Description: A Nexus-formatted text file containing the trees sampled in our initial Bayesian search.
File Name: CombinedTreeSpace_2DPlot.pdf
File Description: A PDF formatted plot of the first two dimensions of a principal coordinates analysis of the combined tree space of the most parsimonious trees and a sample of trees from the estimated Bayesian posterior distribution. This plot shows the dissimilarity of the phylogenetic trees in a low-dimensional (2 in this instance) Euclidean space (Gower, 1966).
File Name: CombinedTreeSpace_5DPlot.pdf
File Description: A PDF formatted plot of the first five dimensions of a principal coordinates analysis of the combined tree space of the most parsimonious trees and a sample of trees from the estimated Bayesian posterior distribution. This plot shows the dissimilarity of the phylogenetic trees in a low-dimensional (5 in this instance) Euclidean space (Gower, 1966).
File Name: CombinedTreeSample_ViolinPlot.pdf
File Description: A PDF formatted plot showing the dispersal of the most parsimonious and Bayesian tree samples as a violin plot. The dispersal refers to the distance of each tree from the median tree (Smith, 2022).
File Name: MostParsimoniousTree_DensityPlot.pdf
File Description: A PDF-formatted plot that shows the dispersal of the most parsimonious trees about their median tree. The density refers to the distance of each tree from the median tree (Smith, 2022).
File Name: BayesianTrees_DensityPlot.pdf
File Description: A PDF-formatted plot that shows the dispersal of the Bayesian tree sample about its median tree. The density refers to the distance of each tree from the median tree (Smith, 2022).
Folder Name: Kalops_Recoded_ParsimonyAnalysis
->Folder Name: Matrix
File Name: KalopsRecoded.tnt
File Description: A TNT formatted text file of the phylogenetic matrix used for the maximum parsimony analysis where "Kalops monophyrum" is replaced with Kalops monophrys. Matrix created in Morphobank (O'Leary and Kaufman 2011). In this context, a matrix describes the phylogenetic character annotations for the species of interest.
->Folder Name: Kalops_Recoded_Tree_Search
File Name: KalopsRecoded_MPT.tre
File Description: A .tre formatted text file of the TNT output of the most parsimonious trees from the maximum parsimony analysis where "Kalops monophyrum" is replaced with Kalops monophrys.
File Name: KalopsRecoded.tnt
File Description: A .tnt formatted text file needed for R to read the most parsimonious trees as output by TNT. Because TNT annotates species in tree files as numbers rather than their names, R needs a file to translate the numbers back to species names. This file is identical to "KalopsRecoded.tnt" in the Matrix folder.
File Name: KalopsRecoded_MPT_Readable
File Description: A .tre formatted text file of the most parsimonious trees from TNT that are written with the species annotated as their names, rather than numbers (as is the TNT standard). Unlike the TNT output, this file can be read more easily by R and other programs for examining phylogenetic trees.
File Name: KalopsRecoded_Synapomorphies.emf
File Description: A .emf formatted image file showing the strict consensus phylogenetic tree of the parsimony search with "Kalops monophyrum" recoded as Kalops monophrys in the matrix, annotated with the unambiguous synapomorphies for each branching pattern. In this context, an unambiguous synapomorphy refers to shared characteristics for the phylogenetic relationship represented by the tree pattern.
File Name: KalopsRecoded_Parsimony_TxCofPCOA.pdf
File Description: A PDF formatted file of a trustworthiness x continuity plot showing the reliability of the dimensions in the PCoA space of the most parsimonious trees from the maximum parsimony analysis with "Kalops monophyrum" recoded as Kalops monophrys.
File Name: KalopsRecoded_Parsimony_PCoA_Visualization.pdf
File Description: A PDF formatted file of a 2-dimensional visualization of the PCoA space of the most parsimonious trees from the maximum parsimony analysis with "Kalops monophyrum" recoded as Kalops monophrys. This plot shows the dissimilarity of the phylogenetic trees in a low-dimensional (2 in this instance) Euclidean space (Gower, 1966).
File Name: KalopsRecoded_Parsimony_ClusterSearch.pdf
File Description: A PDF-formatted file of a plot comparing the reliability (silhouette coefficient) of the clusters identified in each cluster search.
File Name: KalopsRecoded_Parsimony_DensityPlot.pdf
File Description: A PDF formatted plot of the density of the most parsimonious trees from the maximum parsimony analysis with "Kalops monophyrum" recoded as Kalops monophrys.
File Name: KalopsRecoded_Strictconsensus.tre
File Description: A .tre formatted text file of the strict consensus of the most parsimonious trees from the maximum parsimony search of the initial matrix with "Kalops monophyrum" recoded as Kalops monophrys.
References:
Aberer A.J., Krompass D., Stamatakis A. 2013. Pruning rogue taxa improves phylogenetic accuracy: An efficient algorithm and webservice. Systematic biology. 62(1):162–166.
Adler D., Kelly S.T. 2022. vioplot: violin plot. R package version 0.4.0. https://github.com/TomKellyGenetics/vioplot
Argyriou T., Giles S., Friedman M., Romano C., Kogan I., Sánchez-Villagra M.R. 2018. Internal cranial anatomy of Early Triassic species of †Saurichthys (Actinopterygii: †Saurichthyiformes): Implications for the phylogenetic placement of †saurichthyiforms. BMC Evolutionary Biology. 18:1–41.
Bien J., Tibshirani R. 2011. Hierarchical clustering with prototypes via minimax linkage. Journal of the American Statistical Association. 106:1075–1084.
Bien J., Tibshirani R. 2022. Protoclust: Hierarchical clustering with prototypes. https://cran.r-project.org/web/packages/protoclust/index.html.
Coates M.I., Tietjen K. 2019. ‘This strange little palaeoniscid': A new early actinopterygian genus, and commentary on pectoral fin conditions and function. Earth and Environmental Science Transactions of The Royal Society of Edinburgh. 109(1–2):15–31.
Farris J.S. 1989. The retention index and the rescaled consistency index. Cladistics: the international journal of the Willi Hennig Society. 5(4):417–419.
Gelman A., Rubin D.B. 1992. Inference from iterative simulation using multiple sequences. Statistical Science. 7(4):457-472.
Goloboff P.A., Farris J.S., Nixon K.C. 2008. TNT, a free program for phylogenetic analysis. Cladistics. 24(5):774–786.
Goloboff PA, Catalano SA. 2016. TNT version 1.5, including a full implementation of phylogenetic morphometrics. Cladistics. 32(3):221–238.
Gower JC. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 53(3-4):325-338.
Kaski S., Nikkilä J., Oja M., Venna J., Törönen P., Castrén E. 2003. Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics. 4:1–13.
Kaufman L., Rousseeuw P.J. 1990. Partitioning around medoids (program PAM). Finding groups in data: An introduction to cluster analysis. Hoboken, New Jersey: John Wiley & Sons, Ltd.
Kluge A.G., Farris J.S. 1969. Quantitative phyletics and the evolution of anurans. Systematic Biology. 18(1):1-32.
Lewis P.O. 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic biology. 50(6):913–925.
Maechler M., Rousseeuw P., Struyf A., Hubert M., Hornik K. 2022. Cluster: Cluster analysis basics and extensions. R package version 2.1.3 2022. https://CRAN.R-project.org/package=cluster.
Nixon K.C., Carpenter J.M. 1996. On consensus, collapsibility, and clade concordance. Cladistics. 12(4):305–321.
O’Leary MA, Kaufman S. 2011. Morphobank: Phylophenomics in the “cloud”. Cladistics. 27(5):529-537.
Ripley B.D. 2009. Stochastic simulation. John Wiley & Sons.
Paradis E, Schliep K. 2019. Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 35:526–528.
Poplin C.M., Lund R. 2002. Two Carboniferous fine-eyed palaeoniscoids (Pisces, Actinopterygii) from Bear Gulch (USA). Journal of Paleontology. 76:1014–1028.
Rambaut A. 2018. Figtree tree figure drawing tool version 1.4.4. https://github.com/rambaut/figtree/releases.
Ronquist F., Teslenko M., Van Der Mark P., Ayres D.L., Darling A., Höhna S., Larget B., Liu L., Suchard M.A., Huelsenbeck J.P. 2012. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic biology. 61(3):539–542.
Schubert E., Rousseeuw P.J. 2021. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Information Systems. 101:101804.
Smith M. 2019. Treetools: Create, modify and analyse phylogenetic trees. Comprehensive R Archive Network. doi:10.5281/zenodo.3522725.
Smith MR. 2020. Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics. 36(20):5007–5013.
Smith M.R. 2020. Treedist: Distances between phylogenetic trees. R package version 2.4.0. 2020. doi: 10.5281/zenodo.3528124.
Smith M.R. 2021. Using information theory to detect rogue taxa and improve consensus trees. Systematic Biology 0:1–7.
Smith M.R. 2022. Robust analysis of phylogenetic tree space. Systematic Biology. 0(syab099):1–16.
Stack J, Gottfried M.D. 2022. A new, exceptionally well-preserved Permian actinopterygian fish from the Minnekahta Limestone of South Dakota, USA. Journal of Systematic Palaeontology. 19:1271–1302.
Venna J., Kaski S. 2001. Neighborhood preservation in nonlinear projection methods: An experimental study. In: Dorffner G, Bischof H, Hornik K. editors. Artificial neural networks — ICANN 2001. Lecture notes in computer science. Berlin: Springer. p. 485–492.
Wright A.M., Lloyd G.T. 2020. Bayesian analyses in phylogenetic palaeontology: interpreting the posterior sample. Palaeontology 1–10.
The input data in the form of phylogenetic trees were generated in a series of analyses (see "Parsimony analyses" and "Bayesian analyses" below). The matrices that we ran in the phylogenetics programs to make the trees are also provided (see "Matrix construction" below). We used R to summarize the results of the Bayesian searches as majority rule consensus trees, visualize the variation in the samples of phylogenetic trees ("Tree space visualization"), and determine if any rogue taxa were present in our dataset ("Rogue taxon search").
Summary of experimental efforts underlying this dataset:
This document describes a series of phylogenetic analyses of ray-finned fishes (actinopterygians) on the basis of a dataset of morphological data derived from Stack and Gottfried (2022). A full description of the methods is provided in the draft manuscript and supplementary information documents, but we will provide a summary of relevant information for how the data were generated and all information needed to replicate the analyses.
Matrix construction:
We coded the new taxon for 222 discrete morphological characters using the matrix of Stack & Gottfried (2022), which incorporates coding changes from Argyriou et al. (2018) and Coates & Tietjen (2019), adds the early Permian actinopterygian Concentrilepis minnekahtaensis, and reduces the taxon list to focus on actinopterygian interrelationships (Stack & Gottfried, 2022). The full list of changes is available in the supplementary material of Stack and Gottfried (2022). The full matrix used in our initial analyses contains 10341 scorings for 75 taxa; this matrix and analyses using it are labeled "Initial" below. We also used a matrix where the scorings for the terminal taxon "Kalops monophyrum" (nomen nudum, meaning a name that is not linked to a described taxon) were removed, referred to below as "Removed". The third and final version of the matrix replaces "Kalops monophyrum" with Kalops monophrys (Poplin & Lund, 2002), and is labeled as "Recoded". All parsimony and Bayesian analyses followed identical steps described below; the only difference between analyses was the matrices used. We used the open-access Morphobank (O'Leary and Kaufman 2011) to annotate and output all matrix files.
Parsimony analyses:
All parsimony analyses were conducted in TNT 1.5 (Goloboff et al., 2008; Goloboff & Catalano, 2016) and implemented an initial New Technology Search with a combination of the Sectorial Search, Ratchet, Tree Fusing, and Drift algorithms to find the optimal tree length 500 times (random seed =1; Goloboff et al., 2008). We conducted a subsequent traditional search with Tree Bisection and Reconnection on the topologies returned from the New Technology search. Our strict consensus tree summarizes the agreement between the most parsimonious trees from the traditional search (Nixon & Carpenter, 1996). We mapped unambiguous synapomorphies from the most parsimonious trees onto the strict consensus for each analysis in TNT. We also calculated Bremer support values in TNT by conducting Tree Bisection and Reconnection on the most parsimonious trees and allowing the analysis to retain all trees 1-6 steps longer than the optimal length. We calculated the consistency index (CI; Kluge and Farris, 1969) and retention index (RI; Farris, 1989) of the strict consensus in TNT with the stats.run command.
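The two fit statistics named above follow simple ratios, which can be sketched as below. This is a minimal illustration in Python, not the TNT stats.run script: the `CharacterSteps` class, the function names, and the three toy characters are hypothetical and do not come from our matrix. It assumes the ensemble forms of the indices, CI = M/S (Kluge & Farris, 1969) and RI = (G - S)/(G - M) (Farris, 1989), where M, S, and G are the minimum, observed, and maximum step counts summed over characters.

```python
from dataclasses import dataclass


@dataclass
class CharacterSteps:
    """Step counts for one character (hypothetical values, not from the matrix)."""
    minimum: int   # m: fewest steps possible on any tree (number of states - 1)
    observed: int  # s: steps the character needs on the tree being evaluated
    maximum: int   # g: most steps the character could need on any tree


def consistency_index(chars):
    """Ensemble CI = sum(m) / sum(s); 1.0 means no homoplasy."""
    return sum(c.minimum for c in chars) / sum(c.observed for c in chars)


def retention_index(chars):
    """Ensemble RI = (sum(g) - sum(s)) / (sum(g) - sum(m))."""
    m = sum(c.minimum for c in chars)
    s = sum(c.observed for c in chars)
    g = sum(c.maximum for c in chars)
    if g == m:
        raise ValueError("RI is undefined when no character can vary in length")
    return (g - s) / (g - m)


# Three toy characters: M = 4, S = 7, G = 13.
toy = [CharacterSteps(1, 1, 3), CharacterSteps(1, 2, 4), CharacterSteps(2, 4, 6)]
ci = consistency_index(toy)  # 4 / 7
ri = retention_index(toy)    # (13 - 7) / (13 - 4)
```

Both indices approach 1 as homoplasy decreases; RI additionally discounts characters that cannot distinguish between trees.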
Bayesian analyses:
All Bayesian phylogenetic analyses were conducted in MrBayes 3.2.7a using two independent Metropolis-coupled Markov chain Monte Carlo analyses with the MkV model for discrete morphological data (Ronquist et al., 2012; Lewis, 2001). Each Metropolis-coupled Markov chain Monte Carlo analysis had four independent Markov chains that ran for an initial 500,000 iterations, with burn-in set to 25% and sampling every 100 generations. We ran 4.5 million generations prior to reaching a standard deviation of split frequencies of 0.008282, with the minimum Effective Sample Size (ESS; Ripley, 2009) exceeding 6000 and the Potential Scale Reduction Factor (PSRF; Gelman & Rubin, 1992) values equaling 1.0. We also used the “plot” command in MrBayes to examine the trend in sampled log-likelihood values to ensure that they are randomly distributed within the space between generations 1,125,000 and 4,500,000, indicating that the chains converged on a stable region of the posterior distribution. We generated majority rule consensus trees in MrBayes with the “sumt Burninfrac=0.5” command. We imported the nexus formatted consensus tree into R (R Core Team, 2021) to generate a more flexible annotated consensus tree with the ape (Paradis & Schliep, 2019), phytools (Revell, 2012), and phylotate (Beer & Beer, 2019) packages, which we opened in Figtree v1.4.4 (Rambaut, 2018).
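The generation range quoted above follows directly from the run length and the 25% burn-in, as the following sketch shows. It is a hedged Python illustration, not a MrBayes command: the function name is hypothetical, and it assumes that a sample is recorded at generation 0 and at every 100th generation thereafter. The random draw of 1000 trees mirrors the subsampling used for the tree space analyses; the seed value is arbitrary.

```python
import random


def retained_generations(total_generations, sample_every, burnin_fraction):
    """Sampled generations that remain after discarding the burn-in fraction."""
    first_kept = int(total_generations * burnin_fraction)
    sampled = range(0, total_generations + 1, sample_every)
    return [g for g in sampled if g >= first_kept]


# Values taken from the text: 4.5 million generations, sampled every 100, 25% burn-in.
kept = retained_generations(4_500_000, 100, 0.25)

# Draw 1000 post-burn-in samples without replacement (seed chosen for illustration).
rng = random.Random(1)
subsample = sorted(rng.sample(kept, 1000))
```

Under these assumptions the first retained sample is generation 1,125,000 and the last is generation 4,500,000, matching the window examined with the “plot” command.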
Tree space visualization:
We applied a series of “tree space” techniques for visualizing variation in phylogenetic searches (Smith, 2022; Wright & Lloyd, 2020) to examine the variation in our tree searches and determine the source of any low resolution in our consensus topologies. We conducted three parallel studies of tree space in R (R Core Team, 2021), with the cluster (Maechler et al., 2022), TreeTools (Smith, 2019), TreeDist (Smith, 2020), vioplot (Adler & Kelly, 2022), ape (Paradis & Schliep, 2019), and protoclust (Bien & Tibshirani, 2022) packages. Our analyses are inspired by vignettes by Martin R. Smith (https://github.com/ms609/TreeDist/blob/HEAD/vignettes/treespace.Rmd; https://ms609.github.io/TreeDist/dev/articles/compare-treesets.html). We examined the most parsimonious trees and 1000 randomly sampled Bayesian trees on their own in addition to a separate analysis of the Bayesian and most parsimonious trees together. A full script and the files needed to recreate these analyses in R are provided in the Supplementary Data. We calculated the distance between trees via the clustering information distance metric, which Smith (2020) demonstrated to be the most consistent measure of tree dissimilarity among available metrics. See Smith (2020) for detailed comparisons and rigorous testing of measures of tree distance. We performed a principal coordinates analysis (PCoA or metric multidimensional scaling; Gower, 1966) of each tree sample to create a twelve-dimensional mapping of the distances between the topologies. We calculated the product of the trustworthiness and continuity (TxC; Kaski et al., 2003; Venna & Kaski, 2001) of mappings in 1-12 dimensions to determine how many dimensions were needed to reliably visually represent the distances between the topologies in each tree sample. 
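The trustworthiness and continuity scores used to choose the number of dimensions can be sketched as follows. This is an illustrative Python implementation of the rank-based definitions of Venna & Kaski (2001), not the R code in our supplement; the function names are hypothetical, ties are broken by index (an assumption), and the six toy points stand in for trees. Both inputs are square distance matrices: one for the original distances and one for the distances in the low-dimensional mapping.

```python
def _ranks(dist):
    """ranks[i][j] = rank of j among the neighbours of i (1 = nearest)."""
    n = len(dist)
    ranks = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: (dist[i][j], j))
        row = [0] * n
        for rank, j in enumerate(order, start=1):
            row[j] = rank
        ranks.append(row)
    return ranks


def _penalty(ranks_a, ranks_b, k):
    """Sum of (rank in a - k) over pairs that are k-neighbours in b but not in a."""
    n = len(ranks_a)
    total = 0
    for i in range(n):
        for j in range(n):
            if j != i and ranks_b[i][j] <= k and ranks_a[i][j] > k:
                total += ranks_a[i][j] - k
    return total


def trustworthiness_continuity(original, mapped, k):
    """Return (trustworthiness, continuity) for neighbourhoods of size k."""
    n = len(original)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    r_orig, r_map = _ranks(original), _ranks(mapped)
    trust = 1.0 - norm * _penalty(r_orig, r_map, k)  # false neighbours in the mapping
    cont = 1.0 - norm * _penalty(r_map, r_orig, k)   # true neighbours lost by the mapping
    return trust, cont


def line_distances(positions):
    """Pairwise absolute differences between positions on a line."""
    return [[abs(a - b) for b in positions] for a in positions]


original = line_distances([0, 1, 2, 3, 4, 5])
faithful = trustworthiness_continuity(original, original, k=2)
scrambled = trustworthiness_continuity(original, line_distances([0, 3, 1, 4, 2, 5]), k=2)
txc_scrambled = scrambled[0] * scrambled[1]
```

A mapping that preserves every neighbourhood scores 1 on both measures, whereas the scrambled mapping is penalised on both, so its product falls below 1.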
Trustworthiness measures the degree to which points that are close together in the mapping are also close together in the original distance matrix (Kaski et al., 2003), whereas continuity measures the degree to which points that are close together in the original matrix remain close together in the mapping (Smith, 2022; Venna & Kaski, 2001). We mapped each tree space with the number of dimensions needed to meet or surpass a TxC of 0.9, following the recommendation of Smith (2022). We searched for clustering in each tree distance matrix via Partitioning Around Medoids (PAM; Kaufman & Rousseeuw, 1990; with algorithmic improvements from Schubert & Rousseeuw, 2021) and hierarchical clustering with minimax linkage (Hierarchical; Ao et al., 2005; Bien & Tibshirani, 2011) algorithms. We calculated the silhouette coefficient (Kaufman & Rousseeuw, 1990) to evaluate the reliability of the clustering structures with 2–12 clusters identified by each algorithm. The silhouette coefficient is a dimensionless measure of the degree to which objects in a cluster are close to other objects in their cluster relative to objects in the closest neighboring cluster (Kaufman & Rousseeuw, 1990). We further evaluated potential clustering by calculating and visualizing the dispersal of each tree sample, which is the distance between each tree and the median tree of that sample, to better understand the geometry of each tree space (Smith, 2022). The median tree has the shortest average distance to every other tree in the set (Smith, 2022). Examining the spread of the tree samples about their median allowed us to verify the landscapes shown in the initial tree space analyses. We visualized dispersal between and within the most parsimonious trees and the Bayesian tree sample using violin plots (Adler & Kelly, 2022) and density plots, based on a vignette by Tom Kelly, https://cran.r-project.org/web/packages/vioplot/vignettes/violin_area.html
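The median tree, dispersal, and silhouette coefficient described above all derive from a pairwise distance matrix. The Python sketch below shows one way to compute them, assuming the matrix is already available (in the analyses it would hold clustering information distances between trees) and that cluster labels come from a separate step such as PAM. The function names and the five-point toy matrix are hypothetical, and this is not the R code supplied in the Supplementary Data.

```python
def median_index(dist):
    """Index of the item with the smallest mean distance to all others."""
    n = len(dist)
    return min(range(n), key=lambda i: sum(dist[i]) / (n - 1))


def dispersal(dist):
    """Distance from every item to the median item of the sample."""
    m = median_index(dist)
    return [row[m] for row in dist]


def mean_silhouette(dist, labels):
    """Mean silhouette width (Kaufman & Rousseeuw, 1990) of a clustering."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest neighbouring cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy data: five "trees" placed on a line, forming two obvious groups.
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]

print(median_index(dist))                       # index of the median item
print(dispersal(dist))                          # distances to the median item
print(mean_silhouette(dist, [0, 0, 0, 1, 1]))   # well-separated clustering
print(mean_silhouette(dist, [0, 1, 0, 1, 0]))   # poorly supported clustering
```

A mean silhouette width close to 1 indicates compact, well-separated clusters, whereas values near or below 0 suggest that the proposed clustering structure is not reliable.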
Rogue taxon search:
In this context, rogue taxa are species with highly uncertain phylogenetic position relative to other species in the same analysis (Smith, 2021). To determine whether any of these unstable taxa acted as rogues in our analysis, we ran the QuickRogue function of the R (R Core Team, 2021) package Rogue (Smith, 2021) on the most parsimonious trees and on 1000 trees randomly sampled from the first run of the Bayesian analysis (after discarding a burn-in of 50%). We chose the QuickRogue function because it can identify rogues as reliably as alternative heuristics in Rogue and RogueNaRok (Aberer et al., 2013) with the benefit of lower computation time (Smith, 2021). The Rogue output shows the splitwise phylogenetic information content (the sum of the information content contained in the bipartitions of a topology; Smith, 2021) of the baseline majority rule consensus of the tree sample and the rawImprovement, which is the change in phylogenetic information content that results from removing each rogue taxon. We compared the rawImprovement scores of the rogues to determine how much each reduced consensus resolution relative to the others. We conducted follow-up maximum parsimony searches in TNT 1.5 with the sole rogue taxon (“Kalops monophyrum”) identified in the initial maximum parsimony analysis removed from the matrix. “Kalops monophyrum” is not one of the two described species of Kalops (Poplin & Lund, 2002) and is therefore a nomen nudum. Given the rogue behavior of this taxon, we opted to remove “Kalops monophyrum” and re-score Kalops based on personal examination of the type specimen of Kalops monophrys (Poplin & Lund, 2002; CM 27372) and the original description (Poplin & Lund, 2002). The rationale for each character coding change is provided in Part C of the Supplementary Information.
We conducted an additional maximum parsimony analysis with “Kalops monophyrum” pruned from the matrix, along with a maximum parsimony analysis and a Bayesian search with Kalops monophrys substituted for “Kalops monophyrum”. These searches used the same phylogenetic search and tree space methods as the initial analyses.
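The effect of a single rogue taxon on a strict consensus can be illustrated with a toy Python sketch. It represents rooted trees as nested tuples, prunes one taxon at a time, and counts the clades shared by every tree before and after pruning. This clade count is a much cruder score than the splitwise phylogenetic information content used by Rogue, and the procedure is not the QuickRogue algorithm; the taxon names, trees, and function names are hypothetical.

```python
def clades(tree):
    """Non-trivial clades (leaf sets of internal nodes, root excluded)."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    root = walk(tree)
    found.discard(root)
    return found


def prune(tree, taxon):
    """Remove one taxon and collapse any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [p for p in (prune(child, taxon) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def strict_consensus_clades(trees):
    """Clades present in every tree of the sample."""
    return set.intersection(*(clades(t) for t in trees))


def consensus_gain(trees, taxon):
    """Change in the number of strict-consensus clades after pruning `taxon`."""
    before = len(strict_consensus_clades(trees))
    after = len(strict_consensus_clades([prune(t, taxon) for t in trees]))
    return after - before


# Three equally optimal toy trees that differ only in where taxon "R" attaches.
trees = [
    ("O", ((("A", "R"), "B"), ("C", "D"))),
    ("O", (("A", "B"), (("C", "R"), "D"))),
    (("O", "R"), (("A", "B"), ("C", "D"))),
]

for taxon in ["A", "B", "C", "D", "R"]:
    print(taxon, consensus_gain(trees, taxon))
```

In this toy sample the strict consensus of the full trees is completely unresolved, and resolution is recovered only when the wandering taxon is pruned, which mirrors the pattern described for “Kalops monophyrum”.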
References:
Aberer A.J., Krompass D., Stamatakis A. 2013. Pruning rogue taxa improves phylogenetic accuracy: An efficient algorithm and webservice. Systematic biology. 62(1):162–166.
Adler D., Kelly S.T. 2022. vioplot: violin plot. R package version 0.4.0. https://github.com/TomKellyGenetics/vioplot
Argyriou T., Giles S., Friedman M., Romano C., Kogan I., Sánchez-Villagra M.R. 2018. Internal cranial anatomy of Early Triassic species of †Saurichthys (Actinopterygii: †Saurichthyiformes): Implications for the phylogenetic placement of †saurichthyiforms. BMC Evolutionary Biology. 18:1–41.
Bien J., Tibshirani R. 2011. Hierarchical clustering with prototypes via minimax linkage. Journal of the American Statistical Association. 106:1075–1084.
Bien J., Tibshirani R. 2022. Protoclust: Hierarchical clustering with prototypes. https://cran.r-project.org/web/packages/protoclust/index.html.
Coates M.I., Tietjen K. 2019. ‘This strange little palaeoniscid': A new early actinopterygian genus, and commentary on pectoral fin conditions and function. Earth and Environmental Science Transactions of The Royal Society of Edinburgh. 109(1–2):15–31.
Farris J.S. 1989. The retention index and the rescaled consistency index. Cladistics: the international journal of the Willi Hennig Society. 5(4):417–419.
Gelman A., Rubin D.B. 1992. Inference from iterative simulation using multiple sequences. Statistical Science. 7(4):457–472.
Goloboff P.A., Farris J.S., Nixon K.C. 2008. TNT, a free program for phylogenetic analysis. Cladistics. 24(5):774–786.
Goloboff P.A., Catalano S.A. 2016. TNT version 1.5, including a full implementation of phylogenetic morphometrics. Cladistics. 32(3):221–238.
Gower J.C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 53(3–4):325–338.
Kaski S., Nikkilä J., Oja M., Venna J., Törönen P., Castrén E. 2003. Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics. 4:1–13.
Kaufman L., Rousseeuw P.J. 1990. Partitioning around medoids (program PAM). Finding groups in data: An introduction to cluster analysis. Hoboken, New Jersey: John Wiley & Sons, Ltd.
Kluge A.G., Farris J.S. 1969. Quantitative phyletics and the evolution of anurans. Systematic Biology. 18(1):1–32.
Lewis P.O. 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic biology. 50(6):913–925.
Maechler M., Rousseeuw P., Struyf A., Hubert M., Hornik K. 2022. Cluster: Cluster analysis basics and extensions. R package version 2.1.3 2022. https://CRAN.R-project.org/package=cluster.
Nixon K.C., Carpenter J.M. 1996. On consensus, collapsibility, and clade concordance. Cladistics. 12(4):305–321.
O’Leary M.A., Kaufman S. 2011. Morphobank: Phylophenomics in the “cloud”. Cladistics. 27(5):529–537.
Ripley B.D. 2009. Stochastic simulation. John Wiley & Sons.
Paradis E., Schliep K. 2019. Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 35:526–528.
Poplin C.M., Lund R. 2002. Two Carboniferous fine-eyed palaeoniscoids (Pisces, Actinopterygii) from Bear Gulch (USA). Journal of Paleontology. 76:1014–1028.
R (R Core Team 2021) is needed to run the code in Stacketal_SupplementCode. We recommend using Morphobank (Morphobank.org; O'Leary and Kaufman 2011) to open and read the phylogenetic matrix files, although they can also be opened as text files. Re-running the phylogenetic analyses would require TNT 1.5 (parsimony; Goloboff et al., 2008; Goloboff & Catalano, 2016) and MrBayes 3.7.2a (Ronquist et al., 2012).
Rambaut A. 2018. Figtree tree figure drawing tool version 1.4.4. https://github.com/rambaut/figtree/releases.
Ronquist F., Teslenko M., Van Der Mark P., Ayres D.L., Darling A., Höhna S., Larget B., Liu L., Suchard M.A., Huelsenbeck J.P. 2012. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic biology. 61(3):539–542.
Schubert E., Rousseeuw P.J. 2021. Fast and eager k-medoids clustering: O (k) runtime improvement of the pam, clara, and clarans algorithms. Information Systems. 101:101804.
Smith M. 2019. TreeTools: Create, modify and analyse phylogenetic trees. Comprehensive R Archive Network. doi:10.5281/zenodo.3522725.
Smith M.R. 2020. Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics. 36(20):5007–5013.
Smith M.R. 2020. TreeDist: Distances between phylogenetic trees. R package version 2.4.0. doi:10.5281/zenodo.3528124.
Smith M.R. 2021. Using information theory to detect rogue taxa and improve consensus trees. Systematic Biology 0:1–7.
Smith M.R. 2022. Robust analysis of phylogenetic tree space. Systematic Biology. 0(syab099):1–16.
Stack J., Gottfried M.D. 2022. A new, exceptionally well-preserved Permian actinopterygian fish from the Minnekahta Limestone of South Dakota, USA. Journal of Systematic Palaeontology. 19:1271–1302.
Venna J., Kaski S. 2001. Neighborhood preservation in nonlinear projection methods: An experimental study. In: Dorffner G, Bischof H, Hornik K. editors. Artificial neural networks — ICANN 2001. Lecture notes in computer science. Berlin: Springer. p. 485–492.
Wright A.M., Lloyd G.T. 2020. Bayesian analyses in phylogenetic palaeontology: interpreting the posterior sample. Palaeontology 1–10.
- Stack, Jack; Gottfried, Michael; Stocker, Michelle (2025). A New Lower Permian Ray-Finned Fish (Actinopterygii) From South Dakota and the Use of Tree Space to Find Rogue Taxa in Phylogenetic Analysis of Morphological Data. Bulletin of the Society of Systematic Biologists. https://doi.org/10.18061/bssb.v3i2.9825
