Data from: The fundamental role of character coding in Bayesian morphological phylogenetics

Data files

Feb 12, 2024 version files (17.48 MB total):
- data_for_q_matrix.zip (17.47 MB)
- README.md (13.26 KB)

Abstract

Phylogenetic trees establish a historical context for the study of organismal form and function. Most phylogenetic trees are estimated using a model of evolution. For molecular data, modeling evolution is often based on biochemical observations about changes between character states. For example, there are four nucleotides, and we can make assumptions about the probability of transitions between them. By contrast, for morphological characters, we may not know a priori how many character states there are per character, as both extant sampling and the fossil record may be highly incomplete, which leads to an observer bias. For a given character, the state space may be larger than what has been observed in the sample of taxa collected by the researcher. In this case, how many evolutionary rates are needed to even describe transitions between morphological character states may not be clear, potentially leading to model misspecification. To explore the impact of this model misspecification, we simulated character data with varying numbers of character states per character. We then used the data to estimate phylogenetic trees using models of evolution with the correct number of character states and an incorrect number of character states. The results of this study indicate that this observer bias may lead to phylogenetic error, particularly in the branch lengths of trees. If the state space is wrongly assumed to be too large, then we underestimate the branch lengths, and the opposite occurs when the state space is wrongly assumed to be too small.
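
The direction of the branch-length bias described above can be illustrated with a small calculation. The sketch below is not the analysis from the manuscript; it assumes an equal-rates Mk model scaled to one expected change per unit of branch length, and the function names and the observed proportion of 0.3 are hypothetical. Under that assumption, the same observed proportion of differing characters implies a shorter branch when the assumed number of states k is larger.

```python
import math


def mk_prob_change(k: int, t: float) -> float:
    """Probability that a k-state character differs across a branch of length t.

    Assumes an equal-rates Mk model scaled to one expected change per unit of
    branch length (an assumption of this sketch, not taken from the dataset).
    """
    return (k - 1) / k * (1.0 - math.exp(-k * t / (k - 1)))


def mk_branch_length(k: int, p: float) -> float:
    """Branch length implied by an observed proportion p of differing characters."""
    saturation = (k - 1) / k  # largest proportion of differences the model allows
    if not 0.0 <= p < saturation:
        raise ValueError("p must lie in [0, (k - 1) / k)")
    # Inverse of mk_prob_change with respect to t.
    return -saturation * math.log(1.0 - p / saturation)


if __name__ == "__main__":
    p_observed = 0.3  # hypothetical proportion of characters that differ
    for k in (2, 3, 4, 5):
        # The implied branch length shrinks as the assumed state space grows.
        print(k, round(mk_branch_length(k, p_observed), 3))
```

Assuming more states than were actually present therefore pulls the implied branch length down, and assuming fewer pushes it up, which matches the pattern reported in the abstract.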

Dataset Attribution and Usage

Dataset Title: "Data from: The fundamental role of character coding in Bayesian morphological phylogenetics"
Identifier: https://doi.org/10.5061/dryad.p2ngf1vvp
Authors: Basanta Khakurel, Courtney Grigsby, Tyler D. Tran, Juned Zariwala, Sebastian Höhna, April M. Wright.
Date of Issue: 2024-02-07
License: Use of these data and scripts is covered by the following license:
- Title: CC0 1.0 Universal (CC0 1.0)
- Specification: https://creativecommons.org/publicdomain/zero/1.0/; the authors respectfully request to be contacted by researchers interested in the re-use of these data so that the possibility of collaboration can be discussed.
Suggested Citation:
- Dataset citation:
Khakurel, Basanta et al. (Forthcoming 2024). Data for: "The fundamental role of character coding in Bayesian morphological phylogenetics", Dataset. Dryad. https://doi.org/10.5061/dryad.p2ngf1vvp

Contact Information

Name: Basanta Khakurel
- Current Affiliations: GeoBio-Center, Ludwig-Maximilians-Universität München, Germany; Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, Germany
- ORCID ID: https://orcid.org/0000-0003-0511-2215
- Email: <b.khakurel@lmu.de>
- Alternate Email: basantakhakurel@gmail.com
Alternative Contact:
- Name: April M. Wright
- Affiliations: Department of Biological Sciences, Southeastern Louisiana University.
- ORCID ID: https://orcid.org/0000-0003-4692-3225
- Email: april.wright@selu.edu

Additional Dataset Metadata

Acknowledgements

Funding Sources: AMW and BK were supported by NSF DEB 2045842. AMW, TDT, CG, and BK were supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant number P20 GM103424-21. AMW was additionally supported by NSF DBI 2113425. BK was supported by a DiGS Fellowship from the College of Science and Technology, Southeastern Louisiana University.
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) Emmy Noether-Program (Award HO 6201/1-1 to S.H.) and by the European Union (ERC, MacDrive, GA 101043187).
Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Original Data Source: The morphological matrix was obtained from TreeBase: TB2:S18555

Barden, Phillip, and David A. Grimaldi. "Adaptive radiation in socially advanced stem-group ants from the Cretaceous." Current Biology 26.4 (2016): 515-521.

Methodological Information

Methods of data generation: see manuscript for details

Description of the file structure

Summary

Total File/Folder Count: 32
Total File Size: 67MB (Uncompressed)

The root of the directory contains several shell scripts that need to be run in sequence, as numbered. The filenames indicate the function of each script (sim - simulate, mcmc - perform MCMC, summarize - obtain tree lengths or Robinson-Foulds distances). The scripts are described below:
- 1_run_empirical.sh - Shell script to perform MCMC using unpartitioned and partitioned models for the empirical data.
- 2a_sim_data_unpartitioned.sh - script that simulates datasets under the unpartitioned model. Calls 2a1_sim_data_inclusive.sh, 2a2_sim_data_max_state.sh and 2a3_sim_data_recoded.sh to simulate datasets. See manuscript for specific details about the simulations.
- 2b1_mcmc_inclusive_SLURM.sh - script to perform MCMC on the complete character sampling replicates.
- 2b2_mcmc_max_state_SLURM.sh - script to perform MCMC on the maximum state missing replicates.
- 2b3_mcmc_recoded_SLURM.sh - script to perform MCMC on the recoded replicates.
- 2c1_summarize_SLURM.sh - script to obtain tree lengths from the unpartitioned simulations.
- 2c2_summarize_SLURM_RF.sh - script to obtain Robinson-Foulds distances from the unpartitioned simulations.
- 3a_sim_data_partitioned.sh - script to simulate datasets under the partitioned model.
- 3b_mcmc_partitioned.sh - script to analyse the simulated datasets using partitioned and unpartitioned models.
- 3c1_summarize_rejection_tl.sh - script to obtain tree lengths from the partitioned simulations.
- 3c2_summarize_rejection_rf.sh - script to obtain Robinson-Foulds distances from the partitioned simulations.
- 4a_sim_data_lba.sh - script to simulate datasets under the unpartitioned model under long-branch attraction conditions.
- 4b_mcmc_lba.sh - script to analyse the simulated long-branch datasets using partitioned and unpartitioned models.
- 4c1_summarize_lba_tl.sh - script to obtain tree lengths from the long-branch attraction simulations.
- 4c2_summarize_lba_rf.sh - script to obtain Robinson-Foulds distances from the long-branch attraction simulations.
- 5a_sim_data_for_missing_SLURM_.sh - script that simulates datasets for missing data simulations. Calls 5a1_sim_data_missing.sh, 5a2_sim_data_remove_chars.sh and 5a3_sim_data_binarify.sh to simulate datasets. See manuscript for specific details about the simulations.
- 5b1_mcmc_missing_binarified_SLURM.sh - script to perform MCMC on the missing data and binarized data simulations.
- 5b2_mcmc_removed_SLURM.sh - script to perform MCMC on the simulation datasets with varying numbers of characters removed.
- 5c_summarizing_missing.sh - script to summarize the output (tree lengths and Robinson-Foulds distances) from the missing data simulations. Uses missing_directories.txt to obtain the summary statistics.
data - This folder contains the long-branch attraction trees used in this study. The original matrix used for simulations can be obtained from TreeBase (TB2:S18555).
output - This folder contains the output files from RevBayes analyses of the empirical matrix under unpartitioned and partitioning by state models.
scripts - This folder contains all the scripts that are used to analyse the data. It contains Rev scripts used for simulations and analyses in RevBayes. It also contains R scripts for summarizing the outputs from RevBayes and plotting them.
- binarify_states.R - R script to change multistate characters (4 states, 10 states or 20 states) to binary.
- get_inclusive_rf.R - R script to obtain Robinson-Foulds distances from unpartitioned simulations.
- get_inclusive_tl.R - R script to obtain tree lengths from unpartitioned simulations.
- get_lba_rf.R - R script to obtain Robinson-Foulds distances from long-branch simulations.
- get_lba_tl.R - R script to obtain tree lengths from long-branch simulations.
- get_missing_rf.R - R script to obtain Robinson-Foulds distances from missing data simulations.
- get_mising_tl.R - R script to obtain tree lengths from missing data simulations.
- get_rejection_rf.R - R script to obtain Robinson-Foulds distances from partitioned simulations.
- get_rejection_tl.R - R script to obtain tree lengths from partitioned simulations.
- mcmc.Rev - RevBayes script to perform MCMC.
- model_ACRV.Rev - RevBayes script for modeling among-character rate variation.
- model_partitioned.Rev - RevBayes script for the automatic partitioned model.
- model_partition_true.Rev - RevBayes script for the true partitioned model.
- model_tree.Rev - RevBayes script - model file for phylogeny.
- model_unpart.Rev - RevBayes script for the unpartitioned model.
- plot_inclusive_rf.R - R script to plot Robinson-Foulds distances from the unpartitioned simulations.
- plot_inclusive_tl.R - R script to plot tree lengths from the unpartitioned simulations.
- plot_lba_rf.R - R script to plot Robinson-Foulds distances from the long-branch attraction simulations.
- plot_missing_rf.R - R script to plot Robinson-Foulds distances from the missing data simulations.
- plot_missing_tl.R - R script to plot tree lengths from the missing data simulations.
- plot_partitioned_rf.R - R script to plot Robinson-Foulds distances from the partitioned simulations.
- plot_partitioned_tl.R - R script to plot tree lengths from the partitioned simulations.
- recode_states.R - R script to relabel the states for each character so that the smallest state labels are always used.
- remove_states.R - R script to remove a certain number of characters from the data matrix.
- sim_inclusive.Rev - RevBayes script to simulate under the unpartitioned model.
- sim_lba.Rev - RevBayes script to simulate for long-branch attraction conditions under unpartitioned model.
- sim_partitioned.Rev - RevBayes script to simulate under the partitioned model.
- sim_partitioned_rejection.Rev - RevBayes script to simulate under the partitioned model such that the largest state space is always used for all characters.
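
The relabelling described for recode_states.R can be sketched as follows. This is an illustration in Python, not the R script shipped in the archive. It assumes that states are single symbols, that "?" and "-" mark missing or inapplicable entries, and that observed labels are mapped in sorted order; the function names are hypothetical.

```python
MISSING = {"?", "-"}  # assumed symbols for missing and inapplicable entries


def recode_character(states):
    """Relabel one character so its observed states become 0, 1, 2, ...

    Missing entries are left untouched. Observed labels are mapped in sorted
    order, which is an assumption of this sketch.
    """
    observed = sorted({s for s in states if s not in MISSING})
    relabel = {old: str(new) for new, old in enumerate(observed)}
    return [s if s in MISSING else relabel[s] for s in states]


def recode_matrix(matrix):
    """Recode every character (column) of a taxon -> list-of-states mapping."""
    taxa = list(matrix)
    # Transpose rows (taxa) into columns (characters), recode, transpose back.
    columns = zip(*(matrix[taxon] for taxon in taxa))
    recoded_columns = [recode_character(list(column)) for column in columns]
    recoded_rows = zip(*recoded_columns)
    return {taxon: list(row) for taxon, row in zip(taxa, recoded_rows)}


if __name__ == "__main__":
    # Made-up matrix: the first character uses states 0 and 2,
    # the second uses states 1 and 3.
    example = {"A": ["0", "3"], "B": ["2", "3"], "C": ["?", "1"]}
    print(recode_matrix(example))
```

After recoding, a character observed only in states 0 and 2 is treated as a two-state character with labels 0 and 1.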
stats_rf - This directory contains csv files with the scaled symmetric difference across all simulation replicates.
Each csv file contains the columns "Partitioned" and "Unpartitioned". The column "Partitioned" holds the scaled symmetric difference for trees estimated under the partitioned-by-state model, and the column "Unpartitioned" holds the scaled symmetric difference for trees estimated under the unpartitioned model.
For files with partitioned in their filenames, an additional column "True Partitioned" is present, which contains the scaled symmetric difference for trees estimated under the true model that the datasets were simulated under.
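
For readers unfamiliar with the statistic, the sketch below shows one common way to compute a scaled symmetric (Robinson-Foulds) difference between two trees. It is not the code in the get_*_rf.R scripts. It assumes trees are written as nested tuples of taxon names and that the raw difference is divided by 2(n - 3), the maximum for two fully resolved unrooted trees on n taxa; the scaling used for the csv files is described in the manuscript and scripts, not here.

```python
def splits(tree, all_taxa):
    """Return the non-trivial bipartitions of a tree given as nested tuples."""
    reference = min(all_taxa)  # fixed taxon used to pick one side of each split
    found = set()

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(collect(child) for child in node))
        other = all_taxa - clade
        if len(clade) > 1 and len(other) > 1:
            # Store the side that omits the reference taxon so that the same
            # split always has the same representation.
            found.add(other if reference in clade else clade)
        return clade

    collect(tree)
    return found


def scaled_symmetric_difference(tree_a, tree_b, taxa):
    """Symmetric difference of the two split sets divided by 2 * (n - 3)."""
    taxa = frozenset(taxa)
    if len(taxa) < 4:
        raise ValueError("at least four taxa are needed")
    splits_a, splits_b = splits(tree_a, taxa), splits(tree_b, taxa)
    return len(splits_a ^ splits_b) / (2 * (len(taxa) - 3))


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    tree_1 = (("A", "B"), "C", ("D", "E"))
    tree_2 = (("A", "C"), "B", ("D", "E"))
    print(scaled_symmetric_difference(tree_1, tree_2, taxa))
```

A value of 0 means the two trees share every split, and a value of 1 means that two fully resolved trees share none.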
- output_sim_data_<4,10,20>_states_100.csv - These csv files contain the scaled symmetric difference from trees obtained from 4 state, 10 state or 20 state matrices.
  If the filename contains binarified, then the matrices were changed to a binary matrix.
  If the filename contains 5_chars_removed, then 5 characters were removed from the original matrix; and if it contains 8_chars_removed, then 8 characters were removed from the original matrix.
- output_sim_data_inclusive_<max_state>_state_<42,100>.csv - These csv files contain the scaled symmetric difference from trees obtained from simulated datasets under the unpartitioned model.
  The max_state values used in this study were 2, 3, 4 and 5.
  42 means that the dataset contained 42 characters and 100 means that the dataset contained 100 characters.
  If the filename contains missing, then the maximum state was removed from the original dataset before analysing it under the unpartitioned and partitioned models.
  If the filename contains recoded, then the characters in the original dataset were relabelled such that the smallest state labels are always used.
- output_sim_data_lba_<max_state>_state_<long_branch_length>_branchlength.csv - These csv files contain the scaled symmetric difference from the trees obtained from simulated datasets under the long-branch attraction conditions.
  The max_state values used in this study were 2, 3, 4 and 5.
  The long_branch_length values used in this study were 0.5 and 1.
- output_sim_data_partitioned_<max_state>_bin_<50,75>.csv - These csv files contain the scaled symmetric difference from the trees obtained from simulated datasets under the partitioned model.
  The max_state values used in this study were 3, 4 and 5.
  The number 50 indicates that 50% of the characters in the matrix are binary and the number 75 indicates that 75% of the characters in the matrix are binary.
  If the filename contains rejection, the datasets were simulated with rejection sampling such that all the characters have the maximum possible state space.
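
The naming scheme above can be decoded programmatically. The following Python sketch is an assumption-laden illustration: the regular expressions follow the patterns as written in this README, the function name `describe` is hypothetical, and the example filenames in the demo are constructed from those patterns rather than copied from the archive. The exact position of modifiers such as missing or rejection within a filename is not specified here, so they are detected by substring.

```python
import re

# Patterns follow the filename templates given in this README (assumed exact).
# Order matters: the lba pattern must be tried before the unpartitioned one.
PATTERNS = [
    ("missing_data", re.compile(r"(?P<n_states>\d+)_states_(?P<n_characters>\d+)")),
    ("lba", re.compile(
        r"(?P<max_state>\d+)_state_(?P<long_branch_length>\d+(?:\.\d+)?)_branchlength")),
    ("unpartitioned", re.compile(r"(?P<max_state>\d+)_state_(?P<n_characters>\d+)")),
    ("partitioned", re.compile(r"(?P<max_state>\d+)_bin_(?P<percent_binary>\d+)")),
]

MODIFIERS = ("binarified", "5_chars_removed", "8_chars_removed",
             "missing", "recoded", "rejection")


def describe(filename):
    """Return the simulation settings encoded in a stats csv filename."""
    info = {"kind": None, "modifiers": [m for m in MODIFIERS if m in filename]}
    for kind, pattern in PATTERNS:
        match = pattern.search(filename)
        if match:
            info["kind"] = kind
            info.update(match.groupdict())  # values are kept as strings
            break
    return info


if __name__ == "__main__":
    # Hypothetical filenames built from the templates above.
    for name in ("output_sim_data_inclusive_3_state_42.csv",
                 "output_sim_data_lba_4_state_0.5_branchlength.csv",
                 "output_sim_data_partitioned_5_bin_75_rejection.csv"):
        print(name, describe(name))
```

If the real filenames deviate from the templates, `describe` returns `kind` as `None` instead of guessing.
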
stats_tl - This directory contains csv files with tree lengths across all simulation replicates. The file naming scheme is the same as for the stats_rf folder. Each csv file contains the columns "Partitioned" and "Unpartitioned". The column "Partitioned" holds the tree lengths of trees estimated under the partitioned-by-state model, and the column "Unpartitioned" holds the tree lengths of trees estimated under the unpartitioned model.
For files with partitioned in their filenames, an additional column "True Partitioned" is present, which contains the tree lengths of trees estimated under the model that the datasets were simulated under.
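
A minimal Python sketch for summarizing one of these csv files is given below. It assumes the column headers are spelled exactly as stated above and that empty or "NA" cells mark missing values; the function name `column_means` and the numbers in the demo are made up for illustration.

```python
import csv
import io
import statistics

EXPECTED = ("Partitioned", "Unpartitioned", "True Partitioned")


def column_means(handle):
    """Return the mean of each expected column present in the csv file."""
    reader = csv.DictReader(handle)
    present = [name for name in EXPECTED if name in (reader.fieldnames or [])]
    values = {name: [] for name in present}
    for row in reader:
        for name in present:
            cell = row[name]
            # Skip empty or NA cells (assumed markers for missing values).
            if cell not in (None, "", "NA"):
                values[name].append(float(cell))
    return {name: statistics.fmean(col) for name, col in values.items() if col}


if __name__ == "__main__":
    # Made-up numbers standing in for one stats_tl or stats_rf file.
    demo = io.StringIO("Partitioned,Unpartitioned\n1.9,1.4\n2.1,1.6\n")
    print(column_means(demo))
```

To apply it to a real file, open the file with `open(path, newline="")` and pass the handle to `column_means`. Columns other than the three named above, such as a row-index column, are ignored.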

Software/Code

For all analyses, we used RevBayes v1.2.2 (compiled from the development branch).
The R scripts were designed for R version 4.3.2.

END OF README