Fast mvSLOUCH: Model comparison for multivariate Ornstein--Uhlenbeck-based models of trait evolution on large phylogenies
Cite this dataset
Bartoszek, Krzysztof et al. (2023). Fast mvSLOUCH: Model comparison for multivariate Ornstein--Uhlenbeck-based models of trait evolution on large phylogenies [Dataset]. Dryad. https://doi.org/10.5061/dryad.sj3tx9656
These are the Supplementary Material, R scripts and numerical results accompanying Bartoszek, Fuentes Gonzalez, Mitov, Pienaar, Piwczyński, Puchałka, Spalik and Voje "Model Selection Performance in Phylogenetic Comparative Methods under multivariate Ornstein–Uhlenbeck Models of Trait Evolution".
The four data files concern two datasets. Ungulates: measurements of muzzle width, unworn lower third molar crown height, unworn lower third molar crown width and feeding style and their phylogeny; Ferula: measurements of ratio of canals, periderm thickness, wing area, wing thickness, and fruit mass, and their phylogeny.
The compiled ungulate dataset involves two key components, phenotypic data (Data.csv) and a phylogenetic tree (Tree.tre), which consist of the following (full references for the citations presented below are provided in the paper linked to this repository, which also provides further details on the compiled dataset):
The phenotypic data includes three continuous variables and one categorical variable. The continuous variables (MZW: muzzle width; HM3: unworn lower third molar crown height; WM3: unworn lower third molar crown width), measured in cm, come from Mendoza et al. (2002; J. Zool.). The categorical variable (FS, i.e. feeding style: B=browsers, G=grazers, M=mixed feeders) is based on Pérez–Barbería and Gordon (2001; Proc. R. Soc. B: Biol. Sci.). Taxonomic mismatches between these two sources were resolved based on Wilson and Reeder (2005; Johns Hopkins University Press). Only taxa with full entries for all these variables were included (i.e. no missing data allowed).
The phylogenetic tree is pruned from the unsmoothed mammalian timetree of Hedges et al. (2015; MBE) to only include the 104 ungulate species for which there is complete phenotypic data available. Wilson and Reeder (2005; Johns Hopkins University Press) was used again to resolve taxonomic mismatches with the phenotypic data. The branch lengths of the tree are scaled to unit height and thus informative of relative time.
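The complete-case rule and the matching of data rows to tree tips described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy data frame stands in for Data.csv, all values and species names are invented, and the `species` column and the `tip_labels` vector are hypothetical names that may not correspond to the layout of the deposited files.

```r
# Toy stand-in for Data.csv; every value and name below is invented.
pheno <- data.frame(
  species = c("sp_a", "sp_b", "sp_c", "sp_d"),  # hypothetical column name
  MZW = c(3.1, NA, 4.2, 2.8),
  HM3 = c(1.9, 2.4, 3.0, 1.5),
  WM3 = c(0.9, 1.1, 1.3, 0.8),
  FS  = c("B", "G", "M", NA),
  stringsAsFactors = FALSE
)

# Keep only taxa with full entries for all variables (no missing data).
pheno_complete <- pheno[complete.cases(pheno), ]

# Hypothetical tip labels of a tree already pruned to the same taxa.
tip_labels <- c("sp_c", "sp_a")

# Reorder the rows to follow the tip labels and check that both sets agree.
pheno_matched <- pheno_complete[match(tip_labels, pheno_complete$species), ]
stopifnot(setequal(pheno_matched$species, tip_labels))

# Store feeding style as a categorical variable with the three documented levels.
pheno_matched$FS <- factor(pheno_matched$FS, levels = c("B", "G", "M"))
```

Ordering the rows by tip label is a defensive choice here: it makes any mismatch between the data and the tree fail early instead of surfacing later in an analysis.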
1) The Ferula phenotypic data are divided into two data sets for 75 species of Ferula and three species of Leutea: the first (no_ME) contains five continuous variables measured on mericarps (the dispersal unit of fruit in Apiaceae), and the second (ME) contains the same variables together with measurement error (see paper for computational details). Three continuous variables were measured on anatomical cross sections (ratio_canals_ln – the proportion of oil ducts covering the space between median and lateral ribs [dimensionless], mean_gr_peri_ln_um – periderm (fruit wall) thickness [μm], thick_wings_ln_um – wing thickness [μm]); the remaining two on whole mericarps (Wings_area_ln_mm – wing area [mm²], Seed_mass_ln_mg – seed mass [mg]).
2) The phylogenetic tree was pruned from the tree obtained from the recent taxonomic revision of the genus (Panahi et al. 2018) to only include the 78 species for which the phenotypic data were generated. This tree and the associated alignment, composed of one nuclear and three plastid markers (Panahi et al. 2018), served as input to the mcmctree software (Yang 2007) to obtain a dated tree using a secondary calibration point for the root based on Banasiak et al.’s (2013) work. The branch lengths of the final tree (Ferula_fruits_tree.txt) were scaled to unit height and are thus informative of relative time.
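Pruning a tree to the species with data and rescaling it to unit height, as described for both datasets, might look like the sketch below. It assumes the ape package is installed and uses a made-up four-tip tree, not one of the deposited trees; it is not the procedure the authors used (the Ferula tree, for instance, was dated with mcmctree beforehand).

```r
library(ape)  # assumption: ape is installed

# Toy ultrametric tree with invented tip names and branch lengths.
tree <- read.tree(text = "((A:2,B:2):3,(C:4,D:4):1);")

# Hypothetical set of species for which phenotypic data exist.
species_with_data <- c("A", "B", "C")

# Drop every tip that has no phenotypic data.
tree <- drop.tip(tree, setdiff(tree$tip.label, species_with_data))

# Tree height is the largest root-to-node distance.
height <- max(node.depth.edgelength(tree))

# Divide all branch lengths by the height so the tree has unit height.
tree$edge.length <- tree$edge.length / height
```

After rescaling, branch lengths carry relative rather than absolute time, so rate-like parameters estimated on such a tree are expressed per unit of tree height.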
The R setup for the manuscript was as follows:
R version 3.6.1 (2019-09-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: openSUSE Leap 42.3
The exact output can depend on the random seed. However, the scripts have the option of rerunning the analyses exactly as they were run for the manuscript, i.e. the random seeds that were used to generate the results are saved, included and can be read in.
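The general idea of saving a random number generator state and reading it back in can be illustrated with base R. This is a generic sketch, not the mechanism used in the repository's scripts: the seed value, the temporary file and all object names are hypothetical.

```r
# Fix the generator and take a snapshot of its full state.
set.seed(12345)               # hypothetical seed value
saved_state <- .Random.seed

# Store the state on disk (a temporary file stands in for a seed file).
seed_file <- tempfile(fileext = ".rds")
saveRDS(saved_state, seed_file)

# First run: draw some random numbers.
first_run <- rnorm(3)

# Later: read the stored state back in and restore it in the global environment.
assign(".Random.seed", readRDS(seed_file), envir = globalenv())

# Second run: the same draws are reproduced.
second_run <- rnorm(3)
identical(first_run, second_run)
```

Restoring the stored state only reproduces the draws if the same R version and random number generator settings are in use.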
The code is divided into several directories with scripts, random seeds and result files.
Directory contains the script test_rotation_invariance_mvSLOUCH.R that demonstrates that mvSLOUCH's likelihood calculations are rotation invariant.
Directory contains files connected to the Carnivorans' vignette in mvSLOUCH.
Full output of running the R code in the vignette. mvSLOUCH itself ships only a bare-minimum subset of this file, sufficient to build the vignette.
Reduced objects from Carnivora_mvSLOUCH_objects_Full.RData that are included with mvSLOUCH's vignette.
R script to reduce Carnivora_mvSLOUCH_objects_Full.RData to Carnivora_mvSLOUCH_objects.RData .
The vignette itself.
Bib file for the vignette.
2.6) ScaledTree.png, ScaledTree2.png, ScaledTree3.png, ScaledTree4.png
Plots of phylogenetic trees for vignette.
Directory contains all the output of the simulation study presented in the manuscript, scripts that allow for its replication (the random number generator seeds are also provided) or for running one's own simulation study, and scripts to generate the graphs and the model comparison summary. This study was done using version 2.6.2 of mvSLOUCH. If one reruns it using mvSLOUCH >= 2.7, then one will obtain different (corrected) values of R2 and an additional R2 version.
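Before comparing rerun results with the archived ones, it may help to check which side of the 2.7 boundary a given mvSLOUCH version falls on. The sketch below uses only base R; the helper `r2_matches_study` is a hypothetical name introduced for illustration, and the final check runs only if mvSLOUCH happens to be installed.

```r
# Versions taken from the text above.
study_version  <- package_version("2.6.2")
r2_change_from <- package_version("2.7")

# Hypothetical helper: TRUE if R2 is computed as in the archived study.
r2_matches_study <- function(version) {
  as.package_version(version) < r2_change_from
}

r2_matches_study(study_version)  # TRUE
r2_matches_study("2.7.1")        # FALSE

# Optional check of the locally installed package, if there is one.
if (requireNamespace("mvSLOUCH", quietly = TRUE)) {
  installed <- utils::packageVersion("mvSLOUCH")
  if (!r2_matches_study(installed)) {
    message("mvSLOUCH ", installed,
            ": expect R2 values that differ from the archived ones.")
  }
}
```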
Directory contains files connected to the "Feeding styles and oral morphology in ungulates" analyses performed for the manuscript.
The phenotypic data includes three continuous variables and one categorical variable. Continuous variables (MZW: muzzle width; HM3: unworn lower
third molar crown height; WM3: unworn lower third molar crown width) from Mendoza et al. (2002), measured in cm. Categorical variable (FS, i.e.
feeding style: B=browsers, G=grazers, M=mixed feeders) based on Pérez–Barbería and Gordon (2001). Phylogeny pruned from Hedges et al. (2015).
Taxonomic mismatches among these sources were resolved based on Wilson and Reeder (2005).
Hedges, S. B., J. Marin, M. Suleski, M. Paymer, and S. Kumar. 2015. Tree of life reveals clock-like speciation and diversification.
Molecular Biology and Evolution 32:835-845.
Mendoza, M., C. M. Janis, and P. Palmqvist. 2002. Characterizing complex craniodental patterns related to feeding behaviour in ungulates:
a multivariate approach. Journal of Zoology 258:223-246.
Pérez–Barbería, F. J., and I. J. Gordon. 2001. Relationships between oral morphology and feeding style in the Ungulata: a phylogenetically
controlled evaluation. Proceedings of the Royal Society of London. Series B: Biological Sciences 268:1023-1032.
Wilson, D. E., and D. M. Reeder. 2005. Mammal species of the world: A taxonomic and geographic reference.
Johns Hopkins University Press, Baltimore, Maryland.
Ungulates' phylogeny, extracted from the mammalian phylogeny of
Hedges, S. B., J. Marin, M. Suleski, M. Paymer, and S. Kumar. 2015. Tree of life reveals clock–like speciation and diversification. Mol. Biol. Evol. 32:835–845.
4.3) OUB.R, OUF.R, OUG.R
R scripts for the analyses performed in the manuscript. Different files correspond to different regime setups of the feeding style variable.
4.4) OU1.txt, OUB.txt, OUF.txt, OUG.txt
Outputs of the model comparison conducted under the R scripts presented above (4.3). Different files correspond to different regime setups of the feeding style variable.
5) Ferula analyses
The models_ME directory contains the input and output files from the mvSLOUCH analyses of the Ferula data with measurement error included, while the models_no_ME directory contains those from the analyses of the data without measurement error. In each directory, one can find the following files:
- input files: Data_ME.csv (with measurement error) or Data_no_ME.csv (without measurement error) and the tree file in Newick format (Ferula_fruits_tree.txt); the trait names in the data files are abbreviated as follows: ration_canals – the proportion of oil ducts covering the space between median and lateral ribs, mean_gr_peri – periderm thickness, wings_area – wing area, thick_wings – wing thickness and seed_mass – seed mass,
- the results for the 8 analyzed models (see Fig. 2 in the main text), each in a separate directory named model1, model2 and so on,
- each model directory comprises the following files: two R scripts (for analyses with a diagonal and with an upper triangular matrix Σyy; each model was run 1000 times), two csv files containing information such as the repetition number (i), the seed for the preliminary analyses generating the starting point (seed_start_point), the seed for the main analyses (seed) and AIC, AICc, SIC, BIC, R2 and loglik for each model run (these csv files are sorted according to AICc values), two directories containing the results of the 1000 analyses, and two files extracted from these directories showing the parameter estimates for the best models (with UpperTri and Diagonal matrix Σyy).
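The per-run summary files described above are sorted by AICc, so the best run is the first row. A minimal base R sketch of that ordering is given below. The column names follow the description above, but the data frame is a made-up stand-in with invented values (and only a subset of the columns), and `delta_AICc` is an extra column added purely for illustration.

```r
# Toy stand-in for one summary csv file; all numbers are invented.
runs <- data.frame(
  i = 1:4,
  seed_start_point = c(11L, 22L, 33L, 44L),
  seed = c(101L, 202L, 303L, 404L),
  AICc = c(512.3, 498.7, 505.1, 499.2),
  loglik = c(-240.1, -233.4, -236.5, -233.6)
)

# Sort the runs by increasing AICc; the best run comes first.
runs_sorted <- runs[order(runs$AICc), ]
best_run <- runs_sorted[1, ]

# Difference of each run from the best one (illustrative extra column).
runs_sorted$delta_AICc <- runs_sorted$AICc - best_run$AICc

# The stored seeds identify which run would have to be repeated.
best_run[, c("i", "seed_start_point", "seed")]
```

Keeping both seeds next to the information criteria means that any single run, not only the best one, can be traced back to the random state that produced it.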
any text file editor and R
Vetenskapsrådet, Award: 2017–04951
NSF CAREER, Award: 2225683
Polish National Science Centre, Award: 2015/18/E/NZ8/00716
ERC–2020–STG, Award: 948465
ELLIIT, Award: ELLIIT Call C
Stiftelsen for Vetenskaplig Forskning och Utbildning i Matematik