MAST: Phylogenetic inference with mixtures across sites and trees (revised)
Data files
Jul 15, 2024 version files: 5.44 GB
- Empirical_experiments.tgz (14.04 MB)
- README.md (5.84 KB)
- Simulation_experiments.tgz (5.43 GB)
- supplement_programs.tgz (9.79 KB)
Feb 25, 2025 version files: 5.44 GB
- Empirical_experiments.tgz (14.04 MB)
- README.md (5.94 KB)
- Simulation_experiments.tgz (5.43 GB)
- supplement_programs.tgz (9.79 KB)
Abstract
Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting, introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call MAST. This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of incomplete lineage sorting in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of four Platyrrhine species for which standard concatenated maximum likelihood and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e. the largest proportion of sites) to the tree also supported by gene tree approaches. 
These results suggest that the MAST model is able to analyse a concatenated alignment using maximum likelihood while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
https://doi.org/10.5061/dryad.51c59zwfx
Having implemented the MAST model in IQ-TREE, we used simulated data to test the performance of the MAST model under a wide range of scenarios. The first and second simulation experiments test the accuracy of the unlinked and linked MAST models when the true model is specified. The third simulation experiment simulates data with varying levels of introgression to compare the performance of standard (i.e. single-tree) concatenation methods to the performance of the MAST model. The fourth simulation experiment examines the performance of the MAST model when an incorrect model is specified, by applying an unlinked MAST model with different numbers of trees to an alignment simulated under a single tree.
Description of the files/folders
supplementary.pdf : a document containing supplementary material, including figures and tables.
Empirical_experiments.tgz : a compressed file containing the data sets, results, input trees, and scripts for all the empirical experiments. The scripts include those used for data generation and the commands for running IQ-TREE with the MAST module.
Simulation_experiments : a folder containing the information for all simulation experiments, such as data sets, results, model parameters, and input trees.
supplement_programs.tgz : a compressed file containing some programs used for the experiments. It does not include the software: IQTREE2, ms, and uniqueTree.
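The archives listed above are gzip-compressed tar files, so they can be unpacked with standard `tar`. The following is a minimal sketch; `extract_archive` is a hypothetical helper name, not something shipped with the dataset, and the destination directory names are only examples.

```shell
# Hypothetical helper (not part of the dataset): unpack a .tgz archive
# into a destination directory, creating the directory if needed.
extract_archive() {
    archive="$1"
    dest="${2:-.}"          # default: current directory
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest"
}

# Example usage, assuming the archives sit in the current directory:
#   extract_archive Empirical_experiments.tgz empirical
#   extract_archive supplement_programs.tgz programs
```

Simulation_experiments.tgz is 5.43 GB compressed, so it is worth checking free disk space before extracting it.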
Empirical_experiments
This folder contains the data and scripts for the empirical experiments. Each subfolder contains its own Readme.txt file.
Inside folder “SetA”, which is for the experiment on empirical dataset A:
commands.sh - contains the commands to run in this experiment
setA.fa - Concatenated alignment set A (generated from the scripts inside the folder “obtainDataSets”)
setA.input.trees.txt - The input topologies for the MAST model
Inside the folder “Results”:
setA.singletree.iqtree - The IQ-TREE output under the standard single-tree model
setA.submodelx.iqtree - The IQ-TREE output under MAST submodel x
Inside folder “SetB”, which is for the experiment on empirical dataset B:
commands.sh - contains the commands to run in this experiment
setB.fa - Concatenated alignment set B (generated from the scripts inside the folder “obtainDataSets”)
setB.input.trees.txt - The input topologies for the MAST model
Inside the folder “Results”:
setB.singletree.iqtree - The IQ-TREE output under the standard single-tree model
setB.submodelx.iqtree - The IQ-TREE output under MAST submodel x
Inside folder “SetC”, which is for the experiment on empirical dataset C:
commands.sh - contains the commands to run in this experiment
setC.fa - Concatenated alignment set C (generated from the scripts inside the folder “obtainDataSets”)
setC.input.trees.txt - The input topologies for the MAST model
Inside the folder “Results”:
setC.singletree.iqtree - The IQ-TREE output under the standard single-tree model
setC.submodelx.iqtree - The IQ-TREE output under MAST submodel x
Inside folder “SetD”, which is for the experiment on empirical dataset D:
commands.sh - contains the commands to run in this experiment
setD.fa - Concatenated alignment set D (generated from the scripts inside the folder “obtainDataSets”)
setD.input.trees.txt - The input topologies for the MAST model
Inside the folder “Results”:
setD.singletree.iqtree - The IQ-TREE output under the standard single-tree model
setD.submodelx.iqtree - The IQ-TREE output under MAST submodel x
Inside folder “obtainDataSets”, which contains the scripts for obtaining/generating each of the empirical datasets:
run.sh - the script that generates the data sets setA.fa, setB.fa, setC.fa, and setD.fa for the empirical experiments
get.gene.tree.stat.sh - the script that collects information on each gene, such as the gene tree, the number of variable sites, and the number of parsimony-informative sites.
The output files are: set_A_summary.txt , set_B_summary.txt , set_C_summary.txt , set_D_summary.txt
Simulation_experiments
This folder contains several folders holding the data and results for the simulation experiments, one folder per experiment. Inside each folder, there are subfolders as follows:
1. Subfolder “data”: Contains the simulated datasets
2. Subfolder “models”: Contains the simulated model parameters
3. Subfolder “results”: Contains the result files from the analysis.
4. Subfolder “trees”: Contains the simulated tree files
Folder: simulation1_5K_10K_50K
- Contains the data and results for simulation experiment 1 with sequence lengths 5K, 10K and 50K.
Folder: simulation1_100K
- Contains the data and results for simulation experiment 1 with sequence length 100K.
Folder: simulation2
- Contains the data and results for simulation experiment 2.
Folder: simulation3
- Contains the data and results for simulation experiment 3.
Folder: simulation4
- Contains the data and results for simulation experiment 4.
Folder: simulation5
- Contains the data and results for simulation experiment 5.
Folder: simulation6
- Contains the data and results for simulation experiment 6.
Folder: simulation7
- Contains the data and results for simulation experiment 7.
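As a quick sanity check after extraction, the layout described above (eight simulation folders, each with “data”, “models”, “results”, and “trees” subfolders) can be verified with a short shell function. This is only a sketch: `check_layout` is a hypothetical helper, and the root path passed to it is an assumption about where the archive was unpacked.

```shell
# Hypothetical helper (not part of the dataset): report any expected
# subfolder that is missing under the given root directory.
check_layout() {
    root="$1"
    status=0
    for sim in simulation1_5K_10K_50K simulation1_100K simulation2 \
               simulation3 simulation4 simulation5 simulation6 simulation7; do
        for sub in data models results trees; do
            if [ ! -d "$root/$sim/$sub" ]; then
                echo "missing: $sim/$sub"
                status=1
            fi
        done
    done
    # Returns 0 when every expected subfolder exists, 1 otherwise.
    return "$status"
}

# Example usage, assuming the archive was unpacked into ./Simulation_experiments:
#   check_layout Simulation_experiments
```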
supplement_programs
This folder contains some programs used for the experiments. It does not include the software: IQTREE2, ms, and uniqueTree.
To compile the source files, please type:
$ make
Three binary programs will appear:
- rm_edge_len_fr_tree : to remove the edge lengths from the tree
- randomize_file : to randomize the lines inside a file
- sprtree : to report the trees at a specified SPR distance from the input tree
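To illustrate what removing edge lengths from a tree means, here is a minimal sketch using `sed` on a Newick string. It is not the authors' `rm_edge_len_fr_tree` program, whose options and exact behaviour are not described here; `strip_branch_lengths` is a hypothetical name, and the sketch assumes unquoted taxon labels that contain no colons.

```shell
# Hypothetical stand-in (not the authors' program): delete every
# ":<number>" branch-length annotation from Newick text read on stdin.
# Assumes taxon labels are unquoted and contain no ':' characters.
strip_branch_lengths() {
    sed -E 's/:[0-9.eE+-]+//g'
}

echo '((A:0.1,B:0.2):0.05,C:0.3);' | strip_branch_lengths
# prints ((A,B),C);
```

Only the topology remains, which is the form needed when trees are compared or supplied as fixed input topologies.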