Comparing partition and mixture models with Akaike information criteria

Susko, Edward; Lanfear, Robert; Roger, Andrew J.

Published Jan 29, 2026 on Dryad. https://doi.org/10.5061/dryad.3xsj3txrb

Data files

Jan 29, 2026 version files (27.84 MB)

README.md (3.40 KB)

simulated-data-sets.zip (27.84 MB)

Abstract

Sophisticated phylogenetic models often include mixture and/or partition model components. It was recently noted that information criteria tend to favour partition models over mixture models even in some cases where the latter are misspecified and give poor topological estimation. We show that this problem arises because partition models and mixture models fundamentally differ in their probability calculations: mixture models calculate site-wise likelihoods as the marginal probability of the data averaging over parameter vectors that might have arisen at a site whereas partition model site likelihoods are calculated as the probability of the site pattern conditional upon a fixed assigned parameter vector at that site. These differing probability calculations lead to AIC estimates that are not comparable. We explore three generally applicable ways of correcting the issue.


Description of the data and file structure

We have submitted simulated-data-sets.zip, which contains a directory, simulated-data-sets, holding the simulated data sets from the paper. That directory has subdirectories named 0, 5, ..., 50, indicating the percentage of misclassified sites. Each of these has further subdirectories 1, 2, ..., 100, one for each of the 100 simulated data sets per setting. Each of these subdirectories has three files, part1.seqfile, part2.seqfile and concat.seqfile, giving the sequence data for the first partition, the second partition and the concatenated data.
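The directory layout described above can be walked programmatically. The following is a minimal sketch, assuming the archive has been extracted so that a directory named simulated-data-sets is reachable from the working directory; the function name dataset_paths and the constant FILES are illustrative and not part of the data set.

```python
from pathlib import Path

# The three files expected in every replicate directory.
FILES = ("part1.seqfile", "part2.seqfile", "concat.seqfile")

def dataset_paths(root, percentages=range(0, 55, 5), replicates=range(1, 101)):
    """Yield (percentage, replicate, directory) for every simulated data set.

    Assumes the layout <root>/<percentage>/<replicate>/ with percentages
    0, 5, ..., 50 and replicates 1, ..., 100.
    """
    root = Path(root)
    for p in percentages:
        for r in replicates:
            yield p, r, root / str(p) / str(r)
```

For example, `for p, r, d in dataset_paths("simulated-data-sets")` visits 11 settings times 100 replicates, and `d / "concat.seqfile"` is the concatenated alignment for that replicate.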

Data were simulated under the Jukes-Cantor substitution model on an unrooted four-taxon tree with taxa labeled 0, 1, 2 and 3 and split 01|23. Two separate partitions were simulated, giving part1.seqfile and part2.seqfile; both used the tree with split 01|23. For the first partition, the middle edge had length 0.05, the terminal edges for taxa 0 and 2 had length 0.75, and the terminal edges for taxa 1 and 3 had length 0.05. For the second partition the roles were reversed: the middle edge had length 0.05, the terminal edges for taxa 0 and 2 had length 0.05, and the terminal edges for taxa 1 and 3 had length 0.75.
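To make the simulation setting concrete, here is a minimal sketch of how data of this kind could be generated. It is not the authors' simulation code. It assumes that edge lengths are in expected substitutions per site and that the number of sites is chosen by the user (the description above does not state it); the names evolve, simulate_partition, PART1_EDGES and PART2_EDGES are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def evolve(state, t, rng):
    """Evolve one nucleotide along an edge of length t under Jukes-Cantor."""
    # With probability exp(-4t/3) the state is kept; otherwise it is redrawn
    # uniformly from the four nucleotides. This reproduces the JC transition
    # probabilities P(same) = 1/4 + 3/4 exp(-4t/3).
    if rng.random() < math.exp(-4.0 * t / 3.0):
        return state
    return rng.choice(NUCLEOTIDES)

def simulate_partition(n_sites, terminal, middle=0.05, seed=None):
    """Simulate sequences for taxa 0..3 on the tree with split 01|23.

    terminal maps each taxon label to its terminal edge length.
    n_sites is a user choice (an assumption, not taken from the data set).
    """
    rng = random.Random(seed)
    seqs = {taxon: [] for taxon in range(4)}
    for _ in range(n_sites):
        left = rng.choice(NUCLEOTIDES)     # internal node joining taxa 0 and 1
        right = evolve(left, middle, rng)  # internal node joining taxa 2 and 3
        for taxon, parent in ((0, left), (1, left), (2, right), (3, right)):
            seqs[taxon].append(evolve(parent, terminal[taxon], rng))
    return {taxon: "".join(s) for taxon, s in seqs.items()}

# Terminal edge lengths for the two partitions, as described in the text.
PART1_EDGES = {0: 0.75, 2: 0.75, 1: 0.05, 3: 0.05}
PART2_EDGES = {0: 0.05, 2: 0.05, 1: 0.75, 3: 0.75}
```

Because Jukes-Cantor is time-reversible with a uniform stationary distribution, starting the process at one internal node rather than another does not change the distribution of site patterns.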

Files and variables

File: simulated-data-sets.zip

Description: The results of the base simulation are in subdirectory 0. It has subdirectories 1, 2, ..., 100, each of which gives results for a single independent simulation. The files for a single independent simulation are part1.seqfile, part2.seqfile and concat.seqfile. They are PHYLIP format sequence files of the form

number_of_taxa number_of_sites

0 Nucleotide sequence data for Taxon 0

1 Nucleotide sequence data for Taxon 1

2 Nucleotide sequence data for Taxon 2

3 Nucleotide sequence data for Taxon 3

part1.seqfile and part2.seqfile give the data for partitions 1 and 2. concat.seqfile gives the data after concatenating part1.seqfile and part2.seqfile.
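A file in the layout above can be read with a few lines of code. The sketch below assumes that each taxon occupies a single line, with the label and the sequence separated by whitespace, and that blank lines can be ignored; the function names parse_seqfile and concatenate are illustrative. The concatenation order (partition 1 followed by partition 2) is also an assumption.

```python
def parse_seqfile(text):
    """Parse PHYLIP-style text into a dict mapping taxon label -> sequence."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Header line: number_of_taxa number_of_sites
    n_taxa, n_sites = (int(x) for x in lines[0].split()[:2])
    seqs = {}
    for ln in lines[1:1 + n_taxa]:
        name, *chunks = ln.split()
        seqs[name] = "".join(chunks)  # tolerate spaces inside the sequence
    if len(seqs) != n_taxa or any(len(s) != n_sites for s in seqs.values()):
        raise ValueError("header does not match the sequence data")
    return seqs

def concatenate(part1, part2):
    """Join two alignments taxon by taxon (partition 1 first, an assumption)."""
    return {name: part1[name] + part2[name] for name in part1}
```

A file on disk could then be read with `parse_seqfile(open(path).read())`, where `path` points at one of the .seqfile files.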

The data sets from the base simulation were used to construct the data sets in which partitions were misspecified. This was done by taking p percent of the sites from part1.seqfile and swapping them with p percent of the sites in part2.seqfile.
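The swapping step can be illustrated as follows. This is a sketch only: the description does not say how the swapped sites were selected, so the code assumes they are drawn uniformly at random and that the same number of sites, p percent of the shorter partition, is exchanged in each direction. Alignments are represented as dicts mapping taxon label to sequence string, and the name swap_sites is hypothetical.

```python
import random

def swap_sites(part1, part2, p, seed=None):
    """Exchange p percent of the alignment columns between two partitions."""
    rng = random.Random(seed)
    n1 = len(next(iter(part1.values())))
    n2 = len(next(iter(part2.values())))
    k = round(p / 100 * min(n1, n2))   # number of columns to exchange
    cols1 = rng.sample(range(n1), k)   # assumed: chosen uniformly at random
    cols2 = rng.sample(range(n2), k)
    new1 = {t: list(s) for t, s in part1.items()}
    new2 = {t: list(s) for t, s in part2.items()}
    for c1, c2 in zip(cols1, cols2):
        for t in part1:
            # Whole columns move together, so site patterns stay intact.
            new1[t][c1], new2[t][c2] = part2[t][c2], part1[t][c1]
    return ({t: "".join(s) for t, s in new1.items()},
            {t: "".join(s) for t, s in new2.items()})
```

With p = 0 both partitions are returned unchanged, which corresponds to the base simulation in subdirectory 0.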

The data in the directories for the simulations with misspecified partitions have the same structure as those of the base simulation. The subdirectories 5, ..., 50 of simulated-data-sets indicate the percentage, p, of misspecified sites. Each of these subdirectories has further subdirectories 1, 2, ..., 100, each of which gives results for a single independent simulation. The files for a single independent simulation are part1.seqfile, part2.seqfile and concat.seqfile. They are PHYLIP format sequence files of the form