Performance of AIC in selecting partition models and mixture models
Liu, Qin; Charleston, Michael; Richards, Shane; Holland, Barbara (2022), Performance of AIC in selecting partition models and mixture models, Dryad, Dataset, https://doi.org/10.5061/dryad.1jwstqjwj
In molecular phylogenetics, partition models and mixture models provide different approaches to accommodating heterogeneity in genomic sequencing data. Both types of models generally give a superior fit to data than models that assume the process of sequence evolution is homogeneous across sites and lineages. The Akaike Information Criterion (AIC), an estimator of Kullback-Leibler divergence, is a popular tool to select models in phylogenetics. However, there is limited literature investigating the performance of AIC in selecting between partition models and mixture models. Here, we show that under non-standard conditions (i.e. when some edges have small expected number of changes), AIC underestimates the expected Kullback-Leibler divergence. We show that under such conditions, AIC tends to choose partition models over mixture models even when mixture models performed more accurately than partition models in estimating the generating topology and the generating rate matrices. In addition, AIC also prefers simpler partition models over more complex partition models under non-standard conditions, despite the fact that the more complex partition model was the generating model. We incorporated the Bayesian Information Criterion (BIC) in this study and found that the results were similar to AIC. We also investigated how mispartitioning (i.e. a block of an alignment containing incorrectly allocated sites) affects both the performance of partition models compared to mixture models and the model selection process. We found that as the level of mispartitioning increases, the bias of AIC in estimating the expected Kullback-Leibler divergence remains the same, and the branch lengths and evolutionary parameters estimated by partition models become less accurate. Partition models performed better than mixture models in estimating the edge lengths and the base frequencies unless mispartitioning was severe. We recommend that researchers do not rely on AIC and BIC for model selection in non-standard conditions; other alternatives, such as cross-validation and bootstrapping should be considered.
The main processes included generating alignments, fitting four different partition and mixture models, and analysing results. The data were generated under Seq-Gen-1.3.4 (Rambaut and Grass 1997). The model fitting was performed IQ-TREE2 (Minh et al. 2020) on a Linux system. The results were analysed using the R package phangorn in R (version 3.6.2) (Schliep 2011, R Core Team 2019). We wrote custom bash scripts to extract relevant parts of the results from IQ-TREE2, and these results were processed in R.
The zip files contain four folders: "bash-scripts", "data", "R-codes", and "results-IQTREE2". The bash-scripts folder contains all the bash scripts for simulating alignments and performing model fitting. The data folder contains two child folders: "sequence-data" and "Rdata". The child folder "sequence-data" contains the alignments created for the simulations. The other child folder, "Rdata", contains the files created by R to store the results extracted from "IQTREE2" and the results calculated in R. The "R-codes" folder includes the R codes for analysing the results from "IQTREE2". The folder "results-IQTREE2" stores all the results from the fitted models. The three simulations we performed were essentially the same. We used the same parameters of the evolutionary models, and the trees with the same topologies but dierent edge lengths to generate the sequences.
The steps we used were: simulating alignments, model fitting and extracting results, and processing the extracted results. The first two steps were performed on a Linux system using bash scripts, and the last step was performed in R.
To simulate heterogeneous data we created two multiple sequence alignments (MSAs) under simple homogeneous models with each model comprising a substitution model and an edge-weighted phylogenetic tree (the tree topology was fixed). Each MSA contained eight taxa and 1000 sites. This was performed using the bash script “step1_seqgen_data.sh” in Linux.
These two MSAs were then concatenated together giving a MSA with 2000 sites. This was equivalent to generating the concatenated MSA under a two-block unlinked edge lengths partition model (P-UEL). This was performed using the bash script “step2_concat_data.sh”. This created the 0% group of MSAs.
In order to simulate a situation where the initial choice of blocks does not properly account for the heterogeneity in the concatenated MSA (i.e., mispartitioning), we randomly selected a proportion of 0%, 5%, 10%, 15%, …, up to 50% of sites from each block and swapped them. That is, the sites drawn from the first block were placed in the second block, and the sites drawn from the second block were placed in the first block. This process was repeated 100 times for each proportion of mispartitioned sites giving a total of 1100 MSAs.
This process involved two steps. The first step was to generate ten sets of different amounts of numbers without duplicates from each of the two intervals [1,1000] and [1001,2000]. The amounts of numbers were based on the proportions of incorrectly partitioning sites. For example, the first set has 50 numbers on each interval, and the second set has 100 numbers on each interval, etc. The first step was performed in R, and the R code was not provided but the random number text files were included. The second step was to select sites from the concatenated MSAs from the locations based on the numbers created in the first step. This created 5%, 10%, 15%, …, 50% groups of MSAs. The second steps used the following bash scripts: “step3_1_mixmatch_pre_data.sh” and “step3_2_mixmatch_data.sh”. The MSAs used in the simulations were created and stored in the “data” folder.
Model Fitting and Extracting Results
The next steps were to fit four different partition and mixture models to the data in IQ-TREE2 and extract the results. The models used were P-LEL partition model, P-UEL partition model, M-UGP mixture model and M-LGP mixture model. For the partition models, the partitioning schemes were the same: the first 1000 sites as a block and the second 1000 sites as another. For the groups of MSAs with different proportions of mispartitioned sites, this was equivalent to fitting the partition models with an incorrect partitioning scheme. The partitioning scheme was called “parscheme.nex”. The bash scripts for model fitting were stored in the “bash-scripts” folder. To run the bash scripts, users can follow the order which was shown in the names of these bash scripts.
The inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors and AIC values were extracted from the IQTREE2 results. These extracted results were stored in the “results-IQTREE2” folder and were used to evaluate the performance of AIC and performance of models in R.
Processing Extracted Results in R
To evaluate the performance of AIC and the performance of fitted partition models and mixture models, we calculated the following measures: the rEKL values, the bias of AIC in estimating the rEKL, the Robinson-Foulds (RF) distances, and the branch scores (bs). We also compared the distribution of the estimated model parameters (i.e. base frequencies and rate matrices) to the generating model parameters. These processes were performed in R.
The first step was to read in the inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors and AIC values that were extracted from IQTREE2 results. These R scripts were stored in the “R-codes” folder, and the names of these scripts started with “readpara_...” (e.g. “readpara_MLGP_standard”). After reading in all the parameters for each model, we estimated the measures mentioned above using the corresponding R scripts that were also in the “R-codes” folder.
The functions used in these R scripts were stored in the “R_functions_simulation”. It is worth noting that the directories need to be changed if users want to run these R scripts on their computers.
University of Tasmania