Dryad

Data from: SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees

Data files

Jun 08, 2011 version; files: 354.61 MB

Abstract

Highly accurate estimation of phylogenetic trees for large datasets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Co-estimation of alignments and trees has been attempted, but currently only SATé estimates reasonably accurate trees and alignments for large datasets in practical time frames (Liu et al., 2009b). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I, and so produces smaller, more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger datasets, and runs more efficiently than SATé-I. SATé-II is a meta-method that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Finally, because SATé-I used maximum likelihood methods that treat gaps as missing data to estimate trees, and because we found a correlation between the quality of tree/alignment pairs and maximum likelihood scores, we explored the degree to which SATé’s performance depends on using maximum likelihood with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using maximum likelihood with gaps treated as missing data to optimize the alignment and tree produces very poor results. 
First, we show that the optimization problem in which a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense: for all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma maximum likelihood to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is due not to the use of maximum likelihood as an optimization criterion for choosing the best tree/alignment pair, but rather to the particular divide-and-conquer re-alignment techniques employed.
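
The following Python sketch gives some intuition for why a likelihood score can stop depending on the tree once gaps are treated as missing data. It is an illustration only and does not reproduce the paper's proof or the SATé code. The tree encoding, the function names (`jc_prob`, `partial`, `site_likelihood`), the taxon names, and the branch lengths are all hypothetical choices made for this example. It computes the Jukes-Cantor likelihood of a single alignment column with Felsenstein's pruning algorithm, giving every state a conditional likelihood of 1 at a leaf that holds a gap.

```python
import math

STATES = "ACGT"

def jc_prob(same, t):
    """Jukes-Cantor transition probability along a branch of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partial(node, column):
    """Conditional likelihoods [L_A, L_C, L_G, L_T] for the subtree at node.

    Hypothetical encoding: a leaf is a taxon name (str); an internal node
    is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        ch = column[node]
        if ch == "-":
            return [1.0] * 4  # gap treated as missing data
        return [1.0 if s == ch else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial(child, column)
        for i in range(4):
            result[i] *= sum(jc_prob(i == j, t) * child_l[j] for j in range(4))
    return result

def site_likelihood(tree, column):
    """Likelihood of one column, with uniform root frequencies of 1/4."""
    return sum(0.25 * x for x in partial(tree, column))

# Two different topologies with arbitrary (made-up) branch lengths.
tree_a = (((("w", 0.1), ("x", 0.2)), 0.05), ((("y", 0.3), ("z", 0.4)), 0.05))
tree_b = (((("w", 1.0), ("y", 0.7)), 0.5), ((("x", 0.01), ("z", 2.0)), 0.3))

# A column holding a single nucleotide; every other taxon has a gap.
single = {"w": "A", "x": "-", "y": "-", "z": "-"}
single_a = site_likelihood(tree_a, single)
single_b = site_likelihood(tree_b, single)

# A column in which two taxa hold different nucleotides.
pair = {"w": "A", "x": "C", "y": "-", "z": "-"}
pair_a = site_likelihood(tree_a, pair)
pair_b = site_likelihood(tree_b, pair)

print(single_a, single_b)  # the two values agree up to rounding
print(pair_a, pair_b)      # the two values differ
```

In this sketch, the column with a single nucleotide scores 1/4 on both trees, because the gapped leaves are summed out and only the stationary frequency of the observed base remains. The column with two different nucleotides scores differently on the two trees. An alignment built entirely from columns of the first kind would therefore receive the same likelihood on every tree, which is the kind of degeneracy that makes the score unable to discriminate between trees.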