Phylogenetic analysis of policistronic amino-acid sequences encoded by 116 flavivirus genomes
Citation
Jermiin, Lars; Jayaswal, Vivek; Robinson, John (2022), Phylogenetic analysis of policistronic amino-acid sequences encoded by 116 flavivirus genomes, Dryad, Dataset, https://doi.org/10.5061/dryad.n8pk0p2xv
Abstract
Recently, Genome Biology and Evolution (11:3341-3352) published three statistical tests of whether alignments of sequence data violate the phylogenetic assumption of evolution under homogeneous conditions. The tests extend the matched-pairs tests of symmetry, marginal symmetry, and internal symmetry for pairs of aligned homologous sequences to the case where a whole alignment is considered. Here we show that the new tests are misleading. We explain why this is so, show how tests of whole alignments may be done, and release new bioinformatics tools that implement statistically sound methods of dealing with multiple comparisons (i.e., by controlling the family-wise error rate or the false discovery rate). Using the new software to analyse an alignment of amino acids encoded by 116 flavivirus genomes, we reveal, for the first time, that these genomes are unlikely to have evolved under stationary, reversible, and homogeneous Markovian conditions.
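To illustrate what controlling the family-wise error rate or the false discovery rate can look like in practice, the following is a minimal Python sketch. It assumes that one p-value has already been obtained per pair of sequences (116 sequences give 116 × 115 / 2 = 6,670 pairs). The Holm and Benjamini-Hochberg procedures are used here only as common examples of the two kinds of control; they are not necessarily the procedures implemented in the released tools. The function names and the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure (controls the family-wise error rate).

    Returns a list of booleans in the input order; True means the
    corresponding null hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        # Find the largest k such that p_(k) <= alpha * k / m.
        if p_values[idx] <= alpha * rank / m:
            largest_k = rank
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from five pairwise tests (illustration only).
example_p = [0.5, 0.001, 0.039, 0.015, 0.029]
holm_result = holm_reject(example_p)
bh_result = benjamini_hochberg_reject(example_p)
```

In this toy example the Benjamini-Hochberg procedure rejects more hypotheses than the Holm procedure, which reflects the usual trade-off: control of the false discovery rate is less conservative than control of the family-wise error rate.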
Methods
The polycistronic amino-acid sequences of 116 flavivirus genomes were retrieved from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and aligned using MAFFT v7.453 in ginsi mode (Mol. Biol. Evol., 30:772-780). The completeness of the alignment was surveyed using AliStat v1.13 (NAR Genom. Bioinf., 2:lqaa024), and sites containing ambiguous characters were deleted if the proportion of such characters exceeded 0.2. The resulting alignment of 3,367 sites had a completeness score (Ca) of 0.9880 (for details, see NAR Genom. Bioinf., 2:lqaa024). Model selection was done using ModelFinder (Nat. Methods, 14:587-589), as implemented in IQ-TREE2 v2.1.3 (Mol. Biol. Evol., 37:1530-1534); only substitution models for viral polypeptides were considered. For each model of sequence evolution considered, tree space was searched under the AIC and BIC optimality criteria. The rtREV+FO+I+R9 model was optimal under the AIC and the rtREV+FO+I+R7 model was optimal under the BIC. Using IQ-TREE2, the same tree was identified under the two models. The UFBoot2 procedure (Mol. Biol. Evol., 35:518-522) was used to assess the consistency of the phylogenetic signal.
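The site-filtering step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reimplementation of AliStat: the set of characters treated as ambiguous (`-`, `?`, `X`), the function names, and the toy alignment are all hypothetical, and the completeness score is computed here simply as the fraction of unambiguous characters in the alignment.

```python
# Assumption: gaps, '?' and 'X' are the characters counted as ambiguous.
AMBIGUOUS = set("-?Xx")


def filter_sites(sequences, max_ambiguous=0.2):
    """Drop alignment columns whose proportion of ambiguous characters
    exceeds max_ambiguous. All sequences must have the same length."""
    n_seqs = len(sequences)
    n_sites = len(sequences[0])
    keep = []
    for j in range(n_sites):
        n_amb = sum(1 for seq in sequences if seq[j] in AMBIGUOUS)
        if n_amb / n_seqs <= max_ambiguous:
            keep.append(j)
    return ["".join(seq[j] for j in keep) for seq in sequences]


def completeness(sequences):
    """Fraction of characters in the alignment that are unambiguous."""
    total = sum(len(seq) for seq in sequences)
    specified = sum(1 for seq in sequences for ch in seq if ch not in AMBIGUOUS)
    return specified / total


# Toy alignment of five sequences and four sites (illustration only).
toy_alignment = ["MK-A", "MKXA", "M--A", "MKLA", "MKLA"]
filtered = filter_sites(toy_alignment, max_ambiguous=0.2)
score = completeness(filtered)
```

In the toy alignment, the third column is removed because three of its five characters are ambiguous, whereas the second column is kept because its proportion of ambiguous characters (0.2) does not exceed the threshold.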
Funding
Australian Research Council, Award: DP200103151