Scripts and data sets associated with: On testing homogeneity of the evolutionary process using alignments of homologous sequences
Data files
May 06, 2024 version files 225.96 MB
- README.md
- Supplementary_data.tar.bz2
Abstract
In 2019, Genome Biology and Evolution (11:3341-3352) published three statistical tests for assessing whether alignments of genome sequences violate the phylogenetic assumption of evolution under homogeneous conditions. The new tests extend the matched-pairs tests of symmetry, marginal symmetry, and internal symmetry for alignments of n = 2 homologous sequences of nucleotides or amino acids to cases where alignments of n > 2 sequences are considered. Here we discuss the limitations of these new tests and then outline alternative approaches, which permit formal testing of multiple hypotheses (i.e., by controlling either the family-wise error rate or the false discovery rate). We show that the other approaches provide much greater insight into variation of the evolutionary process across lineages, via informal graphical methods and formal statistical procedures. Using one of the procedures (i.e., the Bonferroni test), we show that evolution under heterogeneous conditions is more prevalent than reported in the paper cited above and that the power of the matched-pairs tests of homogeneity is linked to the number of variant sites in an alignment. We release a new version of Homo, a program that allows for formal testing of multiple hypotheses and calculation of adjusted P values. Using Homo, we analysed an alignment of amino acids encoded by 116 flavivirus genomes, and reveal that these viral genomes are unlikely to have evolved under homogeneous conditions. To our knowledge, this is the first time that this has been reported for medically important Flavivirus genomes.
README: Scripts and data sets associated with "On Testing Homogeneity of the Evolutionary Process using Alignments of Homologous Sequences"
https://doi.org/10.5061/dryad.n8pk0p2xv
This DRYAD submission contains batch scripts, sequence data, and explanations of what was done in five computer-based experiments described in the manuscript with the above-mentioned title.
- The content in folder Experiment_1 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether the edge lengths of a tree have an impact on the Type I error rate.
- The content in folder Experiment_2 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether the edge lengths of a tree have an impact on the Type II error rate.
- The content in folder Experiment_3 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether alignment length has an impact on the Type II error rate.
- The content in folder Experiment_4 relates to the analysis of the performance of four tests concerning the global null hypothesis (H_G) of evolution under SRH conditions. In this case, the tests considered are the Bonferroni (1936) test, the Hommel (1983) test, the Simes (1986) test, and the Naser-Khdour et al. (2019) test.
- The content in folder Experiment_5 relates to the analysis of the performance of four tests concerning the global null hypothesis (H_G) of evolution under SRH conditions. In this case, the tests considered are the Bonferroni (1936) test, the Hommel (1983) test, the Simes (1986) test, and the Naser-Khdour et al. (2019) test.
Jointly, the results reveal that: (a) the Naser-Khdour et al. (2019) test may mislead by failing to reject the null hypothesis that a process is stationary, reversible, and homogeneous (SRH) over the tree, and (b) other approaches are less likely to mislead users who want to know whether their data violate the common phylogenetic assumption of evolution under SRH conditions.
Description of the data and file structure
This submission contains batch scripts, sequence data, and explanations of what was done in five computer-based experiments described in the manuscript with the above-mentioned title. The submission is divided into files and folders, all labelled in order of use (e.g., Experiment_1, 00_README, 01_Tree_a, etc.). The format of the submission was chosen to make it as FAIR-compliant as possible. In other words, the research done in the five experiments is reproducible.
Sharing/Access information
For alternative access to the real sequence data used in this study, we refer interested parties to GenBank.
Links to other publicly accessible locations of the data:
GenBank accession numbers are included in a FASTA-formatted file, together with each sequence of amino acids used from the Flavivirus genomes.
Code/Software
All statistical analyses were carried out using Homo v2.2, which is a program written in C++. Heat maps were generated using a Perl script called HomoHeatMapper. Homo and HomoHeatMapper are available from:
Methods
This submission contains batch scripts, sequence data, and protocols describing what was done in five computer-based experiments outlined in the manuscript with the above-mentioned title. The format of the submission was chosen such that it is as FAIR-compliant as possible (i.e., that the data are Findable, Accessible, Interoperable, and Reusable). In other words, the research done in the five experiments is reproducible.
- The content in folder Experiment_1 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether the edge lengths of a tree have an impact on the Type I error rate. The file named 00_README describes the method used.
- The content in folder Experiment_2 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether the edge lengths of a tree have an impact on the Type II error rate. The file named 00_README describes the method used.
- The content in folder Experiment_3 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether alignment length has an impact on the Type II error rate. The file named 00_README describes the method used.
- The content in folder Experiment_4 relates to the analysis of the performance of four tests concerning the global null hypothesis (H_G) of evolution under SRH conditions. In this case, the tests considered are Bonferroni's (1936) test, Hommel's (1983) test, Simes' (1986) test, and Naser-Khdour et al.'s (2019) test. The file named 00_README describes the method used.
- The content in folder Experiment_5 relates to the analysis of the performance of four tests concerning the global null hypothesis (H_G) of evolution under SRH conditions. In this case, the tests considered are Bonferroni's (1936) test, Hommel's (1983) test, Simes' (1986) test, and Naser-Khdour et al.'s (2019) test.
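The tests evaluated in these experiments build on the matched-pairs test of symmetry for a pair of aligned sequences. As a rough illustration of that building block, the following Python sketch computes Bowker's test statistic from the divergence matrix of two aligned nucleotide sequences. The function name and the toy sequences are hypothetical, and the choice to count degrees of freedom only over off-diagonal pairs with at least one observation is an assumption made here; Homo may handle such cells differently.

```python
from collections import Counter
from typing import Tuple

NUCLEOTIDES = "ACGT"


def bowker_symmetry(seq_x: str, seq_y: str, alphabet: str = NUCLEOTIDES) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for Bowker's test of symmetry."""
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have equal length")
    # Divergence matrix: counts[(a, b)] = number of sites with a in seq_x and b in seq_y.
    counts = Counter(
        (a, b)
        for a, b in zip(seq_x.upper(), seq_y.upper())
        if a in alphabet and b in alphabet  # skip gaps and ambiguous characters
    )
    statistic, df = 0.0, 0
    for i, a in enumerate(alphabet):
        for b in alphabet[i + 1:]:
            n_ab, n_ba = counts[(a, b)], counts[(b, a)]
            if n_ab + n_ba > 0:  # assumption: empty cell pairs are skipped
                statistic += (n_ab - n_ba) ** 2 / (n_ab + n_ba)
                df += 1
    return statistic, df


# Toy pair of aligned sequences (illustration only).
stat, df = bowker_symmetry("AAAACCGT", "CCCAACGT")
print(f"statistic = {stat:.3f}, df = {df}")
```

Under the null hypothesis of symmetry, the statistic is compared with a chi-square distribution on `df` degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(stat, df)` gives the corresponding P value.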
As for the genome data used in Experiment_2, we note:
- The polycistronically encoded amino-acid sequences of 116 flavivirus genomes were retrieved from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and aligned using MAFFT v7.453 (using ginsi mode) (Mol. Biol. Evol., 30:772-780).
- The completeness of the alignment was surveyed using AliStat v1.13 (NAR Genom. Bioinf., 2:lqaa024) and sites containing ambiguous characters were deleted if the proportion of such characters exceeded 0.2. The resulting alignment of 3,367 sites had a completeness score (Ca) of 0.9880 (for details, see NAR Genom. Bioinf., 2:lqaa024).
- Model selection was done using ModelFinder (Nat. Methods, 14:587-589), which is implemented in IQ-TREE2 v2.1.3 (Mol. Biol. Evol., 37:1530-1534). We only considered substitution models for viral polypeptides. For each model of sequence evolution considered, tree space was searched under the AIC and BIC optimality criteria. The rtREV+FO+I+R9 model was optimal under the AIC and the rtREV+FO+I+R7 model was optimal under the BIC. Using IQ-TREE2, the same tree was identified under the two models.
- The UFBoot2 procedure (Mol. Biol. Evol., 35:518-522) was used to assess the consistency of the phylogenetic signal.
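The site-filtering step described above can be sketched in Python as follows. This is an illustration only, not AliStat itself: it assumes that any character outside the 20 standard amino acids (including gaps) counts as ambiguous, and that the completeness score Ca is the fraction of completely specified characters in the alignment. The function names and the toy alignment are hypothetical.

```python
from typing import Dict

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def filter_sites(alignment: Dict[str, str], max_ambiguous: float = 0.2) -> Dict[str, str]:
    """Delete sites whose proportion of ambiguous characters exceeds max_ambiguous."""
    names = list(alignment)
    rows = [alignment[name].upper() for name in names]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all sequences in an alignment must have equal length")
    n_seqs = len(rows)
    keep = [
        col
        for col in range(len(rows[0]))
        if sum(row[col] not in AMINO_ACIDS for row in rows) / n_seqs <= max_ambiguous
    ]
    return {name: "".join(row[col] for col in keep) for name, row in zip(names, rows)}


def completeness(alignment: Dict[str, str]) -> float:
    """Fraction of completely specified characters (assumed definition of Ca)."""
    total = sum(len(seq) for seq in alignment.values())
    specified = sum(ch in AMINO_ACIDS for seq in alignment.values() for ch in seq.upper())
    return specified / total


# Toy alignment (illustration only).
toy = {"s1": "MK-LA", "s2": "MKXLA", "s3": "MK-L-", "s4": "MKTLA", "s5": "MKTLA"}
filtered = filter_sites(toy)
print(filtered)
print(f"Ca = {completeness(filtered):.4f}")
```

In the toy alignment, the third site is removed because three of five characters are ambiguous (0.6 > 0.2), whereas the last site is kept because its proportion of ambiguous characters equals, but does not exceed, 0.2.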