Data from: Scoutknife: A naïve, whole genome informed phylogenetic robusticity metric
Data files
Jul 11, 2023 version files (25.79 GB)
- README.md
- RealData.zip
- ScriptsAndTables.zip
- SimulationData.zip
Jul 18, 2023 version files (25.79 GB)
- README.md
- RealData.zip
- ScriptsAndTables.zip
- SimulationData.zip
Abstract
The phylogenetic bootstrap, first proposed by Felsenstein in 1985, is a critically important statistical method in assessing the robusticity of phylogenetic datasets. Core to its concept was the use of pseudosampling - assessing the data by generating new replicates derived from the initial dataset that was used to generate the phylogeny. In this way, phylogenetic support metrics could overcome the lack of perfect, infinite data. With infinite data, however, it is possible to sample smaller replicates directly from the data to obtain both the phylogeny and its statistical robusticity in the same analysis. Due to the growth of whole genome sequencing, the depth and breadth of our datasets have greatly expanded and are set to only expand further. With genome-scale datasets comprising thousands of genes, we can now obtain a proxy for infinite data. Accordingly, we can potentially abandon the notion of pseudosampling and instead randomly sample small subsets of genes from the thousands of genes in our analyses. Here, we introduce Scoutknife, a jackknife-style subsampling implementation that generates 100 datasets by randomly sampling a small number of genes from an initial large-gene dataset to jointly establish both a phylogenetic hypothesis and assess its robusticity. Using 18 previously published datasets and 100 simulation studies, we show that Scoutknife is conservative and informative as to conflicts and incongruence across the whole genome, without the need for subsampling based on traditional model selection criteria.
Methods
Dataset Construction and Analysis
For both real and simulation analyses (for details see below), 100 genes were randomly selected 100 times from the source datasets, generating 100 100-gene concatenated sample datasets using the Scoutknife package (https://github.com/JFFleming/Scoutknife). The Scoutknife script package requires catsequences [23] to be installed as a prerequisite, available at https://github.com/ChrisCreevey/catsequences.
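The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration and not the Scoutknife package's own code: the function name `scoutknife_style_samples`, the gene file names and the seed are hypothetical, and sampling without replacement within each replicate is an assumption based on the jackknife-style description.

```python
import random


def scoutknife_style_samples(gene_names, n_replicates=100, genes_per_replicate=100, seed=None):
    """Draw n_replicates random gene subsets (hypothetical helper, not the Scoutknife API).

    Assumption: genes are drawn without replacement within a replicate,
    while different replicates are drawn independently of each other.
    """
    if genes_per_replicate > len(gene_names):
        raise ValueError("cannot sample more genes than the dataset contains")
    rng = random.Random(seed)
    return [rng.sample(gene_names, genes_per_replicate) for _ in range(n_replicates)]


# Hypothetical gene identifiers standing in for the alignments of a 1049-gene dataset.
genes = [f"gene_{i:04d}.fasta" for i in range(1049)]
replicates = scoutknife_style_samples(genes, seed=1)
```

Each inner list names the genes that would then be concatenated (for example with catsequences) into one 100-gene sample dataset.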
Phylogenies for each Scoutknife dataset were constructed under IQ-Tree v1.6.12 [24] using Modeltest [25], with a separate model applied to each gene and no partition merging. As a data density-based technique, Scoutknife might be expected to perform better in high data density scenarios where partitions can be comfortably merged. Disabling partition merging was therefore intended to limit the efficacy of Scoutknife further and to test its performance under a scenario with more highly variable best-fit models than might be expected under normal conditions, whilst conserving computational effort given the large number of test datasets and simulations. To further facilitate parallelization, the phylogenetic analyses of the datasets were submitted using Scoutknifette (https://github.com/Togtja/scoutknifette). Scoutknifette is a custom HPC webhook for the group messaging service Discord that can be easily modified for any HPC task that requires multiple submission batches and queue tracking.
The trees produced by each Scoutknife sample dataset were then concatenated into a single treelist file (see Supplemental Information), and a consensus tree was constructed using bpcomp, available in Phylobayes [26], with a burnin of 0 and a sampling rate of 1, so that every tree in the treelist was sampled. Trees were constructed as both 70% strict consensus and 50% majority consensus trees, and the results were compared. In two cases (Araneae and Lepidoptera), 30% plurality consensus trees were constructed using the same method to further explore the data, as explained in the results and discussion section. In a single case (Actinopterygii), the low occupancy of two species in particular (Muraenesox cinereus with 1 gene and Scomber scombrus with 15 genes across the entire dataset of 1105) meant that many of the Scoutknife samples did not contain representatives of these taxa. To address this, we used sumtrees.py v4.5.2, part of the DendroPy package [27], as it is capable of building consensus trees from tree lists containing a variable number of taxa.
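The effect of the 70%, 50% and 30% thresholds can be illustrated with a small frequency filter over clades. This is a simplified sketch, not a substitute for bpcomp or sumtrees.py: each tree is represented as a set of clades (each clade a `frozenset` of taxon names), the function name and toy data are hypothetical, the inclusive `>=` cutoff is an assumption, and the sketch does not resolve incompatible clades, which can co-occur below 50%.

```python
from collections import Counter


def consensus_clades(trees, threshold):
    """Return {clade: frequency} for clades found in at least `threshold` of the trees.

    Hypothetical helper: `trees` is a list of sets of frozensets of taxon names.
    Below a 0.5 threshold the retained clades may conflict with one another;
    no compatibility check is attempted here.
    """
    n_trees = len(trees)
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if count / n_trees >= threshold
    }


# Toy treelist of five trees on taxa A-E (illustrative data only).
ab, cd, ac = frozenset("AB"), frozenset("CD"), frozenset("AC")
toy_trees = [{ab, cd}, {ab, cd}, {ab, cd}, {ab}, {ac}]

strict_70 = consensus_clades(toy_trees, 0.7)    # keeps only the A+B clade
majority_50 = consensus_clades(toy_trees, 0.5)  # keeps A+B and C+D
```

In this toy example the A+B clade appears in four of five trees and survives both cutoffs, whereas C+D (three of five) is retained only at the 50% level.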
The Quartet Similarity, Quartet Divergence, Node Conflict, Node Agreement, Strict Joint Assertions, Semi-Strict Joint Assertions, Symmetric Difference, Marczewski-Steinhaus, Steel-Penny and Overall Similarity were measured with reference to the previously published topology of the 250 most informative genes of that dataset, as selected by GeneSortR [20]. In the case of the simulated datasets, the topology of the 250 most informative genes of the original source dataset, Milla et al. (2020), as selected by GeneSortR [20, 28], was used. These similarity metrics were calculated using the ‘Quartet’ library available in R [29].
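The idea behind quartet-based comparison can be shown with a brute-force sketch: for every set of four taxa, check whether two trees resolve it in the same way. This is only an illustration of the concept and does not reproduce the definitions of the individual metrics in the R ‘Quartet’ library. Trees are assumed to be given as lists of clades (one side of each internal edge, as a `frozenset`), and both function names are hypothetical.

```python
from itertools import combinations


def quartet_resolution(clades, quartet):
    """Return the split a tree induces on four taxa, or None if unresolved."""
    quartet = frozenset(quartet)
    for clade in clades:
        inside = clade & quartet
        if len(inside) == 2:
            # Two taxa on each side of this edge: the quartet is resolved.
            return frozenset({inside, quartet - inside})
    return None


def shared_quartet_fraction(clades_a, clades_b, taxa):
    """Fraction of all quartets resolved identically in both trees (simplified measure)."""
    total = same = 0
    for quartet in combinations(sorted(taxa), 4):
        total += 1
        res_a = quartet_resolution(clades_a, quartet)
        res_b = quartet_resolution(clades_b, quartet)
        if res_a is not None and res_a == res_b:
            same += 1
    return same / total


# Toy five-taxon trees (illustrative data only).
reference = [frozenset("AB"), frozenset("CD")]  # ((A,B),(C,D),E)
candidate = [frozenset("AB"), frozenset("CE")]  # ((A,B),(C,E),D)
fraction = shared_quartet_fraction(reference, candidate, "ABCDE")
```

Here the two trees agree on the three quartets that contain both A and B and disagree on the remaining two, so `fraction` is 0.6. Enumerating all quartets scales poorly with taxon number, which is one reason to rely on a dedicated library for real datasets.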
Real Test Datasets
To assess the efficacy of Scoutknife, we examined 18 real datasets [28, 30-46], those used in a similar benchmarking study of GeneSortR [20]. These datasets range from 1049 to 5105 genes and from 30 to 332 taxa in size, comprising studies of animals, plants and fungi (Table 1). In contrast to the prior study, genes with less than 50% occupancy were not removed: Scoutknife should show decreased performance at low occupancy levels, as it relies on data density, so retaining these genes should give a clearer picture of how the methodology performs across a variety of real datasets. The resultant tree topologies were then compared to the topology recovered by analysing the most informative 250 genes, as determined by GeneSortR [20], to assess whether the same topological hypothesis was resolved by the Scoutknife consensus tree.
Simulation Datasets
To further assess the efficacy of Scoutknife, we generated 100 simulation datasets using the alignment mimic function of AliSim, as implemented in IQ-Tree v2.2.0 [47, 48]. For this, 100 simulations were independently created for each gene in the Milla et al. (2020) [28] Heliozelidae dataset. At 1049 genes, it was the smallest dataset in our real data study, and as such should have presented a challenge for Scoutknife. Furthermore, AliSim’s alignment mimic [47] allows us to generate alignment datasets that mimic real genes, complete with low occupancy and reasonable variations in alignment length. AliSim was run with the following command:
iqtree2 --alisim <Output> -s <Gene> --num-alignments 100
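Because the command is run once per gene, it is convenient to generate the argument list programmatically. The sketch below only builds the command and does not execute it; the helper name `alisim_mimic_command`, the directory layout and the file names are hypothetical, and using the gene's file stem as the output prefix is an assumption.

```python
from pathlib import Path


def alisim_mimic_command(gene_alignment, output_dir, n_alignments=100):
    """Build the iqtree2 AliSim mimic command for one gene alignment (hypothetical helper)."""
    gene = Path(gene_alignment)
    prefix = Path(output_dir) / gene.stem  # assumed naming scheme for the output prefix
    return [
        "iqtree2",
        "--alisim", str(prefix),
        "-s", str(gene),
        "--num-alignments", str(n_alignments),
    ]


# Hypothetical gene alignment paths; a real run would list the 1049 gene files.
gene_files = ["genes/gene_0001.fasta", "genes/gene_0002.fasta"]
commands = [alisim_mimic_command(g, "simulations") for g in gene_files]
```

Each list in `commands` could be passed to `subprocess.run(cmd, check=True)` or written to a job script for submission to a cluster queue.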
For each set of 1049 simulated genes, 100 100-gene Scoutknife datasets were constructed and then analysed using IQ-Tree, as with the real datasets. The Quartet Similarity, Quartet Divergence, Node Conflict, Node Agreement, Strict Joint Assertions, Semi-Strict Joint Assertions, Symmetric Difference, Marczewski-Steinhaus, Steel-Penny and Overall Similarity were then measured with reference to the previously published topology generated by analysing the 250 most informative genes of the Milla et al. (2020) dataset, as selected by GeneSortR [20]. As each gene was simulated independently, it should in theory retain the topology of that initial single-gene dataset, thereby replicating the discordance present in the original dataset. Furthermore, directly comparing our random samples of simulated datasets to the most informative genes of the source dataset should disadvantage Scoutknife, as some of the simulated data may support an alternative topology that matches neither the single-gene topology nor the real informative-gene topology.
Usage notes
All data files can be opened in any text editor.
Supplementary files in the Tables category are in Excel format and can be opened with LibreOffice.
Scoutknife is available at https://github.com/JFFleming/Scoutknife