# Online phylogenetics with matOptimize for SARS-CoV-2

## Cite this dataset

Thornlow, Bryan et al. (2023). Online phylogenetics with matOptimize for SARS-CoV-2 [Dataset]. Dryad. https://doi.org/10.7291/D13Q2J

## Abstract

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mould. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established maximum likelihood implementations for large SARS-CoV-2 phylogenies.

## Methods

Full details and scripts for the analyses are available at our Github repository: https://github.com/bpt26/parsimony

**Repository 1: Make Starting Tree:**

We first developed a "global phylogeny", from which all analyses in this study were performed. We began by downloading VCF and FASTA files corresponding to March 18, 2021 from our own daily-updated database (McBroome et al. 2021). The VCF file contains pairwise alignments of each of the 434,063 samples to the SARS-CoV-2 reference genome. We then implemented filters, retaining only sequences containing at least 28,000 non-N nucleotides, and fewer than two non-[ACGTN-] characters. We used UShER to create a phylogeny from scratch using only the remaining 366,492 samples. To remove potentially erroneous sequences, we iteratively pruned this tree of highly divergent internal branches with branch parsimony scores greater than 30, then terminal branches with branch parsimony scores greater than 6, until convergence, resulting in a final global phylogeny containing 364,427 samples. The branch parsimony score indicates the total number of substitutions along a branch. Similar filters based on sequence divergence are used by existing SARS-CoV-2 phylogenetic inference methods. For full reproducibility, files used for creating the global phylogeny can be found in subrepository 1 on the project GitHub page (Thornlow et al. 2021b).

**Repository 2: Optimize Starting Tree:**

Following this, we tested several optimization strategies on this global phylogeny, hereafter the "starting tree". We used matOptimize, FastTree 2, and maximum parsimony (MP) IQ-TREE 2. MP IQ-TREE 2 uses parsimony as the optimality criterion in contrast to the maximum likelihood mode used in all other experiments, which was infeasible on a dataset of this size. In these optimization experiments, we used experimental versions of MP IQ-TREE 2 that allow finer control of parsimony parameters (specific versions are listed in the supplemental Github repository). In one experiment, we used the starting tree and its corresponding alignment and ran five iterations of MP IQ-TREE 2, varying the SPR radius from 20 to 100 in increments of 20. Experiments on a small dataset indicated that there is little or no improvement in parsimony score beyond a radius of 100. Separately, we tested another strategy that applied two iterations of MP IQ-TREE 2 to the starting tree, the first iteration using an SPR radius of 20 and the second using a radius of 100. Finally, we tested a strategy of six iterations of pseudo-likelihood optimization with FastTree 2 followed by two iterations of parsimony optimization with matOptimize. The tree produced by this strategy, hereafter the "ground truth" tree, had the highest likelihood of all the strategies we tested. This tree (after_usher_optimized_fasttree_iter6.tree) and files for these optimization experiments can be found in subrepository 2.

In the multifurcating ground truth tree of 364,427 samples, there are 265,289 unique (in FASTA sequence) samples. There are 447,643 nodes in the tree. For reference, a full binary tree with the same number of leaves has 728,853 nodes. 23,437 of the 29,903 sites in the alignment are polymorphic (they display at least two non-ambiguous nucleotides). Homoplasies are common in these data. In the starting tree, 19,090 sites display a mutation occurring on at least two different branches, and 4,976 sites display a mutation occurring more than ten times in the tree.

**Repository 3: Real Data Experiments:**

To mimic pandemic-style phylogenetics, we separated a total of 233,326 samples from the starting tree of 364,427 samples into 50 batches of ~5,000 by sorting according to the date of sample collection. We then set up two frameworks for each of the three software packages (matOptimize (commit 66ca5ff, conda version 0.4.8), maximum-likelihood IQ-TREE 2 (multicore version 2.1.3 COVID-edition), and FastTree 2 (Double Precision version 2.1.10)). The online phylogenetics frameworks began by using UShER to infer a small tree de novo from the first batch of samples, followed by alternating steps of optimization using one of the three evaluated methods and placement of additional samples with UShER. In de novo phylogenetics, we supplied each software package with an alignment corresponding to all samples in that batch and its predecessors (or VCF for matOptimize) without a guide tree. For both cases, each tree is larger than its predecessor by ~5,000 samples, and each tree necessarily contains all samples in the immediately preceding tree. For FastTree 2, we used 2 rounds of subtree-prune-regraft (SPR) moves (-spr 2), maximum SPR length of 1000 (-sprlength 1000), zero rounds of minimum evolution nearest neighbor interchanges (-nni 0), and the Generalised Time Reversible + Gamma (GTR+G) substitution model (-gtr -gamma). For IQ-TREE 2, we used a branch length minimum of 0.000000001 (-blmin 1e-9), zero rounds of stochastic tree search (-n 0), and the GTR+G substitution model (-m GTR+G). With these parameters, IQ-TREE 2 constructs a starting parsimony tree and then performs hill-climbing NNI steps to optimize likelihood, avoiding the significant time overhead of stochastic search. We ran all matOptimize analyses using an instance with 15 CPUs and 117.2 GB of RAM, and we ran all IQ-TREE 2 and FastTree 2 analyses on an instance with 31 CPUs and 244.1 GB of RAM, but we limited each command to 15 threads for equivalence with matOptimize. Files for all simulated data experiments can be found in subrepository 3.

**Repository 4: Simulated Data Experiments:**

To generate our simulated data, we used the SARS-CoV-2 reference genome (GISAID ID: EPI_ISL_402125; GenBank ID: MN908947.3) (Shu and McCauley 2017; Sayers et al. 2021) as the root sequence and used phastSim (De Maio et al. 2021b) to simulate according to the ground truth phylogeny described above. Intergenic regions were evolved using phastSim using the default neutral mutation rates estimated in ref. (De Maio et al. 2021a), with position-specific mean mutation rates sampled from a gamma distribution with alpha=beta=4, and with 1% of the genome having a 10-fold increase mutation rate for one specific mutation type (SARS-CoV-2 hypermutability model described in ref. (De Maio et al. 2021b)). Evolution of coding regions was simulated with the same neutral mutational distribution, with a mean nonsynonymous/synonymous rate ratio of omega=0.48 as estimated in (Turakhia et al. 2021a), with codon-specific omega values sampled from a gamma distribution with alpha=0.96 and beta=2. Rates for each intergenic and coding region were not normalized in order to have the same baseline neutral mutation rate distribution across the genome.

We repeated our iterative experiments using de novo and online matOptimize, IQ-TREE 2 and FastTree 2 on this simulated alignment, using the same strategies as before. However, instead of computing parsimony and likelihood scores, we computed the Robinson-Foulds (RF) distance (Robinson and Foulds 1981) of each optimization to the ground truth tree, pruned to contain only the samples belonging to that batch. To calculate each RF distance, we used the -O (collapse tree) argument in matUtils extract (McBroome et al. 2021) and then used the dist.topo command in the ape package in R (Paradis and Schliep 2019), comparing the collapsed optimized tree and the pruned, collapsed ground truth tree at each iteration. We computed normalized RF distances as a proportion of the total possible RF distance, which is equivalent to two times the number of samples in the trees minus six (Steel and Penny 1993).

Eliminating the 24-hour runtime restriction, we also repeated the first three de novo iterative experiments on both real and simulated data to compare UShER+matOptimize, IQ-TREE 2 with stochastic search, and RAxML-NG. These iterations of ~4.5k, ~8.9k, and ~13.2k samples were allowed to run for up to 14 days. For runs that did not terminate within this time (the second and third iterations of RAxML-NG), we used the best tree inferred during the run for comparisons. We ran IQ-TREE 2 and RAxML-NG under the GTR+G model with the smallest minimum branch length parameter that did not cause numerical errors. To compare the trees inferred from real data, we computed log-likelihoods under the GTR+G model for all trees, fixing the model parameters to those estimated by IQ-TREE 2 during tree inference. We also compared the log-likelihoods of the trees under the parameters estimated by RAxML-NG for the first iteration, but could not do so for the second and third iterations which did not terminate in under two weeks. We allowed optimization of branch lengths during likelihood calculation. For the UShER+matOptimize trees, before computing likelihoods, we converted the branch lengths into units of substitutions per site by dividing each branch length by the alignment length (29,903). To compare the trees inferred from simulated data, we computed the RF and quartet distances of each tree to the corresponding ground truth tree described above.

## Usage notes

This repository contains supplemental results, data, and scripts for Thornlow et al., 2022. It is split into four folders. See attached README.md for more detail.

This submission contains the following datatypes:

.fa files which can be viewed with any text-editing software

.nwk files which can be visualized with FigTree or any other phylogenetics software

.tree/.treefile/.bestTree files which are functionally identical to .nwk files

.iqtree files which are text files describing each IQ-TREE command run

.vcf files which follow the format described here: https://en.wikipedia.org/wiki/Variant_Call_Format

.pb files which can be analyzed using matUtils (https://usher-wiki.readthedocs.io/en/latest/QuickStart.html#matutils).

.tar.xz files which can be extracted using the command "tar xvf (filename)"

Other files with extenstions .txt, .bestModel, .log, or others not specified here are standard text files.

Most files are compressed in either .xz format and can be decompressed using the command "unxz (filename)".

## Funding

National Institute of General Medical Sciences, Award: R35GM128932

University of California Office of the President Emergency COVID-19 Research Seed Funding, Award: R00RG2456

Australian Research Council, Award: DP200103151

National Human Genome Research Institute, Award: T32HG008345

National Human Genome Research Institute, Award: F31HG010584