Data from: SNaQ.jl: Improved scalability for phylogenetic network inference
Abstract
Phylogenetic networks represent complex biological scenarios that are overlooked in trees, such as hybridization and horizontal gene transfer. Although numerous methods have been developed for phylogenetic network inference, their scalability is severely limited by the computational demands of likelihood optimization and the vastness of network space. Composite (or pseudo-) likelihood approaches like SNaQ have improved computational tractability for network inference, but they remain inadequate for datasets of sizes routinely handled by tree inference methods. Here, we introduce SNaQ.jl, a new standalone Julia package with the composite likelihood inference originally implemented within PhyloNetworks.jl as well as new scalability features that enhance computational efficiency through (1) parallelization of quartet likelihood calculations during composite likelihood computation, (2) weighted random selection of quartets, and (3) probabilistic decision-making during network search. Through a simulation study and empirical data analysis, we show that this new version of SNaQ.jl (version 1.1) improves average runtimes by up to 400% with no change in accuracy.
This dataset supports a simulation study and an empirical analysis evaluating the computational improvements made in SNaQ.jl version 1.1, a Julia package for phylogenetic network inference using composite likelihood. Version 1.1 introduces scalability improvements, including parallelized quartet calculations, weighted quartet selection, and probabilistic network search, achieving up to 400% runtime improvements with maintained accuracy.
The data files for this study are contained in the dryad.zip file. The scripts used to analyze the data are in an accompanying Zenodo repository.
Replication instructions
Given the breadth of parameters tested in these simulations and the computational intensity of each individual simulation, this study must be run on a high-throughput computing cluster. We use HTCondor for this purpose. Many of the scripts in this repository are structured specifically for an HTCondor environment and will not work properly outside of one. See condor/README.md for replication instructions on an HTCondor cluster. Be sure to also download the scripts in the accompanying Zenodo repository (DOI: 10.5281/zenodo.17545337). The simulations read input data from the data/input/ folder and write their output to the data/output/ folder as easy-to-read CSV files.
To replicate the empirical results, submit the empirical-cui/condor/submit.submit file with the condor_submit command from an HTCondor submit node. SNaQ.jl output files will then be written to empirical-cui/condor/snaq_outputs/.
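The empirical submission step can be sketched as a short shell script. This is a minimal sketch, not part of the repository: it assumes you run it from the repository root on an HTCondor submit node, and that the relative paths inside the submit file resolve from there. If they do not, change into empirical-cui/condor/ before submitting. The variable names are illustrative only.

```shell
#!/bin/sh
# Paths come from the instructions above; the variable names are hypothetical.
SUBMIT_FILE="empirical-cui/condor/submit.submit"
OUTPUT_DIR="empirical-cui/condor/snaq_outputs"

if command -v condor_submit >/dev/null 2>&1 && [ -f "$SUBMIT_FILE" ]; then
    # Queue the empirical analysis jobs described by the submit file.
    condor_submit "$SUBMIT_FILE"
    # List your queued jobs and their current states.
    condor_q
    echo "SNaQ.jl outputs are expected in $OUTPUT_DIR once the jobs finish."
else
    # Assumption: absence of either means we are not on a submit node
    # or not in the repository root.
    echo "condor_submit or $SUBMIT_FILE not found; run from the repository root on an HTCondor submit node." >&2
fi
```

Once the queue reported by condor_q is empty, the output directory can be inspected with ordinary tools such as ls.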
Study overview
This repository contains the complete computational pipeline for:
- Simulation Study: Network inference performance and runtime analysis with networks consisting of either 10 or 20 taxa and either 1 or 3 hybrid nodes.
- Empirical Analysis: Application to Xiphophorus phylogenomic data from Cui et al. 2013
- Performance Evaluation: Runtime and accuracy comparisons between SNaQ.jl versions 1.0 and 1.1
Repository structure
- condor/: HTCondor job submission and execution scripts
- data/: Original networks, simulation outputs, and figures
- empirical-cui/: Empirical analysis using Cui et al. 2013 Xiphophorus dataset
- pipelines/: Core Julia simulation and analysis scripts
- results/: Consolidated results from all simulation analyses
- scripts/: Utility scripts for data processing
- software/: External tools (IQ-TREE, Seq-Gen)
