Algebraic invariants for inferring 4-leaf semi-directed phylogenetic networks
Data files
Oct 29, 2025 version files 106.62 GB
- GMM_results.tar.gz (111.75 KB)
- GMM_simulated_data.tar.gz (1.59 GB)
- JC_deg3.tar.gz (1.96 MB)
- JC_phy_deg3.tar.gz (721.76 KB)
- JC_simulated_data.tar.gz (52.42 GB)
- K2P_deg3.tar.gz (2 MB)
- K2P_simulated_data.tar.gz (52.43 GB)
- NMSC_results.tar.gz (15 KB)
- NMSC_simulated_data.tar.gz (176.31 MB)
- QNR-SVM_results.tar.gz (840.47 KB)
- README.md (5.72 KB)
- Xiphophorus_bootstrap.tar.gz (1.76 MB)
Abstract
A core goal of phylogenomics is to determine the evolutionary history of a set of species from biological sequence data. Phylogenetic networks are able to describe more complex evolutionary phenomena than phylogenetic trees, but are more difficult to reconstruct accurately. Recently, there has been growing interest in developing methods to infer semi-directed phylogenetic networks. As inferring such networks can be computationally intensive, one approach is to puzzle together smaller networks. Thus, it is essential to have robust methods for inferring semi-directed phylogenetic networks on small numbers of taxa. In this paper, we investigate an algebraic method for performing phylogenetic network inference from nucleotide sequence data on 4-leaf semi-directed phylogenetic networks by analysing the distribution of leaf-pattern probabilities. On simulated data, we found that we can identify the undirected phylogenetic network with high accuracy for sequences of length at least 10 kbp. We found that identifying the semi-directed network is more challenging and requires sequences of length approaching 10 Mbp. We are also able to use our approach to identify tree-like evolution and determine the underlying tree. Finally, we employ our method on a real dataset of Xiphophorus species and use the results to build a phylogenetic network.
Dataset DOI: 10.5061/dryad.44j0zpcrk
Description of the data and file structure
Simulated data and results from "Algebraic Invariants for Inferring 4-leaf Semi-Directed Phylogenetic Networks". Simulation and analysis tools available at https://github.com/SR-Martin/4cycle_invariants.
Files and variables
File: JC_simulated_data.tar.gz
Description: JC simulated alignments generated by the script simulateJC.py. Alignments are in phylip format and organised according to the dataset. E.g. the directory network0123 contains the alignments generated under the sunlet (0,1,2,3) used in Section 3.1. The directory tree_ratio contains the alignments for which the tree ratio is varied in Section 3.2, where e.g. the directory g_.35 contains those alignments for which the tree ratio was 0.35. Each alignment file is named according to the network it was generated under, the length of the alignment, and the replicate number. E.g. network_0213_medium_1000_1.phylip is the first replicate generated under the sunlet (0,2,1,3) with length 1000.
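The naming convention above can be unpacked programmatically. The following is a minimal sketch, assuming every alignment name follows the pattern of the single example given (network_0213_medium_1000_1.phylip); the function and class names are hypothetical, and the meaning of the middle token ("medium") is not stated in this README, so it is kept as an opaque label. File names in other directories such as tree_ratio may differ.

```python
import re
from typing import NamedTuple, Tuple

# Pattern inferred from the one documented example,
# network_0213_medium_1000_1.phylip. This is an assumption, not a
# specification from the dataset authors.
ALIGNMENT_NAME = re.compile(
    r"^network_(?P<sunlet>\d{4})_(?P<label>[^_]+)"
    r"_(?P<length>\d+)_(?P<replicate>\d+)\.phylip$"
)


class AlignmentInfo(NamedTuple):
    sunlet: Tuple[int, ...]  # leaf labelling of the sunlet, e.g. (0, 2, 1, 3)
    label: str               # middle token, meaning not documented here
    length: int              # alignment length in sites
    replicate: int           # replicate number


def parse_alignment_name(filename: str) -> AlignmentInfo:
    """Split an alignment file name into its documented components."""
    match = ALIGNMENT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected alignment file name: {filename!r}")
    return AlignmentInfo(
        sunlet=tuple(int(digit) for digit in match["sunlet"]),
        label=match["label"],
        length=int(match["length"]),
        replicate=int(match["replicate"]),
    )
```

For instance, `parse_alignment_name("network_0213_medium_1000_1.phylip")` yields the sunlet (0, 2, 1, 3), length 1000, and replicate 1, matching the description above.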
File: K2P_simulated_data.tar.gz
Description: K2P simulated alignments generated by the script simulateK2P.py. Directory structure is the same as for JC_simulated_data.tar.gz.
File: JC_deg3.tar.gz
Description: Results on JC simulated data using all JC degree 3 invariants. Each file contains the output of the script evaluate.py on the corresponding simulated alignment. Files are organised by directories mirroring the directory structure of the file JC_simulated_data.tar.gz, and named to match the simulated alignment, e.g. the file network0123/results/results_1000_1.txt is the output from the command:
python evaluate.py -m JC -i invariants/4LeafJC_GB_deg3.txt -a /path/to/JC/simulated/data/network0123/network_0123_medium_1000_1.phylip
File: JC_phy_deg3.tar.gz
Description: Results on JC simulated data as above but using all JC degree 3 'phylogenetic' invariants. For example, the file network0123/results/results_1000_1.txt is the output from the command:
python evaluate.py -m JC -i invariants/4LeafJC_GB_phylogenetic_deg3.txt -a /path/to/JC/simulated/data/network0123/network_0123_medium_1000_1.phylip
File: K2P_deg3.tar.gz
Description: Results on K2P simulated data using all K2P degree 3 invariants. Each file contains the output of the script evaluate.py on the corresponding simulated alignment. Files are organised by directories mirroring the directory structure of the file K2P_simulated_data.tar.gz, and named to match the simulated alignment, e.g. the file network0123/results/results_1000_1.txt is the output from the command:
python evaluate.py -m K2P -i invariants/4LeafK2P_GB_deg3.txt -a /path/to/K2P/simulated/data/network0123/network_0123_medium_1000_1.phylip
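To regenerate results for many alignments, the commands quoted in the result-file descriptions can be assembled in a loop. The sketch below only builds the argument list and the matching results file name; it uses the `-m`, `-i`, and `-a` flags and the invariant file paths exactly as quoted in this README. The helper names are hypothetical, and how the script's output is captured into a results file is left as an assumption for the reader to check against the repository.

```python
from pathlib import Path
from typing import List

# Invariant files as quoted in this README for the full degree 3 sets.
INVARIANTS = {
    "JC": "invariants/4LeafJC_GB_deg3.txt",
    "K2P": "invariants/4LeafK2P_GB_deg3.txt",
}


def evaluate_command(model: str, alignment: str) -> List[str]:
    """Build the evaluate.py invocation for one alignment (not executed here)."""
    return [
        "python", "evaluate.py",
        "-m", model,
        "-i", INVARIANTS[model],
        "-a", str(alignment),
    ]


def results_name(alignment: str) -> str:
    """Map network_0123_medium_1000_1.phylip to results_1000_1.txt.

    Assumes the last two underscore-separated fields of the alignment
    name are the length and the replicate number.
    """
    parts = Path(alignment).stem.split("_")
    length, replicate = parts[-2], parts[-1]
    return f"results_{length}_{replicate}.txt"
```

The returned list can be passed to `subprocess.run` from a checkout of the code repository; the archived results place each file under a `results` directory inside the corresponding network directory.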
File: QNR-SVM_results.tar.gz
Description: Results of running the script evaluate.py on QNR-SVM data using all JC degree 3 invariants, as in Section 3.4.
File: Xiphophorus_bootstrap.tar.gz
Description: Bootstrap results on Xiphophorus data using all K2P degree 3 invariants, as described in Section 3.7. Each file contains the output of the script evaluate_bootstrap.py on the corresponding alignment. Files are split into directories of 1,000 files each. Each file is named after the 4-taxon subset that it evaluates, e.g. the file Xandersi_Xgordoni_Xmontezuma_Xsignum_results.txt is the output for the alignment of X. andersi, X. gordoni, X. montezuma, and X. signum.
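The 4-taxon subset can be recovered from each bootstrap results file name. This is a minimal sketch with hypothetical helper names, assuming that taxon labels contain no underscores and that each label is a one-letter genus initial followed by the species epithet, as in the example above.

```python
from typing import List

RESULTS_SUFFIX = "_results.txt"


def taxa_from_results_name(filename: str) -> List[str]:
    """Return the four taxon labels encoded in a bootstrap results file name."""
    if not filename.endswith(RESULTS_SUFFIX):
        raise ValueError(f"unexpected results file name: {filename!r}")
    labels = filename[: -len(RESULTS_SUFFIX)].split("_")
    if len(labels) != 4:
        raise ValueError(f"expected 4 taxa, found {len(labels)} in {filename!r}")
    return labels


def display_name(label: str) -> str:
    """Format a label such as 'Xandersi' as 'X. andersi' (assumed convention)."""
    return f"{label[0]}. {label[1:]}"
```

Applied to `Xandersi_Xgordoni_Xmontezuma_Xsignum_results.txt`, the first function returns the four labels and the second turns each into the abbreviated binomial used in the description.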
File: NMSC_simulated_data.tar.gz
Description: Gene trees simulated under the network multispecies coalescent (NMSC) model for the three networks described in Section 3.6, and the corresponding simulated alignments. Each directory corresponds to a network (network1, network2, or network3) and contains simulated gene trees in Newick format (e.g. genetrees_0.5_1.phy contains a list of gene trees, simulated under network 1 and with CU multiplier 0.5), and the corresponding alignments in phylip format, 10 for each gene tree file (e.g. sim1_0.5_1.phy contains alignments simulated under the trees in the file genetrees_0.5_1.phy). Gene trees were simulated by the software PhyloCoal; alignments were simulated under the JC model by the software AliSim.
File: NMSC_results.tar.gz
Description: Results on NMSC data using all JC degree 3 invariants. Each file contains the output of the script evaluate.py on the corresponding alignment. The directory structure mirrors that in NMSC_simulated_data.tar.gz, e.g. the file network1/results1_0.5_1.txt is the output from running evaluate.py on the file network1/sim1_0.5_1.phy.
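The correspondence between NMSC alignment, gene tree, and results files can be expressed as a small lookup. This sketch infers the pattern from the examples in this README (sim1_0.5_1.phy, genetrees_0.5_1.phy, results1_0.5_1.txt); the function name is hypothetical, and the meaning of the final numeric field is not stated here, so it is carried through unchanged as an index.

```python
import re
from typing import Dict

# Inferred from the documented example sim1_0.5_1.phy:
# sim<network>_<CU multiplier>_<index>.phy
SIM_NAME = re.compile(
    r"^sim(?P<network>\d+)_(?P<multiplier>[0-9.]+)_(?P<index>\d+)\.phy$"
)


def nmsc_related_files(sim_filename: str) -> Dict[str, str]:
    """Return the directory, gene tree file, and results file for an alignment file."""
    match = SIM_NAME.match(sim_filename)
    if match is None:
        raise ValueError(f"unexpected NMSC alignment name: {sim_filename!r}")
    network = match["network"]
    multiplier = match["multiplier"]
    index = match["index"]
    return {
        "directory": f"network{network}",
        "gene_trees": f"genetrees_{multiplier}_{index}.phy",
        "results": f"results{network}_{multiplier}_{index}.txt",
    }
```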
File: GMM_simulated_data.tar.gz
Description: Alignments simulated under the general Markov model (GMM) on the network (0,1,2,3), by the script simulateGMM.py. Alignments are in phylip format and organised and named according to the length of the alignment. Each length has 100 replicates.
File: GMM_results.tar.gz
Description: Results on GMM data using all JC degree 3 invariants and all K2P degree 3 invariants. Each file contains the output of the script evaluate.py on the corresponding simulated alignment, as in other results files.
Code/software
Results files are in plain-text format. Simulated alignments are in phylip format. All code is written in Python and available at https://github.com/SR-Martin/4cycle_invariants.
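Because the two largest archives are over 50 GB each, it may be convenient to inspect them without unpacking everything. The following sketch uses only Python's standard `tarfile` module in streaming mode to list the alignment files in an archive; the function name and the default `.phylip` suffix filter are illustrative choices, not part of the authors' tooling.

```python
import tarfile
from typing import Iterator


def iter_member_names(archive_path: str, suffix: str = ".phylip") -> Iterator[str]:
    """Yield names of regular files in a .tar.gz archive that end with `suffix`.

    Mode "r|gz" reads the archive as a forward-only stream, so nothing is
    extracted to disk and memory use stays small.
    """
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(suffix):
                yield member.name
```

For example, `sum(1 for _ in iter_member_names("JC_simulated_data.tar.gz"))` would count the alignments in that archive; a full pass still has to decompress the whole file, so it can take a while on the larger archives.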
