Supplemental data from: Inference of phylogenetic networks from sequence data using composite likelihood

Kong, Sungsik; Swofford, David; Kubatko, Laura

Published Sep 30, 2024 on Dryad. https://doi.org/10.5061/dryad.bg79cnpkm

Data files

Sep 30, 2024 version files (4.55 MB total)

phynest-data.zip (4.54 MB)
README.md (10.38 KB)

Abstract

While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between two species leads to the formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogenetic tree by allowing two branches to merge into one, resulting in reticulation. However, existing methods for estimating networks become computationally expensive as the dataset size and/or topological complexity increase. The lack of methods for scalable inference hampers phylogenetic networks from being widely used in practice, despite accumulating evidence that hybridization occurs frequently in nature. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates binary, level-1 phylogenetic networks with a fixed, user-specified number of reticulations directly from sequence data. By using the composite likelihood as the basis for inference, PhyNEST is able to use the full genomic data in a computationally tractable manner, eliminating the need to summarize the data as a set of gene trees prior to network estimation. To search network space, PhyNEST implements both hill climbing and simulated annealing algorithms. PhyNEST assumes that the data are composed of coalescent independent sites that evolve according to the Jukes-Cantor substitution model and that the network has a constant effective population size. Simulation studies demonstrate that PhyNEST is often more accurate than two existing composite likelihood summary methods (SNaQ and PhyloNet) and that it is robust to at least one form of model misspecification (assuming a less complex nucleotide substitution model than the true generating model). 
We applied PhyNEST to reconstruct the evolutionary relationships among Heliconius butterflies and Papionini primates, characterized by hybrid speciation and widespread introgression, respectively. PhyNEST is implemented in an open-source Julia package and is publicly available at https://github.com/sungsik-kong/PhyNEST.jl.

Description of the data and file structure

Written by Sungsik Kong, September 2024 (sungsik.kong@gmail.com)

Supplementary Materials are in phynest-supp-kong-swofford-kubatko.pdf (Zenodo)

Folder main-text contains Illustrator files, all output files from the analyses, and R scripts for visualizing those outputs as presented in the figures in "Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood" by Sungsik Kong, David Swofford, and Laura Kubatko, published in Systematic Biology. These files are useful for reproducing the figures of this study.

In detail:

main-text > Figure 1

Figure1.key: Keynote file for Figure 1
Figure1.pdf: PDF file for Figure 1

main-text > Figure 2

Figure2.ai: Adobe Illustrator file for Figure 2
Figure2.pdf: PDF file for Figure 2

main-text > Figure 3

net_distance_concat_all1.csv: Data file containing the hardwired cluster dissimilarity (HWCD) between the true network topology from which the data were generated (i.e., Figure 2a, referred to as N1 hereafter) and the network topology estimated by each of SNaQ, PhyloNet, PhyNEST using hill climbing (PhyNEST_HC), and PhyNEST using simulated annealing (PhyNEST_SA). The headings method, gt, i, and dist represent the names of the four methods evaluated, dataset size in terms of the number of gene trees, replicate ID, and the HWCD measured, respectively.
net_distance_concat_all2.csv: Data file for the scenario shown in Figure 2b (or N2).
Figure3.R: R script to create Figure 3 in the main text using the two data files above.
Figure3.pdf: PDF file for Figure 3.
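
As a rough illustration of how the HWCD table could be summarized outside of Figure3.R, the sketch below averages dist for each combination of method and gt using only the Python standard library. The column names are taken from the description above. The function name mean_hwcd and the inline sample rows are hypothetical placeholders, not values from the dataset, and the sketch assumes the file is comma-separated with a header row and that gt holds integers.

```python
import csv
import io
from collections import defaultdict


def mean_hwcd(handle):
    """Return {(method, gt): mean dist} from a CSV with columns method, gt, i, dist."""
    totals = defaultdict(lambda: [0.0, 0])  # (method, gt) -> [sum of dist, row count]
    for row in csv.DictReader(handle):
        key = (row["method"], int(row["gt"]))
        totals[key][0] += float(row["dist"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical rows for illustration only; NOT taken from the dataset.
sample = io.StringIO(
    "method,gt,i,dist\n"
    "SNaQ,100,1,2\n"
    "SNaQ,100,2,0\n"
    "PhyNEST_HC,100,1,0\n"
)
summary = mean_hwcd(sample)
for (method, gt), value in sorted(summary.items()):
    print(method, gt, value)
```

To run this on the real data, the in-memory sample would be replaced by a file handle such as open("net_distance_concat_all1.csv", newline="").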

main-text > Figure 4

time-sc1.csv: Data file containing the running time needed to infer the network using SNaQ, PhyloNet, PhyNEST using hill climbing (PhyNEST_HC), and PhyNEST using simulated annealing (PhyNEST_SA) for the network shown in Figure 2a. The headings t0, method, g, and i represent the running time measured in seconds, the names of the four methods evaluated, dataset size in terms of the number of gene trees, and replicate ID, respectively.
time-sc2.csv: Data file for the scenario shown in Figure 2b.
Figure4.R: R script to create Figure 4 in the main text using the two data files above.
Figure4.pdf: PDF file for Figure 4. In this figure, the y-axis represents time measured in hours (the data files record seconds), log-scaled for visualization. Please see Figure 4 in the main text for more information.
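
Because the timing files record seconds while Figure 4 is drawn in hours on a log scale, a unit conversion sits between the two. The sketch below shows one way to do that conversion with the Python standard library. The column names come from the description above. The function name hours_and_log10, the choice of base 10, and the sample rows are assumptions made for illustration and are not taken from Figure4.R or the dataset.

```python
import csv
import io
import math


def hours_and_log10(handle):
    """Yield (method, g, i, hours, log10(hours)) from a CSV with columns t0, method, g, i."""
    for row in csv.DictReader(handle):
        hours = float(row["t0"]) / 3600.0  # t0 is recorded in seconds
        # log10 is undefined for non-positive times, so those are reported as None
        log_hours = math.log10(hours) if hours > 0 else None
        yield row["method"], row["g"], row["i"], hours, log_hours


# Hypothetical rows for illustration only; NOT taken from the dataset.
sample = io.StringIO(
    "t0,method,g,i\n"
    "7200,SNaQ,100,1\n"
    "360,PhyNEST_SA,100,1\n"
)
converted = list(hours_and_log10(sample))
for record in converted:
    print(record)
```

The same generator could be fed open("time-sc1.csv", newline="") or open("time-sc2.csv", newline="") in place of the in-memory sample.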

main-text > Figure 5

Figure5.ai: Adobe Illustrator file for Figure 5
Figure5.pdf: PDF file for Figure 5

main-text > Figure 5 > analysis

starttree_sc1.txt: Starting topology used to infer the network in Figure 5a
starttree_sc2.txt: Starting topology used to infer the network in Figure 5b
PhyNEST_hc_sc1_hmax1.log: Analysis log for the network in Figure 5a using hill climbing with hmax=1
PhyNEST_hc_sc2_hmax1.log: Analysis log for the network in Figure 5b using hill climbing with hmax=1
PhyNEST_sa_sc1_hmax1.log: Analysis log for the network in Figure 5a using simulated annealing with hmax=1
PhyNEST_sa_sc2_hmax1.log: Analysis log for the network in Figure 5b using simulated annealing with hmax=1
PhyNEST_hc_sc1_hmax1.out: Final network inferred for the species in Figure 5a using hill climbing with hmax=1
PhyNEST_hc_sc2_hmax1.out: Final network inferred for the species in Figure 5b using hill climbing with hmax=1
PhyNEST_sa_sc1_hmax1.out: Final network inferred for the species in Figure 5a using simulated annealing with hmax=1
PhyNEST_sa_sc2_hmax1.out: Final network inferred for the species in Figure 5b using simulated annealing with hmax=1

main-text > Figure 6

Figure6-bl.ai: Adobe Illustrator file for Figure 6
Figure6-bl.pdf: PDF file for Figure 6

main-text > Figure 6 > analysis

starttree.txt: Starting topology used to infer the networks in Figure 6
PhyNEST_hc_hmax1.log: Analysis log for the Papionini network in Figure 6 using hill climbing with hmax=1
PhyNEST_hc_hmax2.log: Analysis log for the Papionini network in Figure 6 using hill climbing with hmax=2
PhyNEST_sa_hmax1.log: Analysis log for the Papionini network in Figure 6 using simulated annealing with hmax=1
PhyNEST_sa_hmax2.log: Analysis log for the Papionini network in Figure 6 using simulated annealing with hmax=2
PhyNEST_hc_hmax1.out: Final network inferred for the Papionini species in Figure 6 using hill climbing with hmax=1
PhyNEST_hc_hmax2.out: Final network inferred for the Papionini species in Figure 6 using hill climbing with hmax=2
PhyNEST_sa_hmax1.out: Final network inferred for the Papionini species in Figure 6 using simulated annealing with hmax=1
PhyNEST_sa_hmax2.out: Final network inferred for the Papionini species in Figure 6 using simulated annealing with hmax=2

Folder supplementary-materials contains all output files and R scripts for visualizing those outputs as presented in the figures in the supplementary materials of "Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood" by Sungsik Kong, David Swofford, and Laura Kubatko, published in Systematic Biology.

In detail:

supplementary-materials > S2-S5

S2-S5_summary1.csv: Data file containing information about the accuracy and efficiency of the simulation study conducted in SM S1 using the network scenario shown in Figure S1a. Each row represents the result from one replicate (100 replicates in total) for one of the four methods (SNaQ, PhyloNet, PhyNEST_HC, and PhyNEST_SA) at a given dataset size. The columns Method, g, and x1 represent the method evaluated, the number of gene trees in the data, and the replicate ID, respectively. The columns t and x2 represent the time to complete the analysis measured in seconds and the final network inferred, respectively. Columns tr1, tr2, and tr3 represent the number of trees found in the true network but not in the estimated network divided by the number of trees in the true network (the false negative rate), the number of trees found in the estimated network but not in the true network divided by the number of trees in the true network (the false positive rate), and the average of the two rates, respectively. Columns cl1, cl2, and cl3 are analogous to tr1, tr2, and tr3, respectively, but use clusters instead of trees. Columns tp1, tp2, and tp3 are also analogous but use tripartitions instead of trees or clusters.
S2-S5_summary2.csv: Data file similar to S2-S5_summary1.csv using the network scenario in Figure S1b.
S2-S5_iqtree1.csv: Data file containing the running time information for gene tree estimation using IQ-TREE. Columns gt and loci represent the analysis running time measured in seconds and the number of loci in the dataset (generated using the scenario in Figure S1a), respectively.
S2-S5_iqtree2.csv: Data file similar to S2-S5_iqtree1.csv but for the scenario in Figure S1b.
FiguresS2-5.R: R script to create Figures S2 to S5 using the four data files above.
FigureS2.pdf: PDF file for Figure S2.
FigureS3.pdf: PDF file for Figure S3.
FigureS4.pdf: PDF file for Figure S4.
FigureS5.pdf: PDF file for Figure S5.
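
The definitions of tr1, tr2, and tr3 (and their cluster and tripartition counterparts) can be sketched as simple set operations. The example below is a minimal illustration of those definitions as worded above, with both rates divided by the number of items in the true network. The function name dissimilarity_rates and the Newick strings are hypothetical, and comparing trees as plain strings assumes each tree is already written in a single canonical form, which may not match how the values in the data files were actually computed.

```python
def dissimilarity_rates(true_items, estimated_items):
    """Return (false negative rate, false positive rate, mean of the two).

    Both rates are divided by the number of items in the true network,
    following the column descriptions in this README.
    """
    true_set, est_set = set(true_items), set(estimated_items)
    if not true_set:
        raise ValueError("the true network must contribute at least one item")
    false_negative = len(true_set - est_set) / len(true_set)
    false_positive = len(est_set - true_set) / len(true_set)
    return false_negative, false_positive, (false_negative + false_positive) / 2


# Hypothetical displayed trees for illustration only; NOT taken from the dataset.
true_trees = {"((A,B),C);", "(A,(B,C));"}
est_trees = {"((A,B),C);", "((A,C),B);"}
rates = dissimilarity_rates(true_trees, est_trees)
print(rates)
```

Passing sets of clusters or tripartitions instead of trees would give the analogues of cl1 to cl3 and tp1 to tp3 under the same assumptions.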

supplementary-materials > S6

fig_inheritance.key: Keynote file for Figure S6
fig_inheritance.pdf: PDF file for Figure S6

supplementary-materials > S8

S8_data1.csv: Data file containing information about the efficiency and accuracy of the simulated annealing algorithm for the network search under various combinations of alpha and constant values, summarized over 20 replicates. The headers Alpha, Constant, AverageNumStep, NumMCLNet, and Score represent the selected alpha value, the selected constant value, the average number of steps taken to complete the analysis given the selected alpha and constant values (maximum=1001), the number of times the true topology was inferred, and the computed F-score, respectively.
S8_data2.csv: Data file containing information from each replicate. The headers Alpha, Constant, Likelihood, and NumSteps represent the selected alpha value, the selected constant value, the composite likelihood of the network inferred in each replicate, and the number of steps taken until the end of the analysis (maximum=1001), respectively.
FigureS8.R: R script to create Figure S8 using the two data files above.
FigureS8.pdf: PDF file for Figure S8.

supplementary-materials > S9

S9_data1.csv: Data file containing information about the parameter estimation for the network shown in Figure S1a. The columns NumLoci, Tau# (# represents the index number assigned to each vertex), Gamma% (% represents the index number assigned to each inheritance probability), Theta, DeltaT#, and RatioT# represent the number of loci in the simulated data, the estimate of Tau#, the estimate of Gamma%, the estimate of Theta, the difference between the estimated Tau# and the true value, and the ratio of the estimated Tau# to the true value, respectively.
S9_data2.csv: Data file containing information about the parameter estimation for the network shown in Figure S1b. See the description for S9_data1.csv for what each column represents.
FigureS9.R: R script to create Figure S9 using the two data files above.
FigureS9.pdf: PDF file for Figure S9.

supplementary-materials > S11

S11_data.csv: Data file containing information about the topological accuracy and the estimated Gamma (when hmax=1) of the network inferred by PhyNEST from data generated under a tree. The headings level, RF1, RF2, majGamma, and minGamma represent the name of the scenario considered, the Robinson-Foulds (RF) distance between the tree estimated by PhyNEST (hmax=0) and the true tree, the RF distance between the major tree of the network inferred by PhyNEST (hmax=1) and the true tree, the major Gamma of the estimated network (hmax=1), and the minor Gamma of the estimated network (hmax=1), respectively.
S11_plots.R: R script to create Figure S11 using the data file above.
S11.pdf: PDF file for Figure S11.

Files and variables

File: phynest-data.zip

Description: Compressed .zip file containing two folders, 'main-text' and 'supplementary-materials', which hold all data and R code needed to replicate the figures in the main text and supplementary materials of the article entitled "Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood" in Systematic Biology.