Network calibration in Xiphophorus

This IJulia notebook (ipynb) file describes the calibration of networks obtained with SNaQ, on the Xiphophorus group, using pairwise genetic distances. These distances were obtained with these steps:

  • obtain a RAxML gene tree from each locus
  • normalize each gene tree have a median branch length of 1 (to remove/reduce rate variation across loci, in a robust way)
  • calculate the pairwise distances between all taxa from each normalized gene tree (using function pairwiseTaxonDistanceMatrix)
  • average these distance matrices. This average distance matrix is the input for the network calibration.

To launch the IJulia environement, the following commands can be executed. (See the IJulia webpage for more information.)

In [1]:
# Pkg.add("IJulia")
# using IJulia; notebook(detached=true)

Install the necessary packages, if not installed already:

In [2]:
# Pkg.add("PhyloNetworks")
# Pkg.add("PhyloPlots")
# Pkg.add("RCall")

Data Import

We first read the networks:

In [3]:
using PhyloNetworks
net0 = readTopology("xiphophorus_snaq_h0.tre");
net1 = readTopology("xiphophorus_snaq_h1.tre");
net3 = readTopology("xiphophorus_snaq_h3.tre");

then root them correctly:

In [4]:
using PhyloPlots
using RCall
R"par(mar=c(0,0,0,0))"
plot(net0, :RCall, showEdgeNumber=true, xlim=[0,30]);
rootonedge!(net0, 15);
plot(net0, :RCall, xlim=[0,25]);
rootonedge!(net1, 43);
rootonedge!(net3, 42);

and rotate edges around some nodes, to plot these networks in the same way down the road:

In [5]:
for i in [24,-7,-8,-9,-5,-2,-20,-21]
    rotate!(net0, i)
end

Next we read the pairwise distance matrix.

In [6]:
using DataFrames
df = CSV.read("xiphophorus_genetree_pairwiseDistances.csv") # data frame
avD = disallowmissing(Array(df));  # 2-dimensional array
taxa = String.(names(df)); # taxon names: were column names in data frame

Network Calibration

tree h=0

By default, the calibration uses the option ultrametric=true, which assumes that tips are present-day taxa, and optimizes node ages, so the calibrated networks is guaranteed to be time-consistent. With this option, it is important for the topology to be rooted correctly (which was done earlier).

The calibration function was run multiple time and with 2 different optimization algorithms (either using the gradient or not), to ensure best optimization.

In [7]:
calibrateFromPairwiseDistances!(net0, avD, taxa, forceMinorLength0=true, NLoptMethod=:LD_MMA) # verbose=true
calibrateFromPairwiseDistances!(net0, avD, taxa, forceMinorLength0=true, NLoptMethod=:LN_COBYLA)
calibrateFromPairwiseDistances!(net0, avD, taxa, forceMinorLength0=true, NLoptMethod=:LD_MMA)
Out[7]:
(5748.659331126962, [19.6313, 17.6559, 10.2669, 10.2669, 8.36208, 11.2546, 16.5749, 7.41099, 2.6221, 1.34782  …  5.59883, 5.59883, 16.4683, 15.634, 14.177, 9.61756, 9.61756, 8.32924, 1.3295, 1.3295], :XTOL_REACHED)

In what follows, we look for edges estimated to 0 (up to rounding): there are 5 such edges (lengths <5e-7). and set them to 0.0, which avoid negative branch lengths of -0.

In [8]:
filter(x -> isapprox(x, 0., atol=1e-5), [e.length for e in net0.edge])
for e in net0.edge
    if isapprox(e.length, 0., atol=1e-5) e.length=0.0; end
end
R"par(mar=c(0,0,0,0))"
plot(net0, :RCall, useEdgeLength=true, xlim=[0,28]);

network h=1

Same option: ultrametric=true (the network was rooted correctly earlier).
Extra option: forceMinorLength0=true to force minor hybrid edges to have length 0, i.e. no unsampled taxa.
The calibration optimization was run as shown below, but several more iterations of each optimization method, until the score did not improve. :LN_COBYLA seemed to get a better score than :LD_MMA at a given iteration, so :LN_COBYLA was run several more times.

In [9]:
calibrateFromPairwiseDistances!(net1, avD, taxa, forceMinorLength0=true, NLoptMethod=:LD_MMA);
calibrateFromPairwiseDistances!(net1, avD, taxa, forceMinorLength0=true, NLoptMethod=:LN_COBYLA);
calibrateFromPairwiseDistances!(net1, avD, taxa, forceMinorLength0=true, NLoptMethod=:LD_MMA);

The same polytomies were found as with the tree, when looking at edges of length 0:

In [10]:
filter(x -> isapprox(x, 0., atol=1e-5), [e.length for e in net1.edge])
for e in net1.edge
    if isapprox(e.length, 0., atol=1e-5) e.length=0.0; end
end
R"par(mar=c(0,0,0,0))"
plot(net1, :RCall, useEdgeLength=true, xlim=[0,28]);

network h=3

On the network with 3 reticulations, we do let minor hybrid edges have a positive length, because 2 of them imply the existence of unsampled taxa. Again, more round were run than shown below, until the criterion stopped improving (i.e decreasing).

In [11]:
calibrateFromPairwiseDistances!(net3, avD, taxa, forceMinorLength0=false, NLoptMethod=:LD_MMA);
calibrateFromPairwiseDistances!(net3, avD, taxa, forceMinorLength0=false, NLoptMethod=:LN_COBYLA);
calibrateFromPairwiseDistances!(net3, avD, taxa, forceMinorLength0=false, NLoptMethod=:LD_MMA);

again, some edges were estimated with a 0 length (or near-0, which we turn to 0). 1 of them is the minor hybrid edge that was allowed to have a positive length.

In [12]:
filter(x -> isapprox(x, 0., atol=1e-5), [e.length for e in net3.edge])
for e in net3.edge
    if isapprox(e.length, 0., atol=1e-5) e.length=0.0; end
end
R"par(mar=c(0,0,0,0))"
plot(net3, :RCall, useEdgeLength=true, xlim=[0,30]);

Save and plot calibrated networks

In [13]:
writeMultiTopology([net0,net1,net3], "calibratednetworks.tre")
In [14]:
R"pdf('bestnets_calibrated.pdf', width=6, height=12)"
R"layout(matrix(1:3,3,1))"
R"par(mar=c(1,0,0,1), mgp=c(1.4,.4,0))"
plot(net0, :RCall, useEdgeLength=true, xlim=[0,25]);
plot(net1, :RCall, useEdgeLength=true, xlim=[0,25]);
plot(net3, :RCall, useEdgeLength=true, xlim=[0,25]);
R"dev.off()";