Estimating waiting distances between genealogy changes under a multi-species extension of the sequentially Markov coalescent
Data files
Oct 22, 2025 version (files: 60.32 KB total)
- mcmc2.py (33.64 KB)
- README.md (3.08 KB)
- validate-2.py (13.30 KB)
- validate-x5.py (10.30 KB)
Abstract
Genomes are composed of a mosaic of segments inherited from different ancestors, each separated by past recombination events. Consequently, genealogical relationships among multiple genomes vary spatially across different genomic regions. Genealogical variation among unlinked (uncorrelated) genomic regions is well described for either a single population (coalescent) or multiple structured populations (multispecies coalescent). However, the expected similarity among genealogies at linked regions of a genome is less well characterized. Recently, an analytical solution was derived for the distribution of the waiting distance for a change in the genealogical tree spatially across a genome for a single population with constant effective population size. Here we describe a generalization of this result, in terms of the distribution of waiting distances between changes in genealogical trees and topologies for multiple structured populations with branch-specific effective population sizes (i.e., under the multispecies coalescent). We implemented our model in the Python package ipcoal and validated its accuracy against stochastic coalescent simulations. Using a novel likelihood framework, we show that tree and topology-change waiting distances in an ARG can be used to fit species tree model parameters, demonstrating an application of our model for developing new methods for phylogenetic inference. The Multi-Species Sequentially Markov Coalescent (MS-SMC) model presented here represents a major advance for linking local ancestry inference to hierarchical demographic models.
https://doi.org/10.5061/dryad.jdfn2z3n7
Description of the data and file structure
This Dryad archive contains the Python scripts and Jupyter notebooks needed to reproduce all analyses. The scripts are also available in the GitHub repository linked below. The notebooks (uploaded to Zenodo) can be viewed in rendered form at the following links:
- notebook 1: Demonstration
- notebook 2: Validation
- notebook 3: Likelihood Surface
- notebook 4: Likelihood MCMC
- notebook 5: Topo inhomogeneity bias
This archive also contains the following scripts, which the notebooks call to run the manuscript's large-scale simulation analyses in parallel.
- mcmc2.py
- validate-2.py
- validate-x5.py
validate-2.py
This script generates the data analyzed in notebook 2. It simulates ARGs and records, for each, the observed waiting distances alongside the waiting distances predicted under the MS-SMC, across a range of demographic models and Ne values. The results are used to assess the accuracy of our approach.
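As a rough illustration of what an "observed waiting distance" is, the sketch below tabulates tree-change and topology-change distances from a list of ARG intervals. It does not reproduce validate-2.py. The interval tuples, the genealogy and topology labels, and the function name `waiting_distances` are hypothetical, and it assumes each interval already carries an identifier for its genealogy and for its topology.

```python
from itertools import groupby


def waiting_distances(intervals, key):
    """Return lengths of runs of consecutive intervals that share key(interval).

    intervals: list of (start, end, tree_id, topology_id) tuples, sorted by
    start and contiguous (hypothetical layout, assumed for this sketch).
    The final run is dropped because it is cut off by the end of the
    sequence rather than by an observed change.
    """
    runs = []
    for _, group in groupby(intervals, key=key):
        group = list(group)
        # Distance from the start of the first interval in the run
        # to the end of the last interval in the run.
        runs.append(group[-1][1] - group[0][0])
    return runs[:-1]


# Hypothetical toy ARG: four genealogies, the first two sharing a topology.
intervals = [
    (0, 120, "g0", "topoA"),
    (120, 300, "g1", "topoA"),
    (300, 450, "g2", "topoB"),
    (450, 1000, "g3", "topoA"),
]

# Every new genealogy counts as a tree change.
tree_dists = waiting_distances(intervals, key=lambda iv: iv[2])
# Only a change in topology label counts as a topology change.
topo_dists = waiting_distances(intervals, key=lambda iv: iv[3])
```

Grouping by topology label merges adjacent genealogies that differ only in branch lengths, which is why topology-change distances are never shorter than the tree-change distances they span.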
validate-x5.py
This script generates the data analyzed in notebook 5. It simulates ARGs under the SMC' and Hudson models and stores their observed waiting distances alongside the waiting distances predicted under the MS-SMC. The results are used to measure the error and bias in our calculations caused by the SMC' approximation and by the inhomogeneity of trees between topology-change events.
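One simple way to summarize error and bias is to compare paired observed and predicted distances, as sketched below. This is an assumption about how such a comparison could be done, not the metric used in validate-x5.py or notebook 5. The function name `bias_and_error` and the numbers are hypothetical.

```python
from statistics import mean


def bias_and_error(observed, predicted):
    """Return (mean signed error, mean absolute error) of predicted - observed.

    A positive signed error means the predictions overestimate the
    observed waiting distances on average.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    diffs = [p - o for o, p in zip(observed, predicted)]
    return mean(diffs), mean([abs(d) for d in diffs])


# Hypothetical paired values (e.g., mean waiting distances per scenario).
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 430.0]
bias, abs_error = bias_and_error(observed, predicted)
```

The signed error captures systematic over- or under-prediction, while the absolute error captures overall deviation regardless of direction.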
mcmc2.py
This script simulates ARGs under a specified species tree and then infers the parameters of that species tree using a full-likelihood approach based on (1) genealogy probabilities; (2) waiting distance probabilities; or (3) both.
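To make option (2) concrete, the sketch below scores a set of observed waiting distances under the assumption that each distance is exponentially distributed with its own rate. The exponential form, the function name `waiting_distance_loglik`, and the rate values are assumptions made for illustration. The rates here are made-up numbers standing in for values that a model such as the MS-SMC would supply, and the code does not reproduce mcmc2.py.

```python
import math


def waiting_distance_loglik(distances, rates):
    """Sum of exponential log-densities, log(rate) - rate * distance.

    distances: observed waiting distances (one per interval).
    rates: per-interval rate parameters (hypothetical inputs; assumed to
    be computed elsewhere from the species tree and genealogy).
    """
    if len(distances) != len(rates):
        raise ValueError("distances and rates must have the same length")
    return sum(math.log(r) - r * d for d, r in zip(distances, rates))


# Hypothetical observed distances with a mean of 150.
distances = [120.0, 180.0, 150.0]

# Rates implying an expected distance of 150 fit better than rates
# implying an expected distance of 15.
ll_good = waiting_distance_loglik(distances, [1 / 150.0] * 3)
ll_bad = waiting_distance_loglik(distances, [1 / 15.0] * 3)
```

Under the same assumption, option (3) would correspond to adding the genealogy log-probabilities to this sum, so that a sampler evaluates both sources of information for each proposed set of species tree parameters.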
Files and variables
Archived GitHub project repository URL: https://github.com/eaton-lab/waiting-distance-code
Code/software
All code can be run after installing the Python package ipcoal (v.0.5) from conda-forge, which also installs all other required dependencies.
conda install ipcoal -c conda-forge
Access information
Other publicly accessible locations of the data:
Data was derived from the following sources:
- None
