Data for: bigrig: A range simulator for the DEC[+J] model

Bettisworth, Benjamin; Stamatakis, Alexandros

Published Mar 03, 2026 on Dryad. https://doi.org/10.5061/dryad.1jwstqk8t

Data files

Mar 03, 2026 version files (2.92 GB):

- bigrig-dryad.tar.gz (2.92 GB)
- README.md (9.82 KB)

Abstract

This archive contains the data and scripts used to produce the figures for the
paper "bigrig: A range simulator for the DEC+J model". The included scripts benchmark the simulation times of bigrig and BioGeoBEARS.

The dataset contains a large number of simulated geographic range alignments, which were used to evaluate the performance of these two tools. The simulated datasets range in size from 3 ranges and 8 taxa to 63 ranges and 65536 taxa. In addition to the simulated data, the archive contains execution timings and the scripts used to benchmark the two tools. The archive can serve as an example for further analysis of biogeographical tools, or as a source of diverse, high-quality biogeographic data.

Introduction

This dataset contains benchmarking data and scripts for comparing the
performance of bigrig, a range simulator for the DEC+J (Dispersal, Extinction,
Cladogenesis + Jump) biogeographic model, against BioGeoBEARS, a widely used
historical biogeography package.

Background

The DEC+J model is used in historical biogeography to reconstruct the
evolutionary history of species' geographical ranges, based on the phylogenetic
tree describing the evolutionary history of the relevant species. The DEC[+J]
model is, at the time of writing, the most popular model for historical
biogeography.

Directory Structure

The entire dataset is contained in the tarball bigrig-dryad.tar.gz. When
extracted, it has the following directory structure.

.
├── bigrig/                   # bigrig benchmarking data and scripts
│   ├── benchmarks/
│   │   ├── bigrig_times.csv
│   │   ├── results_bigrig.html
│   │   ├── with-both/        # Main timing experiment (random parameters)
│   │   │   ├── bigrig_times.csv
│   │   │   ├── results_bigrig.html
│   │   │   ├── figs/
│   │   │   │   ├── bigrig.times.boxplot.svg
│   │   │   │   └── bigrig.times.linreg.svg
│   │   │   └── <trees>/<ranges>/bigrig/
│   │   │       ├── config.yaml
│   │   │       ├── model.json
│   │   │       ├── bigrig.log
│   │   │       └── results.json.gz
│   │   └── static-model/     # Static model experiment (fixed parameters)
│   │       ├── bigrig_times.csv
│   │       ├── results_bigrig.html
│   │       ├── figs/
│   │       │   ├── bigrig.times.boxplot.svg
│   │       │   └── bigrig.times.linreg.svg
│   │       └── <trees>/<ranges>/bigrig/
│   │           ├── config.yaml
│   │           ├── model.json
│   │           ├── bigrig.log
│   │           └── results.json.gz
│   ├── notebooks/            # Jupyter notebooks for analysis
│   │   ├── bigrig.py.ipynb
│   │   ├── bigrig.r.ipynb
│   │   └── plots.py.ipynb
│   ├── utils/                # Utility scripts
│   ├── Snakefile             # Snakemake workflow
│   ├── pixi.toml             # Project dependencies
│   └── pixi.lock
├── biogeobears/
│   ├── results/              # Experimental results directory
│   ├── Snakefile             # Snakemake workflow
│   ├── run.R                 # BioGeoBEARS timing script
│   ├── install.R             # BioGeoBEARS installation script
│   ├── pixi.toml
│   └── pixi.lock
└── readme.md

File Descriptions

Both tools use pixi for sandboxing and for pinning dependency versions. Each of
the two directories, bigrig and biogeobears, therefore contains a pixi.toml and a
pixi.lock. These files specify the environment structure and the precise
package versions, respectively.

To deploy these environments, please see the section on reproducing results
below.

`bigrig/` directory

- benchmarks/with-both/: Timing experiments with randomized model parameters
  - Trees: 8 to 65,536 taxa (10 iterations each)
  - Range sizes: 3, 7, 15, 31, 63 areas
  - Model parameters:
    - Dispersion and extinction: drawn from a uniform distribution from 0.0 to
      1.0
    - Cladogenesis parameters:
      - Jumps disabled: allopatry, sympatry, and copy are all set to 1.0
      - Jumps enabled: allopatry, sympatry, copy, and jump are all set to 1.0
  - Each subdirectory contains:
    - config.yaml: The bigrig configuration file.
    - model.json: The specific model values used for the run.
    - bigrig.log: The recorded output from stdout.
    - results.json.gz: Results stored as a compressed JSON file.
- benchmarks/static-model/: Timing experiments with fixed model parameters
  - Trees: 8, 32, 128, 512, 2,048 taxa (10 iterations each)
  - Range sizes: 3, 7, 9 areas
  - Model parameters: dispersion and extinction are set to 0.1, and the
    cladogenesis parameters (sympatry, allopatry, copy, and jump) are set to 1.0.
- notebooks/: Jupyter notebooks for analyzing and visualizing results
- utils/: Python utilities for running benchmarks and analyzing data

At the level labeled <trees> in the directory structure, there is a newick tree
saved in tree.nwk. This tree is used for the configuration of all runs in that
directory.
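
The layout described above can be walked programmatically. The following is a minimal sketch, assuming the archive has been extracted and that each run lives at <trees>/<ranges>/bigrig/ below an experiment directory, as shown in the directory structure. The function name find_runs is hypothetical and is not part of the archive.

```python
from pathlib import Path


def find_runs(experiment_dir):
    """Yield (tree_file, results_file) pairs for one experiment directory.

    Assumes the layout <experiment>/<trees>/<ranges>/bigrig/results.json.gz,
    with tree.nwk stored at the <trees> level.
    """
    root = Path(experiment_dir)
    for results in sorted(root.glob("*/*/bigrig/results.json.gz")):
        # parents[0] is bigrig/, parents[1] is <ranges>/, parents[2] is <trees>/
        tree = results.parents[2] / "tree.nwk"
        yield tree, results
```

For example, iterating over find_runs("bigrig/benchmarks/with-both") would pair every results file with the tree shared by all runs in the same <trees> directory.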

For both experiments, the timing results are compiled into bigrig_times.csv,
which includes the time for each trial, as well as the model parameters
described above. In addition, there is a compiled notebook saved in
results_bigrig.html. The notebook contains two plots, a box plot of time vs
taxa and ranges, and a regression of time vs taxa broken out by the number of
ranges. Each of these plots is also saved in the figs/ directory as
bigrig.times.boxplot.svg and bigrig.times.linreg.svg, respectively.
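
A compiled timing file can also be summarized without the notebook. The sketch below groups trial times by taxa and range count and takes the median of each group. The column names taxa, ranges, and time are placeholders chosen for illustration and are not confirmed by this README; inspect the header of bigrig_times.csv and pass the actual names.

```python
import csv
from collections import defaultdict
from statistics import median


def median_times(csv_path, taxa_col="taxa", ranges_col="ranges", time_col="time"):
    """Return {(taxa, ranges): median time} for a timing CSV.

    The default column names are placeholders, not taken from the archive;
    check the CSV header and pass the real names.
    """
    groups = defaultdict(list)
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            key = (int(row[taxa_col]), int(row[ranges_col]))
            groups[key].append(float(row[time_col]))
    return {key: median(times) for key, times in groups.items()}
```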

The specification for each experiment is contained within Snakefile, a
Snakemake pipeline which actually performs the experiment. The Snakemake
pipeline needs configuration, which is specified by a YAML configuration file.
Specifically, the with-both experiment is specified by with-both.yaml and
the static-model experiment is specified by static-model.yaml.

`utils` directory

Contained within this directory are Python scripts which are used by the
Snakemake pipeline to parse program logs, generate configuration files, and
compute some statistics. Below is a detailed list of the files.

config.py: Python script with functions to generate configuration files for
bigrig, as well as other programs not utilized within this dataset.
graph.py / vector.py: Python scripts containing functions for computing a
distance between two biogeographic ranges. This metric is not used in this
dataset. For more information, please see Bettisworth 2023.
logs.py: Python script with functions to parse the results of bigrig and
other programs not used in this dataset.
util.py: Python script containing simple helper functions.

`notebooks` directory

This contains the notebook templates that are used by the Snakemake pipeline to
plot and summarize results. Of the notebook files here, only bigrig.r.ipynb is
used in this dataset. Please see above for the details.

Another notebook, plots.py.ipynb, is used in conjunction with Lagrange-NG
to summarize the results of an error metric. For more information, please see
Bettisworth 2023.

`results.json.gz`

The relevant keys in the results JSON are:

align: An object containing the range alignment which was simulated.
events: A list of dispersion or extinction events which occurred during the
simulation of the ranges.
regions: Number of regions to simulate.
root-range: The starting range.
splits: The result of cladogenesis simulations on the tree.
stats: Timings for the simulation, broken up into configuration time and
execution time. The total time, excluding the time to write out files, is
stored under time.
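
The keys listed above can be read with the Python standard library. This is a minimal sketch: the top-level key names come from the list above, while the exact nesting of the timing values (here assumed to be stats["time"]) is an assumption and should be checked against a real file. The function names are hypothetical.

```python
import gzip
import json


def load_results(path):
    """Load one results.json.gz file into a Python dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(results):
    """Pull out a few of the documented top-level keys.

    Assumption: the total time is stored as stats["time"]. Lookups use .get()
    so that a different nesting yields None instead of raising.
    """
    stats = results.get("stats", {})
    return {
        "regions": results.get("regions"),
        "root_range": results.get("root-range"),
        "n_events": len(results.get("events", [])),
        "total_time": stats.get("time") if isinstance(stats, dict) else None,
    }
```

Reading in text mode ("rt") lets json.load decode the stream directly, so the file never needs to be decompressed on disk.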

`biogeobears/` directory

This directory contains the timing experiments for BioGeoBEARS, which is used
for comparison with bigrig in the article. In addition, the directory contains a
pixi.toml file, containing commands which will install and run BioGeoBEARS.

To control R, the .Renviron and .Rprofile files set certain paths.
This ensures that the R instance installed by pixi:

Only installs libraries to the environment managed by pixi; and
Does not use system libraries stored in the user's profile, outside that environment.

Each directory in the experiment is of the form
<regions>_regions_<taxa>_taxa_<iter#>_iter, where:

<regions> is the number of regions
<taxa> is the number of taxa
<iter#> is the experiment replication number
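
The naming scheme above can be parsed with a regular expression. The sketch below assumes the three fields are plain decimal integers; the function name parse_run_dir is hypothetical.

```python
import re

# Matches names such as "9_regions_512_taxa_3_iter".
RUN_DIR = re.compile(
    r"^(?P<regions>\d+)_regions_(?P<taxa>\d+)_taxa_(?P<iteration>\d+)_iter$"
)


def parse_run_dir(name):
    """Split an experiment directory name into its integer fields."""
    match = RUN_DIR.match(name)
    if match is None:
        raise ValueError(f"not an experiment directory name: {name!r}")
    return {key: int(value) for key, value in match.groupdict().items()}
```

Under these assumptions, parse_run_dir("9_regions_512_taxa_3_iter") returns a dict with regions 9, taxa 512, and iteration 3.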

The files in each experiment directory are:

data.*.csv: Log files output by bigrig. The BioGeoBEARS experiment
script requires an existing alignment to seed the simulation, so bigrig is
used to generate the seed alignment.
times.csv: Contains the times, augmented with the region count, taxa count,
and iteration number. For most experiments, the time is in seconds. However,
for longer-running experiments, the time is in minutes. This is due to the
automatic rounding that R does when printing time. This only occurs with 9
regions and 512 or 2048 taxa.
tree.nwk: The tree used in this experiment.
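
Because the unit in times.csv depends on the experiment size, the times need to be normalized before they are compared. The following sketch applies the rule stated above, assuming that every recorded time for 9 regions with 512 or 2048 taxa is in minutes and all others are in seconds; this should be verified against the files themselves. The names used here are hypothetical.

```python
# (regions, taxa) combinations whose times are recorded in minutes.
MINUTE_CASES = {(9, 512), (9, 2048)}


def time_in_seconds(value, regions, taxa):
    """Convert a recorded time to seconds based on the experiment size."""
    if (regions, taxa) in MINUTE_CASES:
        return float(value) * 60.0
    return float(value)
```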

Model parameters used for all experiments are dispersion and extinction set to
0.1, and all cladogenesis parameters (allopatry, sympatry, copy, and jump) set
to 1.0.

Reproducing the Benchmarks

The only prerequisite is the pixi package manager, which is used to download
and install the other dependencies.

Running bigrig benchmarks

cd bigrig
pixi shell
# Update the binary path in the YAML config file, then run:
snakemake --snakefile Snakefile --configfile=benchmarks/with-both.yaml results_bigrig.html
snakemake --snakefile Snakefile --configfile=benchmarks/static-model.yaml results_bigrig.html

Running BioGeoBEARS benchmarks

cd biogeobears
pixi run install  # Install BioGeoBEARS
pixi run snakemake

Note: Update the BIGRIG_PATH variable in the Snakefile to point to the bigrig binary before running.

Related Publication

This dataset accompanies the paper titled "bigrig, a high-performance range
simulator for the DEC+J model".