Data and code from: Tensor cores unlock efficient and lower-energy massive parallelization on phylogenetic trees
Data files (Mar 18, 2026 version, 1.68 GB)
- beagle_tensor_cores_supplement_data.tar.gz (1.68 GB)
- README.md (3.77 KB)
Abstract
Massively parallel algorithms leveraging graphics processing units (GPUs) have significantly accelerated inference in statistical phylogenetics, with applications in understanding pathogen evolution, population dynamics, natural selection, and evolutionary timescales using ancient genomes. Continued advancements in GPU hardware necessitate innovative algorithms to fully exploit their potential. Here, we introduce three novel algorithms that accelerate matrix multiplication operations using tensor cores on NVIDIA GPUs to calculate the observed sequence data likelihood and the gradient of the log-likelihood with respect to branch-length-specific parameters under continuous-time Markov chain models of evolution. The algorithms presented in this paper deliver 2- to 3-fold gains in performance for amino acid and codon models compared to existing GPU-based massively parallel algorithms. Notably, these performance gains are accompanied by a ~2-fold reduction in energy usage, demonstrating the potential of these algorithms to lower the carbon footprint of evolutionary computing. We make our new algorithms available to the broader phylogenetics community through the high-performance, open-source library BEAGLE v4.0.0.
You can find the scripts and installation instructions for BEAGLE to replicate the benchmarking in the GitHub repository at https://github.com/suchard-group/beagle_tensor_cores_supplement. We provide the resulting logs and profiling information here.
File: beagle_tensor_cores_supplement_data.tar.gz
Time
We measure the time spent in each GPU kernel across 10 replicates for amino acid (aa/) and codon (codon/) substitution models with 1, 2, 4, 8, 16, ..., 32768 patterns.
For each replicate, we provide screen logs (*.txt) and the scripts (run.sh) used to generate the logs. We also provide *.nsys-rep reports that can be opened in NVIDIA Nsight Systems to view profiling information. For convenience, we extracted the time spent in each GPU kernel into .csv files in the timings folder. We calculate the speedup across the 10 replicates in ./scripts/plot_speedup.R using these .csv files. The resulting plot is at ./plots/speedup_aa_codon_a100_h100.pdf.
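For instance, a kernel-time summary can be regenerated from any of the provided reports with nsys stats, and the plot redrawn from the .csv files. A minimal sketch, assuming a recent Nsight Systems release (the report name cuda_gpu_kern_sum and the .nsys-rep file path are illustrative):
# Extract per-kernel GPU times from one report into kernel_times_cuda_gpu_kern_sum.csv
nsys stats --report cuda_gpu_kern_sum --format csv --output kernel_times aa/replicate_1.nsys-rep
# Redraw the speedup plot from the extracted .csv files
Rscript ./scripts/plot_speedup.R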
time
├── A100
│   └── benchmark_time
│       ├── aa
│       │   └── timings
│       └── codon
│           └── timings
└── H100
    └── benchmark_time
        ├── aa
        │   └── timings
        └── codon
            └── timings
Energy
We measure the energy spent on each GPU kernel for both the amino acid and codon substitution models across 100 replicates, with a randomized time delay of 2 to 10 seconds between each measurement. We measure the energy using the NVML API and report the result in the *.txt log files. We parse these log files and compare the energy using ./scripts/plot_energy.R. The resulting plot is at ./plots/energy_comparison.pdf.
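The energy figures in the logs can be roughly cross-checked against instantaneous board-power readings from nvidia-smi, which queries the same NVML counters. A sketch, sampling once per second:
# Print timestamped board power every second via the NVML-backed nvidia-smi;
# stop with Ctrl-C once the run of interest has finished
nvidia-smi --query-gpu=timestamp,power.draw --format=csv -l 1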
energy
├── A100
│   └── benchmark_energy
└── H100
    └── benchmark_energy
Examples
We provide the BEAST XML files, the corresponding log files (including screen logs, which also report the time taken for each analysis), and the MCC tree files for the examples in the manuscript. We render each tree using the ./scripts/render*mcc_tree.py scripts to produce the plots at ./plots/*_mcc.pdf.
examples
├── carnivores
└── dengue
Each example folder also contains a cuda_cores subfolder with the same XML and log files (including screen logs) for running the same analysis on CUDA cores.
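As a sketch, one of these analyses can be re-run through BEAGLE on the GPU as follows, assuming BEAST_JAR and LD_LIBRARY_PATH are set as in the installation steps below; the XML file name is illustrative, and -beagle_GPU is the standard BEAST 1.x flag for preferring a GPU BEAGLE resource:
# Hypothetical invocation: run one example XML on the GPU through BEAGLE
java -jar $BEAST_JAR -beagle_GPU carnivores.xml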
256 state
We measure the time taken by the post-order kernel for a phylogeographic problem with a state-space size of 256. We provide *.csv files reporting the time spent on each kernel. We parse these results using ./scripts/plot_speedup_256_state.R.
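For example, the plot can be regenerated from the archive root, assuming an R installation with the script's dependencies:
# Rebuild the 256-state timing plot from the provided .csv files
Rscript ./scripts/plot_speedup_256_state.R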
256_state
└── timings
Bank conflicts
We measure the time taken for the post- and pre-order kernels across five replicates on tensor cores, with and without a permuted shared-memory layout, and on CUDA cores. We provide *.qdrep files, which can also be opened in NVIDIA Nsight Systems to view profiling information. Additionally, for more detailed profiling of the GPU kernels, we provide *.ncu-rep files that can be opened using NVIDIA Nsight Compute. In particular, these reports can be used to see the number of memory bank conflicts under Details > Memory Workload Analysis. We parse these results using ./scripts/bank_conflicts.R.
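The same details page can also be printed without the GUI by importing a report on the command line. A sketch (the report file name is hypothetical, and page names may differ across Nsight Compute versions):
# Print the details page, including Memory Workload Analysis, of a collected report
ncu --import timings/post_order_tensor_cores.ncu-rep --page details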
bank_conflicts
└── timings
Here we provide system-level and kernel-level profiling information, benchmarks, BEAST XML files, and associated scripts to reproduce the energy and time measurements.
Instructions to reproduce benchmarks are below and also available at https://github.com/suchard-group/beagle_tensor_cores_supplement.
Please note that you will need nsys and ncu installed on the system to time and profile kernels, respectively. Both tools ship with the CUDA Toolkit and can also be installed separately from NVIDIA's Nsight Systems and Nsight Compute download pages.
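To confirm both profilers are on your PATH before running the benchmarks:
# Print the installed profiler versions
nsys --version
ncu --version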
This repository was used to benchmark kernels with the following versions:
- CUDA release 12.4, V12.4.131
- NVIDIA Nsight Systems version 2024.4.2.133-244234382004v0
- NVIDIA (R) Nsight Compute Command Line Profiler Version 2024.2.0.0 (build 34181891) (public-release)
Different versions of CUDA or of the profiling tools may require further edits to the commands in ./benchmark_time/aa/run.sh, ./benchmark_time/codon/run.sh, and ./benchmark_energy/run.sh.
Please see the instructions below to compile BEAGLE and install BEAST before reproducing the benchmarks.
Results: Benchmark kernel timings
Installation
BEAGLE
git clone -b tensor-cores https://github.com/beagle-dev/beagle-lib.git
cd beagle-lib/
mkdir build
cd build/
cmake -DBEAGLE_BENCHMARK_ENERGY=OFF -DBUILD_OPENCL=OFF -DBEAGLE_TENSOR_CORES=ON -DCMAKE_INSTALL_PREFIX=$HOME ..
make
make install
You can change the install location by adjusting CMAKE_INSTALL_PREFIX (set to $HOME above).
BEAST
git clone -b tensor-cores https://github.com/beast-dev/beast-mcmc.git
cd beast-mcmc/
ant
In ./benchmark_time/aa/run.sh and ./benchmark_time/codon/run.sh, set the BEAST_JAR and LD_LIBRARY_PATH variables to point to the BEAST JAR file and the BEAGLE library, respectively.
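For example, with BEAGLE installed under $HOME as above, the variables might look like the following; the JAR path is an assumption about where ant places its output, so adjust it to your tree:
# Illustrative values: BEAST_JAR must point at the jar produced by ant,
# and LD_LIBRARY_PATH at the directory containing the BEAGLE libraries
export BEAST_JAR=$HOME/beast-mcmc/build/dist/beast.jar
export LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH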
Run the following commands to time all the XMLs with an increasing number of patterns, across 10 replicates for both amino acid and codon models.
cd benchmark_time/
./run_all.sh
Benchmark energy consumption
Installation
BEAGLE
Please note that the cmake flag -DBEAGLE_BENCHMARK_ENERGY=ON differs from the installation above. Do not turn this flag ON when benchmarking timings, since measuring energy adds significant overhead.
git clone -b tensor-cores https://github.com/beagle-dev/beagle-lib.git
cd beagle-lib/
mkdir build
cd build/
cmake -DBEAGLE_BENCHMARK_ENERGY=ON -DBUILD_OPENCL=OFF -DBEAGLE_TENSOR_CORES=ON -DBEAGLE_DEBUG_SYNCH=ON ..
make
make install
You can set the install location using CMAKE_INSTALL_PREFIX.
BEAST
git clone -b tensor-cores https://github.com/beast-dev/beast-mcmc.git
cd beast-mcmc/
ant
In ./benchmark_energy/run.sh, set the BEAST_JAR and LD_LIBRARY_PATH variables to point to the BEAST JAR file and the BEAGLE library, respectively.
Run the following commands to measure energy consumption across 10 replicates, with a random time delay between consecutive iterations, for both amino acid and codon models:
cd benchmark_energy/
./run.sh carnivores_aa_hmc_patterns_4096.xml
./run.sh carnivores_codon_hmc_patterns_4096.xml
