Data and scripts for: Read length dominates phylogenetic placement accuracy of ancient DNA reads
Data files
May 09, 2025 version files (2.83 GB total):
- empirical_dataset_results.tar.xz (672.87 MB)
- pygargammel_logs.tar.gz (172.42 MB)
- README.md (10.04 KB)
- synthetic_dataset_results.tar.xz (1.15 GB)
- tool_parameter_exploration_results.tar.xz (829.58 MB)
Abstract
One of the central problems facing researchers who analyze ancient DNA (aDNA) is identifying the species from which the recovered aDNA originates. Prior analysis of aDNA data normally uses sequence matching tools (such as BLAST) to identify reads obtained from aDNA. However, as the source of aDNA is often a previously unsampled taxon, the taxon having gone extinct before the advent of modern sequencing technology, it is likely that there is no exact match in any database. As a consequence, tools such as BLAST are of limited use in placing a read in a phylogenetic context, i.e., identifying the likely source of a read on a phylogenetic tree. Phylogenetic placement is a technique in which a sequence or read is placed onto a specific branch of a phylogenetic tree. These tools offer the potential for much finer resolution when identifying reads. However, phylogenetic placement has primarily been used to place reads obtained from extant sources. Its applicability to aDNA data is complicated by the characteristic pattern of degradation that aDNA undergoes. This characteristic damage is generally not accounted for by popular phylogenetic placement tools, and as a consequence some authors have cast doubt on the potential accuracy of such tools. To understand how characteristic aDNA damage affects phylogenetic placement tools, we implemented a statistical model of aDNA damage as a tool, which we call PyGargammel, that takes sequences and applies damage characteristic of aDNA to them. We deploy PyGargammel, along with the existing phylogenetic placement assessment pipeline PEWO, on 7 empirical datasets. With this pipeline, we explore the parameter space of aDNA damage via a grid search in order to identify the factors of aDNA damage which are most impactful. We test 4 leading phylogenetic placement tools: APPLES, EPA-ng, pplacer, and RAPPAS.
We find that the frequency of DNA backbone nicks (and consequently read length) is the primary driver of error for aDNA reads. Additionally, we find that other factors, such as the rate of A to G misincorporations, have a negligible effect on the overall accuracy of phylogenetic placement tools.
https://doi.org/10.5061/dryad.4f4qrfjn3
This archive contains the plots, data files, result files, and scripts to
generate plots for the paper "Read Length Dominates Phylogenetic Placement
Accuracy of Ancient DNA Reads".
Description of the data and file structure
There are 5 archives:
- pygargammel_logs.tar.gz, which contains the logs from pygargammel. These logs contain the data used to make the plots relating to the fragment distributions.
- plot_and_analysis_scripts.tar.gz, which contains all the scripts and Python/R notebooks used to perform the analysis and plotting. The contents of this archive are detailed in the next section.
- empirical_dataset_results.tar.xz, which contains all the jplace files, the pruned trees, the generated reads, and the aligned sequences for the datasets labeled ds*
- synthetic_dataset_results.tar.xz, which contains all the jplace files, the pruned trees, the generated reads, and the aligned sequences for the datasets labeled sds*
- tool_parameter_exploration_results.tar.xz, which contains all the jplace files, the pruned trees, the generated reads, and the aligned sequences for the tool parameter exploration on ds06
Pipeline summary
Before describing the contents of each archive, it will be helpful to quickly
summarize the method by which the data in these archives was generated.
The results are generated via a Snakemake pipeline, which was originally
released under the name PEWO [1]. For the purposes of testing aDNA damage, we
modified the pipeline to include a damage phase. This modified version is
available on GitHub [2].
The goal of PEWO is to benchmark phylogenetic placement software for efficiency
and accuracy. The benchmarks are performed using a provided phylogenetic tree
and its associated Multiple Sequence Alignment (MSA). Using the provided tree,
PEWO produces artificial datasets by pruning a subtree from the provided tree
and removing the associated sequences from the MSA. The positions of the pruned
tips are recorded, and the associated sequences are placed into a new MSA file.
Together, the pruned tips and associated MSA are called a "pruning". Once
produced, PEWO then does some initial preprocessing, and executes the placement
software with the pruning data. The results from each program are then analyzed
to see how accurate they are, and the results are written to the CSV file
results.csv.
For this analysis, we benchmarked the phylogenetic placement tools epa-ng,
pplacer, rappas, and apples. The parameters for each tool, PEWO, and the
damage process are defined in the config.yaml file, which is passed to
Snakemake.
Summary of Tool Parameters
Snakemake encodes the parameters as part of the filename. Therefore, to
understand how the result files are generated, we must examine the filename.
Each parameter has a globally unique prefix, and each parameter value is
separated by an _. For a detailed explanation of the effects of each tool's
parameters, please refer to the tool's documentation.
PEWO
The two parameters for PEWO are the pruning number, and the read length. The
pruning number is the first number in the filename, and has an "empty" prefix
(that is, no prefix). The read length is set to 100000 for all runs, and is
prefixed by r.
pygargammel
pygargammel has 4 parameters:
- nf: The nick frequency,
- ov: The overhang parameter,
- ss: The single strand deamination rate,
- ds: The double strand deamination rate.
Each parameter is a real number between 0 and 1. The exact effect of each of
these parameters is described in the main paper.
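To illustrate why the nick frequency drives read length, the sketch below treats nicks as independent per-site events with probability nf, which makes fragment lengths approximately geometric with mean 1/nf. This is only a toy illustration of the idea, not PyGargammel's actual implementation, which is described in the paper.

```python
import random

def fragment_lengths(seq_len, nick_freq, seed=0):
    """Split a sequence of length seq_len at independently drawn nicks.

    Each internal position is nicked with probability nick_freq, so
    fragment lengths are approximately geometric with mean 1/nick_freq.
    Toy sketch only; not PyGargammel's actual damage model.
    """
    rng = random.Random(seed)
    lengths, start = [], 0
    for pos in range(1, seq_len):
        if rng.random() < nick_freq:
            lengths.append(pos - start)
            start = pos
    lengths.append(seq_len - start)
    return lengths

lens = fragment_lengths(100000, 0.01)
print(sum(lens) / len(lens))  # mean fragment length near 1/0.01 = 100
```

Doubling nf roughly halves the mean fragment length, which is the mechanism behind the paper's headline result that nick frequency (via read length) dominates placement error.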
apples
apples has 2 parameters:
- meth: The method for apples. Can take the values OLS, FM, or BE.
- crit: The criterion for apples. Can take the values MLSE, ME, or HYBRID.
epang
epang has one primary parameter: which heuristic to use for the analysis.
Once the heuristic is selected, there might be an additional parameter specific
to that heuristic.
- h: The heuristic to use. Possible values are 1 to 4.
- g: Specific to heuristic 1. Larger numbers are more accurate, but slower.
- G: Specific to heuristic 2. Larger numbers are more accurate, but slower.
pplacer
pplacer uses a "baseball" heuristic to accelerate the placement computation,
so all the parameters have baseball associated names.
- ms: Max strikes.
- sb: Strike box.
- mp: Max pitches.
rappas
- k: k-mer length.
- o: Omega (probability threshold).
- red: Reduction parameter (gap tolerance).
Example
Suppose that we have the file
3_r100000_nf0.0_ov1.0_ss0.65_ds0.0_h1_g0.99999_epang.jplace
Then we can read that this file was generated
- with pruning 3,
- with a read length of 100000,
- with a nick frequency of 0.0,
- with an overhang rate of 1.0,
- with a single strand deamination rate of 0.65,
- with a double strand deamination rate of 0.0,
- using epang with heuristic 1,
- and a g parameter of 0.99999.
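Decoding can be done mechanically by splitting the stem on underscores and matching the longest known prefix for each field. A minimal sketch, with the prefix table assembled from this README (not taken from the PEWO source):

```python
# Parameter prefixes as described in this README; the pruning number has
# no prefix, and the tool name is the final component before ".jplace".
PREFIXES = ("r", "nf", "ov", "ss", "ds", "h", "g", "G",
            "meth", "crit", "ms", "sb", "mp", "k", "o", "red")

def parse_result_filename(name):
    """Decode a PEWO result filename into a parameter dict (sketch)."""
    parts = name.rsplit(".", 1)[0].split("_")
    params = {"pruning": int(parts[0]), "tool": parts[-1]}
    for field in parts[1:-1]:
        # Longest matching prefix wins (e.g. "ov" beats "o", "red" beats "r").
        prefix = max((p for p in PREFIXES if field.startswith(p)), key=len)
        params[prefix] = field[len(prefix):]
    return params

print(parse_result_filename(
    "3_r100000_nf0.0_ov1.0_ss0.65_ds0.0_h1_g0.99999_epang.jplace"))
```

Values are left as strings, since whether a parameter is an integer (h) or a real number (nf, g) depends on the tool.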
Dataset Analysis Archives
The archives empirical_dataset_results.tar.xz,
synthetic_dataset_results.tar.xz, and
tool_parameter_exploration_results.tar.xz contain results from running PEWO
with damage enabled. All of the archives have the same structure, at the top
level are the dataset directories, and each dataset directory has the structure:
- configs
- data
- programs
- reads
- trees
- results.csv
configs
We used the damage-config.yaml configuration file to perform the damage
analysis for these datasets, and tool-params.yaml for the tool parameter
analysis. For each dataset, the configuration file will
contain the exact alignment and tree which was used for the benchmark.
data
This directory will contain the MSA and the phylogenetic tree in Newick format.
There might be several versions of the dataset. To determine which one was used
to produce the results for this dataset, please see the damage-config.yaml
file in the configs directory.
programs
This directory contains the intermediate results of most of the programs in the
pipeline. There are directories for each of the placement tools benchmarked:
apples, epang, pplacer, and rappas. In addition, there is a directory
for the alignment tool hmmer, which is used to realign the pruned sequences to
the reference alignment.
The files of interest in this directory are the jplace files. All other
intermediate files have been removed, in the interest of reducing the size of
the archive.
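jplace files are plain JSON (following the published jplace standard), so the best placement per read can be extracted with the standard library alone. A minimal sketch, assuming version-3 files that name each query with an "n" key (files using "nm" would need a small adjustment); the example document below is invented for illustration:

```python
import json

def best_placements(jplace_text):
    """Return {read name: edge number of the highest-LWR placement}."""
    doc = json.loads(jplace_text)
    fields = doc["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    best = {}
    for pq in doc["placements"]:
        # Each entry of "p" is a list ordered according to "fields".
        top = max(pq["p"], key=lambda p: p[lwr_i])
        for name in pq["n"]:
            best[name] = top[edge_i]
    return best

example = json.dumps({
    "version": 3,
    "tree": "((A:1{0},B:1{1}):1{2},C:1{3}){4};",
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[0, -100.0, 0.7, 0.1, 0.2], [2, -101.0, 0.3, 0.0, 0.2]],
         "n": ["read1"]},
    ],
})
print(best_placements(example))  # read1 is best placed on edge 0
```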
reads
This directory contains the reads which have been split by PEWO and damaged by
pygargammel. They are aligned relative to the reference MSA.
trees
This directory contains the trees after pruning. These trees are provided to
each placement tool to place the pruned sequences back onto the tree.
results.csv
This CSV file contains the results from all the analyses. It is delimited
by ;, and it contains the following columns:
- software: The placement tool that produced the placement.
- pruning: The pruning number for this placement.
- rstart, rend: The start and end of the "read". However, we ignore this and instead fragment the reads using an aDNA damage simulation.
- nd: Error of the placement in node distance.
- e_nd: Error of the placement in expected node distance. If the placement tool produces multiple placements, and weights these placements by LWR, then the expected node distance is the weighted average of the node distance for each placement, weighted by the LWR of that placement.
- r: Read length. This is not used, and so is set to 100000.
- nf, ov, ss, ds: aDNA damage parameters which were used to simulate the damaged reads.
The remaining columns are tool specific parameters, which are described above in
the pipeline summary.
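The expected node distance column is just an LWR-weighted average of per-placement node distances. A worked sketch (the placement values are invented for illustration; LWRs are renormalized in case a tool reports a truncated set whose weights do not sum to exactly 1):

```python
def expected_node_distance(placements):
    """LWR-weighted average of node-distance error for one read.

    `placements` is a list of (node_distance, lwr) pairs.
    """
    total = sum(lwr for _, lwr in placements)
    return sum(nd * lwr for nd, lwr in placements) / total

# Hypothetical read placed on three branches with LWRs 0.6, 0.3, 0.1:
# e_nd = 0*0.6 + 2*0.3 + 4*0.1 = 1.0
print(expected_node_distance([(0, 0.6), (2, 0.3), (4, 0.1)]))
```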
pygargammel_logs.tar.gz
This directory contains all the pygargammel logs from all the runs. The logs
are in a JSON format. Below is an example of the log.
[
  {
    "taxa": TAXA NAME,
    "fragments": [
      {
        "start": NUMBER,
        "end": NUMBER,
        "taxa": TAXA NAME,
        "damage": [
          {
            "index": NUMBER,
            "old": CHARACTER,
            "new": CHARACTER
          },
          ...
        ],
        "overhang_l_len": NUMBER,
        "overhang_r_len": NUMBER
      },
      ...
    ]
  },
  ...
]
- taxa: The taxa name.
- fragments: A list of fragments for the given sequence, identified by the taxa.
- start, end: The start and end positions of the fragment with respect to the sequence.
- overhang_l_len, overhang_r_len: The length of the overhangs, starting from the left and right, respectively.
- damage: A list of the damage (deamination) events for the fragment.
- index: Index of the damaged site, relative to the start of the fragment.
- old: The state before damage.
- new: The state after damage.
Code/Software
Code in this archive is primarily used to generate plots, or to generate the
support files needed to generate plots.
- compute-mean-gc.py is the script used to compute the GC% column in the dataset table.
- compute_read_stats.py produces the fragments.csv.gz file from the PyGargammel logs in results/*.
- fragment-stats.ipynb is a Python notebook which plots the read length and damage count histograms.
- import_data_csv.py produces allresults.csv.gz from program results in results/*.
- r-plots.ipynb is an R notebook with all the plotting code.
- pygargammel-verification.ipynb is a Python notebook performing various statistical tests verifying pygargammel.
In addition to the above scripts, there are also several miscellaneous helper
scripts which were used to generate intermediate files. They have names like
make_sql_db.py.
