Data and scripts for: Read length dominates phylogenetic placement accuracy of ancient DNA reads
Data files
May 09, 2025 version files (2.83 GB total):
- empirical_dataset_results.tar.xz (672.87 MB)
- pygargammel_logs.tar.gz (172.42 MB)
- README.md (10.04 KB)
- synthetic_dataset_results.tar.xz (1.15 GB)
- tool_parameter_exploration_results.tar.xz (829.58 MB)
Abstract
One of the central problems facing researchers who analyze ancient DNA (aDNA) is identifying the species from which the recovered aDNA originates. Prior analysis of aDNA data normally uses sequence matching tools (such as BLAST) to identify reads obtained from aDNA. However, as the source of aDNA is often a previously unsampled taxon, the taxon having gone extinct before the advent of modern sequencing technology, it is likely that there is no exact match in any database. As a consequence, tools such as BLAST are of limited use in placing a read in a phylogenetic context, i.e., identifying the likely source of a read on a phylogenetic tree. Phylogenetic placement is a technique in which a sequence or read is placed onto a specific branch of a phylogenetic tree. These tools offer the potential for much finer resolution when identifying reads. However, phylogenetic placement has primarily been used to place reads obtained from extant sources. Its applicability to aDNA data is complicated by the characteristic pattern of degradation that aDNA undergoes. This characteristic damage is generally not accounted for by popular phylogenetic placement tools, and as a consequence some authors have cast doubt on the potential accuracy of such tools. To understand how characteristic aDNA damage affects phylogenetic placement tools, we implemented a statistical model of aDNA damage as a tool, which we call PyGargammel, that takes sequences and applies damage characteristic of aDNA to them. We deploy PyGargammel, along with the existing phylogenetic placement assessment pipeline PEWO, on 7 empirical datasets. With this pipeline, we explore the parameter space of aDNA damage via a grid search in order to identify the factors of aDNA damage which are most impactful. We test 4 leading phylogenetic placement tools: APPLES, EPA-ng, pplacer, and RAPPAS.
We find that the frequency of DNA backbone nicks (and consequently read length) is the primary driver of error for aDNA reads. Additionally, we find that other factors, such as the rate of A to G misincorporations, have a negligible effect on the overall accuracy of phylogenetic placement tools.
https://doi.org/10.5061/dryad.4f4qrfjn3
This archive contains the plots, data files, result files, and scripts to
generate plots for the paper "Read Length Dominates Phylogenetic Placement
Accuracy of Ancient DNA Reads".
Description of the data and file structure
There are 5 archives:
- pygargammel_logs.tar.gz, which contains the logs from pygargammel. These logs contain the data used to make the plots relating to the fragment distributions.
- plot_and_analysis_scripts.tar.gz, which contains all the scripts and Python/R notebooks used to perform the analysis and plotting. The contents of this archive are detailed in the next section.
- empirical_dataset_results.tar.xz, which contains all the jplace files, the pruned trees, the generated reads, and the aligned sequences for the datasets labeled ds*
- synthetic_dataset_results.tar.xz, which contains all the jplace files, the pruned trees, the generated reads, and the aligned sequences for the datasets labeled sds*
- tool_parameter_exploration_results.tar.xz, which contains all the jplace files, the pruned trees, the generated reads, and the aligned sequences for the tool parameter exploration on ds06
Pipeline summary
Before describing the contents of each archive, it will be helpful to quickly
summarize the method by which the data in these archives was generated.
The results are generated via a Snakemake pipeline, which was originally
released under the name PEWO [1]. For the purposes of testing aDNA damage, we
modified the pipeline to include a damage phase. This modified version is
available on GitHub [2].
The goal of PEWO is to benchmark phylogenetic placement software for efficiency
and accuracy. The benchmarks are performed using a provided phylogenetic tree
and its associated Multiple Sequence Alignment (MSA). Using the provided tree,
PEWO produces artificial datasets by pruning a subtree from the provided tree
and removing the associated sequences from the MSA. The positions of the pruned
tips are recorded, and the associated sequences are placed into a new MSA file.
Together, the pruned tips and associated MSA are called a "pruning". Once
produced, PEWO then does some initial preprocessing, and executes the placement
software with the pruning data. The results from each program are then analyzed
to see how accurate they are, and the results are written to the CSV file
results.csv.
For this analysis, we benchmarked the phylogenetic placement tools epa-ng,
pplacer, rappas, and apples. The parameters for each tool, PEWO, and the
damage process are defined in the config.yaml file, which is passed to
Snakemake.
Summary of Tool Parameters
Snakemake encodes the parameters as part of the filename. Therefore, to
understand how the result files are generated, we must examine the filename.
Each parameter has a globally unique prefix, and each parameter value is
separated by an _. For a detailed explanation of the effects of each tool's
parameters, please refer to the tool's documentation.
PEWO
The two parameters for PEWO are the pruning number, and the read length. The
pruning number is the first number in the filename, and has an "empty" prefix
(that is, no prefix). The read length is set to 100000 for all runs, and is
prefixed by r.
pygargammel
pygargammel has 4 parameters:
- nf: The nick frequency,
- ov: The overhang parameter,
- ss: The single strand deamination rate,
- ds: The double strand deamination rate.
Each parameter is a real number between 0 and 1. The exact effect of each of
these parameters is described in the main paper.
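To illustrate why the nick frequency drives read length, the sketch below treats nicks as independent per-site events with probability nf, which makes fragment lengths approximately geometric with mean 1/nf. This is only a toy illustration of the idea, not PyGargammel's actual implementation, which is described in the paper.

```python
import random

def fragment_lengths(seq_len, nick_freq, seed=0):
    """Split a sequence of length seq_len at independently drawn nicks.

    Each internal position is nicked with probability nick_freq, so
    fragment lengths are approximately geometric with mean 1/nick_freq.
    Toy sketch only; not PyGargammel's actual damage model.
    """
    rng = random.Random(seed)
    lengths, start = [], 0
    for pos in range(1, seq_len):
        if rng.random() < nick_freq:
            lengths.append(pos - start)
            start = pos
    lengths.append(seq_len - start)
    return lengths

lens = fragment_lengths(100000, 0.01)
print(sum(lens) / len(lens))  # mean fragment length near 1/0.01 = 100
```

Doubling nf roughly halves the mean fragment length, which is the mechanism behind the paper's headline result that nick frequency (via read length) dominates placement error.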
apples
apples has 2 parameters:
- meth: The method for apples. Can take the values OLS, FM, or BE.
- crit: The criterion for apples. Can take the values MLSE, ME, or HYBRID.
epang
epang has one primary parameter: which heuristic to use for the analysis.
Once the heuristic is selected, there might be an additional parameter specific
to that heuristic.
- h: The heuristic to use. Possible values are 1 to 4.
- g: Specific to heuristic 1. Larger numbers are more accurate, but slower.
- G: Specific to heuristic 2. Larger numbers are more accurate, but slower.
pplacer
pplacer uses a "baseball" heuristic to accelerate the placement computation,
so all the parameters have baseball associated names.
- ms: Max strikes.
- sb: Strike box.
- mp: Max pitches.
rappas
- k: k-mer length.
- o: Omega (probability threshold).
- red: Reduction parameter (gap tolerance).
Example
Suppose that we have the file
3_r100000_nf0.0_ov1.0_ss0.65_ds0.0_h1_g0.99999_epang.jplace
Then we can read that this file was generated
- with pruning 3,
- with a read length of 100000,
- with a nick frequency of 0.0,
- with an overhang rate of 1.0,
- with a single strand deamination rate of 0.65,
- with a double strand deamination rate of 0.0,
- using epang with heuristic 1,
- and a g parameter of 0.99999.
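Decoding can be done mechanically by splitting the stem on underscores and matching the longest known prefix for each field. A minimal sketch, with the prefix table assembled from this README (not taken from the PEWO source):

```python
# Parameter prefixes as described in this README; the pruning number has
# no prefix, and the tool name is the final component before ".jplace".
PREFIXES = ("r", "nf", "ov", "ss", "ds", "h", "g", "G",
            "meth", "crit", "ms", "sb", "mp", "k", "o", "red")

def parse_result_filename(name):
    """Decode a PEWO result filename into a parameter dict (sketch)."""
    parts = name.rsplit(".", 1)[0].split("_")
    params = {"pruning": int(parts[0]), "tool": parts[-1]}
    for field in parts[1:-1]:
        # Longest matching prefix wins (e.g. "ov" beats "o", "red" beats "r").
        prefix = max((p for p in PREFIXES if field.startswith(p)), key=len)
        params[prefix] = field[len(prefix):]
    return params

print(parse_result_filename(
    "3_r100000_nf0.0_ov1.0_ss0.65_ds0.0_h1_g0.99999_epang.jplace"))
```

Values are left as strings, since whether a parameter is an integer (h) or a real number (nf, g) depends on the tool.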
Dataset Analysis Archives
The archives empirical_dataset_results.tar.xz,
synthetic_dataset_results.tar.xz, and
tool_parameter_exploration_results.tar.xz contain results from running PEWO
with damage enabled. All of the archives have the same structure, at the top
level are the dataset directories, and each dataset directory has the structure:
- configs
- data
- programs
- reads
- trees
- results.csv
configs
We used the damage-config.yaml configuration file to perform the damage
analysis for these datasets, and tool-params.yaml for the tool parameter
analysis. For each dataset, the configuration file will
contain the exact alignment and tree which was used for the benchmark.
data
This directory will contain the MSA and the phylogenetic tree in Newick format.
There might be several versions of the dataset. To determine which one was used
to produce the results for this dataset, please see the damage-config.yaml
file in the configs directory.
programs
This directory contains the intermediate results of most of the programs in the
pipeline. There are directories for each of the placement tools benchmarked:
apples, epang, pplacer, and rappas. In addition, there is a directory
for the alignment tool hmmer, which is used to realign the pruned sequences to
the reference alignment.
The files of interest in this directory are the jplace files. All other
intermediate files have been removed, in the interest of reducing the size of
the archive.
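jplace files are plain JSON (following the published jplace standard), so the best placement per read can be extracted with the standard library alone. A minimal sketch, assuming version-3 files that name each query with an "n" key (files using "nm" would need a small adjustment); the example document below is invented for illustration:

```python
import json

def best_placements(jplace_text):
    """Return {read name: edge number of the highest-LWR placement}."""
    doc = json.loads(jplace_text)
    fields = doc["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    best = {}
    for pq in doc["placements"]:
        # Each entry of "p" is a list ordered according to "fields".
        top = max(pq["p"], key=lambda p: p[lwr_i])
        for name in pq["n"]:
            best[name] = top[edge_i]
    return best

example = json.dumps({
    "version": 3,
    "tree": "((A:1{0},B:1{1}):1{2},C:1{3}){4};",
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[0, -100.0, 0.7, 0.1, 0.2], [2, -101.0, 0.3, 0.0, 0.2]],
         "n": ["read1"]},
    ],
})
print(best_placements(example))  # read1 is best placed on edge 0
```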
reads
This directory contains the reads which have been split by PEWO and damaged by
pygargammel. They are aligned relative to the reference MSA.
trees
This directory contains the trees after pruning. These trees are provided to
each placement tool to place the pruned sequences back onto the tree.
results.csv
This CSV file contains the results from all the analyses. It is delimited
by ;, and it contains the following columns:
- software: The placement tool that produced the placement.
- pruning: The pruning number for this placement.
- rstart, rend: The start and end of the "read". However, we ignore this and instead fragment the reads using an aDNA damage simulation.
- nd: Error of the placement in node distance.
- e_nd: Error of the placement in expected node distance. If the placement tool produces multiple placements, and weights these placements by LWR, then the expected node distance is the weighted average of the node distance for each placement, weighted by the LWR of that placement.
- r: Read length. This is not used, and so is set to 100000.
- nf, ov, ss, ds: aDNA damage parameters which were used to simulate the damaged reads.
The remaining columns are tool specific parameters, which are described above in
the pipeline summary.
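The expected node distance column is just an LWR-weighted average of per-placement node distances. A worked sketch (the placement values are invented for illustration; LWRs are renormalized in case a tool reports a truncated set whose weights do not sum to exactly 1):

```python
def expected_node_distance(placements):
    """LWR-weighted average of node-distance error for one read.

    `placements` is a list of (node_distance, lwr) pairs.
    """
    total = sum(lwr for _, lwr in placements)
    return sum(nd * lwr for nd, lwr in placements) / total

# Hypothetical read placed on three branches with LWRs 0.6, 0.3, 0.1:
# e_nd = 0*0.6 + 2*0.3 + 4*0.1 = 1.0
print(expected_node_distance([(0, 0.6), (2, 0.3), (4, 0.1)]))
```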
pygargammel_logs.tar.gz
This directory contains all the pygargammel logs from all the runs. The logs
are in a JSON format. Below is an example of the log.
[
  {
    "taxa": TAXA NAME,
    "fragments": [
      {
        "start": NUMBER,
        "end": NUMBER,
        "taxa": TAXA NAME,
        "damage": [
          {
            "index": NUMBER,
            "old": CHARACTER,
            "new": CHARACTER
          },
          ...
        ],
        "overhang_l_len": NUMBER,
        "overhang_r_len": NUMBER
      },
      ...
    ]
  },
  ...
]
- taxa: The taxa name.
- fragments: A list of fragments for the given sequence, identified by the taxa.
- start, end: The start and end positions of the fragment with respect to the sequence.
- overhang_l_len, overhang_r_len: The length of the overhangs, starting from the left and right, respectively.
- damage: A list of the damage (deamination) events for the fragment.
- index: Index of the damaged site, relative to the start of the fragment.
- old: The state before damage.
- new: The state after damage.
Code/Software
Code in this archive is primarily used to generate plots, or to generate the
support files needed to generate plots.
- compute-mean-gc.py is the script used to compute the GC% column in the dataset table.
- compute_read_stats.py produces the fragments.csv.gz file from the PyGargammel logs in results/*.
- fragment-stats.ipynb is a Python notebook which plots the read length and damage count histograms.
- import_data_csv.py produces allresults.csv.gz from program results in results/*.
- r-plots.ipynb is an R notebook with all the plotting code.
- pygargammel-verification.ipynb is a Python notebook performing various statistical tests verifying pygargammel.
In addition to the above scripts, there are also several miscellaneous helper
scripts which were used to generate intermediate files. They have names like
make_sql_db.py.
