Data from: krepp: A k-mer-based maximum pseudo-likelihood method for estimating read distances and genome-wide phylogenetic placement

Şapcı, Ali Osman Berk 1 ; Mirarab, Siavash 1

Research facility: San Diego Supercomputer Center

Published Feb 03, 2026 on Dryad. https://doi.org/10.5061/dryad.63xsj3vd3

Abstract

Comparing each sequencing read in a sample to a reference database is a fundamental step in wide-ranging applications. The results of these comparisons can facilitate phylogenetic characterization. However, phylogenetic placement is currently only possible at scale for marker genes, a small fraction of the genome. We introduce krepp, an alignment-free k-mer-based method that enables placing reads from anywhere on the genome on an ultra-large reference phylogeny (e.g., 123,853 leaves). This repository contains data from benchmarking experiments in which we show the scalability and accuracy of krepp. We also demonstrate the ability of our method to compare and characterize real metagenomic samples.

DOI: 10.5061/dryad.63xsj3vd3

Datasets and queries analyzed to benchmark krepp, together with results and evaluation metrics.
The preprint is available at bioRxiv.

The source code can be found on GitHub (github.com/bo1929/krepp).
The version used in the manuscript is v0.4.5 (which is also available on Zenodo).
All results, auxiliary data, and scripts used in the analyses can be found at github.com/bo1929/shared.krepp.
For reference indexes, refer to:

Please refer to this repository for analysis of the data presented here. We note that the placement error is measured either in the number of edges or the branch length difference. We measure the distances in normalized Hamming distance and in ANI for genome-wide comparison. For QIIME archives, please refer to relevant documentation.
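
As a minimal illustration of the two distance notions mentioned above, the sketch below computes a normalized Hamming distance between two equal-length sequences and converts an ANI value to a distance (1-ANI). The function names are hypothetical, and the assumption that ANI is given as a percentage is ours; check the individual files before applying it. This is not krepp's estimator, only the definition of the quantities being reported.

```python
def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def ani_to_distance(ani_percent: float) -> float:
    """Convert ANI (assumed to be a percentage, 0-100) to a 1-ANI distance."""
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI is expected as a percentage between 0 and 100")
    return 1.0 - ani_percent / 100.0
```

For example, two 8-mers that differ at two positions have a normalized Hamming distance of 0.25.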

Access information

Other relevant data can be accessed as listed below:

29 bacterial single-cell assembled genomes from the GORG dataset are available at NCBI under BioProject ID GenBank: PRJEB33281 (https://doi.org/10.1016/j.cell.2019.11.01).
The subsampled whole-genome sequencing reads for human microbiome analysis and feature tables for Woltka were derived from Zhu et al. (https://doi.org/10.1128/msystems.00167-22).
The raw sequencing data for Earth's microbiome analysis from Shaffer et al. (https://doi.org/10.1038/s41564-022-01266-x) is available at www.qiita.ucsd.edu (study: 13114).
The Web of Life databases can be accessed at https://biocore.github.io/wol/ (version 1) and http://ftp.microbio.me/pub/wol2/ (version 2).
Reference genomes (a RefSeq snapshot as of January 8, 2019) and the taxonomy for CAMI-II experiments are available at https://cami-challenge.org/reference-databases.
Contigs analyzed from the gold standard assembly for taxonomic binning experiments can be found at http://frl.publisso.de/data/frl:6425521/marine/ and http://frl.publisso.de/data/frl:6425521/strain/ for the marine and strain-madness datasets, respectively.
CAMI-II results for other methods submitted to the challenge are available at https://github.com/CAMI-challenge/second_challenge_evaluation.
Scripts for postprocessing results, computing evaluation metrics, and generating figures, together with the results, can also be found at https://github.com/bo1929/shared.krepp.

Description of the data and file structure

Files and variables

`backbone_trees/`

backbone-*.nwk: WoL-v1 and WoL-v2 backbone trees with 10,575 and 15,953 references, respectively
ladderized_tree-*.nwk: A caterpillar tree for WoL-v2 and WoL-v1 references
random_tree-*.nwk: A tree simulated with a dual-birth model for WoL references
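
A quick sanity check after downloading a tree is to confirm that it has the expected number of leaves (10,575 for WoL-v1 and 15,953 for WoL-v2, as stated above). The sketch below is a dependency-free shortcut rather than a full Newick parser: it relies on the fact that a tree with L leaves contains exactly L-1 commas, and it assumes that no label is quoted or contains a comma. The function name is hypothetical.

```python
def count_leaves(newick: str) -> int:
    """Count leaves in a Newick string as (number of commas + 1).

    Assumes a single tree terminated by ';' and no quoted labels
    containing commas; use a real Newick parser otherwise.
    """
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("expected a Newick string terminated by ';'")
    return text.count(",") + 1


# Usage with a downloaded file (the path is a placeholder):
# with open("backbone_trees/<tree file>.nwk") as handle:
#     print(count_leaves(handle.read()))
```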

`microbiome_metadata/`

hmi_retained_samples.tsv: Sample IDs used in human microbiome analysis.
metadata_emp-*.tsv: Metadata (sample ID and environment label at a specific EMPO level) of Earth's Microbiome Project according to EMPO (version 1 and 2)
qiime2_hmi_metadata.tsv: Metadata of human microbiome samples (body sites and subject genders)

`queries-WoLv1_16S_comparison/`

all_mindist.tsv: Branch lengths to the closest reference (scaled by 100)
dist_to_closest.txt: 1-ANI to the closest reference for each query (with IDs)
download.list: Download links to all WoL-v1 reference genomes
errors16S-EPAng.txt: Edge errors for each 16S read placement across different query genomes
metadata.tsv: Metadata for WoL-v1 references
query16S/queries16S_all.txt: Query genomes together with distances (branch length) to the closest reference
query16S/query_list.txt: All IDs of query genomes
query16S/simulate_reads.sh: Commands used to simulate reads
ranks.tid.tsv: Taxonomic ID labels of WoL-v1 reference genomes
ranks.tsv: Taxonomic labels of WoL-v1 reference genomes
read_id_mapping-queries16S.tsv.xz: Genome-wide read ID to query genome mapping for queries
read_id_mapping-reads16S.tsv.xz: 16S read ID to query genome mapping for queries
reads16S.tar.gz: 16S reads for all query genomes
reference_list.txt: The list of references from WoL-v1

`queries-WoLv2_placement_comparison/`

all.map: Genome ID to accession ID mapping
all_seq-WoLv2.map: Genome ID to accession ID mapping
assembly_info-WoLv2.tsv: Metadata for WoL-v2 genomes
downloads.txt: URLs to download WoL-v2 genomes
query_selection/cluster_varying_threshold.sh: Script used to cluster references for selection
query_selection/get_common_singletons.sh: Script used to find the intersection of singleton genomes in clusters
query_selection/read_id_mapping.tsv.xz: Genome ID to read ID mapping
query_selection/selected_queries-mindist.tsv: Selected queries and their minimum distance to the closest reference
query_selection/selected_queries.txt: The list of queries selected
query_selection/selected_queries_levels.tsv: Selected queries and the TreeCluster threshold that they correspond to
query_selection/simulate_reads.sh: Script used to simulate reads
query_selection/singletons-*.tsv: List of singleton references at each clustering threshold
query_selection/tree_clustering-t*.tsv: Result of TreeCluster at each threshold
reference_list.txt: List of all reference genomes used in WoL-v2
taxonomy_WoLv2.tsv: Taxonomic labels of genomes from WoL-v2

`query_reference_pairs-WoLv2-ge20p/`

all_distances-mash.tsv: All pairwise 1-ANI values according to Mash
all_distances-orthoANI.tsv: All pairwise 1-ANI values according to orthoANI
all_distances-skani.tsv: All pairwise 1-ANI values according to skani
all_genomes.tsv: All references with a sufficiently high number of mapped reads
all_pairs.tsv: All pairs with sufficient mapping rate
common_distances-mash.tsv: Distances for common pairs with >=20% mapping rate according to Mash
common_distances-orthoANI.tsv: Distances for common pairs with >=20% mapping rate according to orthoANI
common_distances-skani.tsv: Distances for common pairs with >=20% mapping rate according to skani
common_pairs.tsv: Pairs that have >=20% reads mapped for both bowtie2 and krepp
logs.tar.gz: Log files for ANI estimation
mash_estimate.sh: Command for estimating ANI using Mash
orthoANI_distances.tar.gz: Output of orthoANI for different query/reference pairs
orthoANI_estimate.sh: Command for estimating ANI using orthoANI
skani_distances.tar.gz: Output of skani for different query/reference pairs
skani_estimate.sh: Command for estimating ANI using skani

`expt-simulated_genomes/`

build_libraries.sh: Commands used to build a krepp index for each reference genome
estimate_distances.sh: Commands used to estimate distances from mutated genomes to the reference genomes
evaluate_distances.sh: Commands used to evaluate the accuracy of estimates
genomes_fasta_paths.txt: Paths of reference genomes used to generate mutated genomes
mutated_fasta_paths.txt: Paths of mutated genomes with their corresponding distance configuration
postprocess_for_eval.py: Script used to evaluate the accuracy of krepp estimates
simulated_reads_paths.txt: Paths of reads simulated from mutated genomes
distance_estimates/${GENOME_ID}_a${ALPHA}_gnd${DISTANCE}-distances.tsv: Resulting distance estimates for reads simulated from a genome mutated from $GENOME_ID with $DISTANCE and $ALPHA
distance_evaluation/${GENOME_ID}_a${ALPHA}_gnd${DISTANCE}-results.tsv: Accuracy of distance estimates for reads simulated from a genome mutated from $GENOME_ID with $DISTANCE and $ALPHA
distance_evaluation/all_simulations-results.tsv: All results concatenated
simulated_genomes/${GENOME_ID}/${GENOME_ID}_contigs.fasta: Contigs (nucleic acids) of the original assembled reference genome
simulated_genomes/${GENOME_ID}/${GENOME_ID}_contigs_genes.faa: Genes (amino acids) of the original assembled reference genome
simulated_genomes/${GENOME_ID}/g_weights_a22.npy: Weights used to drop mutations with alpha=22
simulated_genomes/${GENOME_ID}/g_weights_a5.npy: Weights used to drop mutations with alpha=5
simulated_genomes/${GENOME_ID}/input_map.tsv: The path to the reference to index
simulated_genomes/${GENOME_ID}/mutated_BLOSUM62_${GENOME_ID}_a${ALPHA}_gnd${DISTANCE}/mutated_genes.faa: Genes for the mutated genome with $DISTANCE and $ALPHA
simulated_genomes/${GENOME_ID}/mutated_BLOSUM62_${GENOME_ID}_a${ALPHA}_gnd${DISTANCE}/mutated.fasta: Sequences for the mutated genome with $DISTANCE and $ALPHA
simulated_genomes/${GENOME_ID}/mutated_BLOSUM62_${GENOME_ID}_gnd${DISTANCE}/gnd_aad.txt: The true distance between the reference and the mutated genome
simulated_reads/${GENOME_ID}_a${ALPHA}_gnd${DISTANCE}_errFree.fasta: Reads simulated from the mutated genome
simulated_reads/${GENOME_ID}_a${ALPHA}_gnd${DISTANCE}.aln: Coordinates of the simulated reads
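
Because the genome ID, alpha, and distance are encoded in the file names above, it can be convenient to recover them programmatically when iterating over results. The following sketch assumes that the alpha and distance fields contain no underscores and keeps them as strings, since their exact formatting is not specified here. The function name, the `SimulationKey` type, and the genome ID in the usage note are hypothetical.

```python
import re
from typing import NamedTuple


class SimulationKey(NamedTuple):
    genome_id: str
    alpha: str
    distance: str


# File-name endings used in expt-simulated_genomes/, per the list above.
SUFFIXES = ("-distances.tsv", "-results.tsv", "_errFree.fasta", ".aln")
PATTERN = re.compile(r"(?P<genome_id>.+)_a(?P<alpha>[^_]+)_gnd(?P<distance>[^_]+)")


def parse_simulation_name(filename: str) -> SimulationKey:
    """Split '<genome>_a<alpha>_gnd<distance><suffix>' into its fields."""
    stem = filename
    for suffix in SUFFIXES:
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    match = PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    return SimulationKey(match["genome_id"], match["alpha"], match["distance"])
```

For instance, `parse_simulation_name("G000123_a5_gnd0.05-distances.tsv")` yields the fields `G000123`, `5`, and `0.05`, where `G000123` is a made-up genome ID.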

`hmi_reads/`

SR${SAMPLE_ID}.${PAIR_END}.fq: 1M subsampled paired-end ($PAIR_END) short reads for sample $SAMPLE_ID

`match_stats-krepp_dth4-all.tsv.gz`

Results for the proportions of reads with at least one match up to the Hamming distance threshold (three columns: genome ID, read ID, and the minimum match HD).
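
The three-column layout described above lends itself to a simple cumulative summary: the fraction of listed reads whose minimum match Hamming distance is at most each threshold. The sketch below makes several assumptions that should be verified against the file: the columns are tab-separated with no header row, each row corresponds to one read, the third column is an integer, and 4 is the largest threshold (inferred from `dth4` in the file name). The function name is hypothetical.

```python
import csv
import gzip
from collections import Counter


def cumulative_match_fractions(rows, max_hd=4):
    """Return {threshold: fraction of rows with minimum match HD <= threshold}.

    `rows` is an iterable of (genome ID, read ID, minimum match HD) records.
    """
    counts = Counter()
    total = 0
    for _genome_id, _read_id, min_hd in rows:
        counts[int(min_hd)] += 1
        total += 1
    if total == 0:
        return {}
    fractions = {}
    running = 0
    for hd in range(max_hd + 1):
        running += counts[hd]
        fractions[hd] = running / total
    return fractions


# Usage (assumes a header-less, tab-separated file):
# with gzip.open("match_stats-krepp_dth4-all.tsv.gz", "rt", newline="") as handle:
#     print(cumulative_match_fractions(csv.reader(handle, delimiter="\t")))
```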

`expt-alignment_comparison`

extra_queries_genomes/${GENOME_ID}.fna: Selected genomes from which reads were simulated for distance benchmarking
alignment_reports-WoLv2/${GENOME_ID}.sam: Resulting SAM files obtained using bowtie2
alignment_reports_postprocessed-WoLv2/${GENOME_ID}.tsv: Postprocessed SAM files with alignment distances for aligned reads
distance_reports-WoLv2/${GENOME_ID}.tsv: Distance estimates of krepp
distance_reports_postprocessed-WoLv2/${GENOME_ID}.tsv: Postprocessed version of krepp distance estimates used to compute evaluation metrics
summary_reports-WoLv2/count_summary-${GENOME_ID}.csv: Evaluation metrics for distances per query genome summarizing all reads
summary_reports-WoLv2/read_summary-${GENOME_ID}.csv: Evaluation metrics for distances per query genome for each read
BOWTIE_VERSION: Bowtie2 version information used in the benchmarking.
mash_estimate.sh: Script for pairwise ANI estimations using Mash
build_library.sh: Script for bowtie2 indexing of WoL databases
align_queries.sh: Script for bowtie2 alignment.
estimate_queries.sh: Script used for krepp (alpha-version) distance estimation
postprocess_results.sh: Script used for postprocessing of results for computing metrics (see shared.krepp for each Python script)
logs-WoLv2/${JOB_NAME}.out: Log files for stdout
logs-WoLv2/${JOB_NAME}.err: Log files for stderr
novel_queries_sampreads/${GENOME_ID}.fq: Reads simulated using ART for novel query genomes
selected_queries_sampreads/${GENOME_ID}.fq: Reads simulated using ART for selected query genomes

`resource_usage_benchmarking`

backbone-WoLv1*.nwk: Trees in Newick format for the WoLv1 subsamples (w/ 2000 and 5000 references).
benchmark-WoL*.sh: Scripts used for benchmarking across indexes with varying sizes.
build_library*.sh: Scripts used to create Bowtie2 and krepp indexes for reference subsets.
concat-WoLv*: Concatenated FASTA file for reference genome subsets.
db-WoLv*: Bowtie2 indexes.
input_map-WoL*: The list of references.
library-WoLv1_sampled*: krepp indexes for WoLv1 subsets.
logs/bowtie2*: Logs for Bowtie2 benchmarking.
logs/krepp*: Logs for krepp benchmarking.
logs/resource_benchmarking*: Logs for resource usage comparison.
ref_ids-WoLv1*: IDs of references.
resource_benchmarking.tsv: Benchmarking results (running time and memory) for both querying and indexing.
selected_queries_concat*: Simulated input reads used for benchmarking.

All results and metrics combined

`all_results_combined/expt-hmi/`

hmi-woltka-WoLv2: Human microbiome results for Woltka OGUs using WoL-v2 reference.
hmi-bracken-WoLv2: Human microbiome results for Bracken profiles using WoL-v2 reference.
hmi-ogu-RefSeqCIIdup: Human microbiome results for krepp OGUs using duplicated uDance tree as reference.
hmi-pp-RefSeqCIIdup: Human microbiome results for krepp placements using duplicated uDance tree as reference.
hmi-ogu-WoLv2: Human microbiome results for krepp OGUs using duplicated WoL-v2 reference.
microbiome_metadata.tsv: Metadata of samples.
hmi-pp-RefSeqCII: Human microbiome results for krepp placements using deduplicated uDance tree as reference.
hmi_separation_summary.csv: Summary for separation statistics (pseudo-F) across all methods/references.
hmi-ogu-RefSeqCII: Human microbiome results for krepp OGUs using deduplicated uDance tree as reference.
hmi-pp-WoLv2: Human microbiome results for krepp placements using duplicated WoL-v2 reference.
hmi-woltka-WoLv2-v0: Human microbiome results for Woltka OGU using WoL-v2 as reference.

`all_results_combined/alignment_comparison/`

count_summary-all-WoLv2-alignment_comparison.csv: Number of matches per query read.
read_summary-WoLv2-alignment_comparison-1M.csv: Distance benchmarking against alignment, metrics per read (1M subsample) for all queries.
read_summary-WoLv2-alignment_comparison.csv: Distance benchmarking against alignment, metrics per read for all queries.
resource_benchmarking.tsv: Running time and memory use (bowtie2 and krepp).
dist_to_closest-final.tsv: Novelty values for queries, measured by Mash.
reference_summary-all-WoLv2-alignment_comparison.csv: Distance benchmarking against alignment summarizing reads per query.

`all_results_combined/expt-appspam_comparison/`

expt-Bartonella_50/*: App-SpaM comparison on Bartonella.
expt-Mycobacterium_40/*: App-SpaM comparison on Mycobacterium.
expt-Rhizobiaceae_50/*: App-SpaM comparison on Rhizobiaceae.
expt-Piscirickettsiaceae_40/*: App-SpaM comparison on Piscirickettsiaceae.
expt-Moraxella_40/*: App-SpaM comparison on Moraxella.
expt-Bacteroides_40/*: App-SpaM comparison on Bacteroides.
appspam_comparison.tsv: App-SpaM comparison combining all results.
**/appspam_eval_metrics*.tsv: App-SpaM results across reads, with and without cp (the default is without).
**/krepp_eval_metrics*.tsv: krepp results across all reads for different configurations, in particular HD threshold.

`all_results_combined/expt-emp/`

emp_separation_summary-v*.tsv: Separation results across EMPO levels (v1 and v2).
emp-pp-WoL*: krepp placement results and BIOM tables on Earth's microbiome for WoL-v1/v2.
emp-ogu-WoL*: krepp OGU results and BIOM tables on Earth's microbiome for WoL-v1/v2.
metadata_emp-v*.tsv: Sample labels across EMPO levels (v1 and v2).
In the subfolders, files with *.qza and *.qzv extensions are QIIME 2 zipped artifacts. Please refer to the QIIME 2 documentation for details and for how to export them to other data formats.
In these filenames, empo_${LEVEL} stands for the level of the EMP ontology for which the categories are considered for beta significance calculations.
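
Since .qza and .qzv files are ordinary zip archives, their payload can be inspected without a QIIME 2 installation. The sketch below lists the files stored under the archive's top-level `data/` directory, assuming the usual `<uuid>/data/...` layout of QIIME 2 archives; the function name is hypothetical. The supported route remains `qiime tools export`, as described in the QIIME 2 documentation.

```python
import zipfile


def list_artifact_data(path_or_file):
    """List payload files inside a QIIME 2 archive (.qza or .qzv).

    Assumes entries are laid out as '<uuid>/data/<file>'; anything outside
    the top-level data/ directory (metadata, provenance) is skipped.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        members = []
        for name in archive.namelist():
            parts = name.split("/")
            if len(parts) >= 3 and parts[1] == "data" and not name.endswith("/"):
                members.append(name)
        return sorted(members)


# Usage (the path is a placeholder):
# print(list_artifact_data("path/to/artifact.qza"))
```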

`all_results_combined/algorithmic_evaluation/`

multitree_heights_info-WoLv2.tsv: Heights of the nodes of the multitree.
out_degrees.tsv: Out-degrees of the nodes of the multitree.
match_stats-krepp_dth4-all.tsv: The number of matches per read at each HD threshold.
all_simulations-results.tsv: Evaluation metrics for distance benchmarking with simulated genomes.
clade_versus_multitree_sizes.tsv: Information for colors corresponding to tree nodes/clades.
color_graph_stats.tsv: Statistics for the color multitree (including degrees) and maximal clades.
index_info-WoLv2.tsv: Details of the color multitree (including the number of k-mers).
postorder_sizes.tsv: Size of the color multitree during the post-order traversal of the tree.

`all_results_combined/cami-ii/`

See amber_strain_madness_contigs-min_dist and amber_marine_contigs-min_dist for results used in the paper (computed using AMBER).
Please refer to the AMBER documentation for descriptions of these files and the metrics reported.

`all_results_combined/placement_comparison`

ppmetrics-heuristic_comparison.tsv: Placement metrics (edge errors) on WoLv2 comparing krepp-closest, krepp-LCA, krepp and bowtie2-closest.
ppmetrics-bowtie-closestpp.tsv: Placement metrics (edge errors) only for bowtie2-closest.
ppmetrics_queries16S-all.tsv: Placement metrics (edge errors) for queries used in 16S marker placement benchmarking.
ppmetrics_reads16S-all.tsv: Placement metrics (edge errors) for 16S reads placed using krepp (not used in the paper).

Files that are missing in the `shared.krepp` GitHub repository:

`misc_data/`

data-WoLv2_placement/query_selection/read_id_mapping.tsv.xz: Mapping from read IDs to genomes.
data-WoLv1_placement/read_id_mapping-reads16S.tsv.xz: Read IDs for the actual 16S reads.
data-WoLv1_placement/distances_WoLv1.csv.xz: Pairwise tree distances on the WoLv1 tree.
data-WoLv1_placement/read_id_mapping-queries16S.tsv.xz: Read IDs for simulated genome-wide short reads for 16S comparison.

`results/`

expt-hmi/hmi-ogu-RefSeqCIIdup/feature_tables_merged*: Feature tables in HMI, using krepp OGUs wrt the duplicated RefSeq snapshot, in .qza/BIOM format; -f for 0.0001 filtered.
alignment_comparison/*-alignment_comparison.csv.xz: Comparison of hit counts and read distances across references in WoLv2; krepp versus bowtie2.
expt-emp/emp-ogu-WoLv2/feature_tables_merged*: Feature tables in EMP, using krepp OGUs wrt WoLv2, in .qza/BIOM format; -f for 0.0001 filtered.
expt-emp/emp-ogu-WoLv1/feature_tables_merged*: Feature tables in EMP, using krepp OGUs wrt WoLv1, in .qza/BIOM format; -f for 0.0001 filtered.
algorithmic_evaluation/multitree_heights_info-WoLv2.tsv.xz: Multitree heights of krepp's color index in WoLv2.
all_simulations-results.tsv.xz: Combined results for all simulations.
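
Several of the files above are xz-compressed tables. They can be streamed directly with Python's standard library, without decompressing to disk first. The sketch below is a minimal reader; the function name is hypothetical, and whether a given file has a header row is not assumed either way.

```python
import csv
import io
import lzma


def read_xz_table(source, delimiter="\t"):
    """Yield rows from an xz-compressed delimited text file.

    `source` may be a file path or a binary file object. Use
    delimiter="," for the .csv.xz files.
    """
    with lzma.open(source, "rt", newline="") as handle:
        yield from csv.reader(handle, delimiter=delimiter)


# Usage (the path is a placeholder):
# for row in read_xz_table("path/to/table.tsv.xz"):
#     print(row)
```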
