Data from: krepp: A k-mer-based maximum pseudo-likelihood method for estimating read distances and genome-wide phylogenetic placement
Data files
Feb 03, 2026 version files 95.52 GB
- all_results_combined.tar.gz (7.50 GB)
- backbone_trees.tar.gz (787.35 KB)
- expt-alignment_comparison.tar (28.35 GB)
- expt-simulated_genomes.tar.gz (2.97 GB)
- match_stats-krepp_dth4-all.tsv.gz (1.23 MB)
- microbiome_metadata.tar.gz (10.39 KB)
- queries-WoLv1_16S_comparison.tar.gz (5.09 MB)
- queries-WoLv2_placement_comparison.tar.gz (12.29 MB)
- query_reference_pairs-WoLv2-ge20p.tar.gz (360.07 KB)
- README.md (19.24 KB)
- resource_usage_benchmarking.tar.gz (53.72 GB)
- shared.tar.gz (2.96 GB)
Abstract
Comparing each sequencing read in a sample to a reference database is a fundamental step in wide-ranging applications. The results of these comparisons can facilitate phylogenetic characterization. However, phylogenetic placement is currently only possible at scale for marker genes, a small fraction of the genome. We introduce krepp, an alignment-free k-mer-based method that enables placing reads from anywhere on the genome on an ultra-large reference phylogeny (e.g., 123,853 leaves). This repository contains data from benchmarking experiments in which we show the scalability and accuracy of krepp. We also demonstrate the ability of our method to compare and characterize real metagenomic samples.
Datasets and queries analyzed to benchmark krepp, together with results and evaluation metrics.
The preprint is available at bioRxiv.
The source code can be found on GitHub (github.com/bo1929/krepp).
The version used in the manuscript is v0.4.5 (which is also available on Zenodo).
All results, auxiliary data, and scripts used in the analyses can be found at github.com/bo1929/shared.krepp.
For reference indexes, refer to:
- https://github.com/bo1929/krepp/wiki/Available-reference-indexes
- https://registry.opendata.aws/kreppref/
- https://ter-trees.ucsd.edu/data/krepp/
Please refer to this repository for analysis of the data presented here. We note that the placement error is measured either in the number of edges or the branch length difference. We measure the distances in normalized Hamming distance and in ANI for genome-wide comparison. For QIIME archives, please refer to relevant documentation.
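As a small illustration of the two distance notions mentioned above, the following Python sketch computes a normalized Hamming distance between two equal-length sequences and converts an ANI value into the 1-ANI form used in several files of this dataset. The function names are hypothetical and the snippet is not taken from krepp; it only shows the definitions, assuming ANI is expressed as a fraction in [0, 1].

```python
def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def ani_to_distance(ani: float) -> float:
    """Convert an ANI value given as a fraction in [0, 1] to 1 - ANI."""
    if not 0.0 <= ani <= 1.0:
        raise ValueError("ANI must be a fraction between 0 and 1")
    return 1.0 - ani


if __name__ == "__main__":
    # Two toy 8-mers that differ at two positions.
    print(normalized_hamming("ACGTACGT", "ACGTTCGA"))
    print(ani_to_distance(0.95))
```

If a file reports ANI as a percentage, divide by 100 before calling `ani_to_distance`.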
Access information
Other relevant data can be accessed as listed below:
- 29 bacterial single-cell assembled genomes from the GORG dataset are available at NCBI under BioProject ID GenBank: PRJEB33281 (https://doi.org/10.1016/j.cell.2019.11.01).
- The subsampled whole-genome sequencing reads for human microbiome analysis and feature tables for Woltka were derived from Zhu et al. (https://doi.org/10.1128/msystems.00167-22).
- The raw sequencing data for Earth's microbiome analysis from Shaffer et al. (https://doi.org/10.1038/s41564-022-01266-x) is available at www.qiita.ucsd.edu (study: 13114).
- The Web of Life databases can be accessed at https://biocore.github.io/wol/ (version 1) and http://ftp.microbio.me/pub/wol2/ (version 2).
- Reference genomes (a RefSeq snapshot as of January 8, 2019) and the taxonomy for CAMI-II experiments are available at https://cami-challenge.org/reference-databases.
- Contigs analyzed from the gold standard assembly for taxonomic binning experiments can be found at http://frl.publisso.de/data/frl:6425521/marine/ and http://frl.publisso.de/data/frl:6425521/strain/, for marine and strain-madness datasets.
- CAMI-II results for other methods submitted to the challenge are available at https://github.com/CAMI-challenge/second_challenge_evaluation.
- Scripts for postprocessing results, computing evaluation metrics, and generating figures, together with the results, can also be found at https://github.com/bo1929/shared.krepp.
Description of the data and file structure
Files and variables
backbone_trees/
- backbone-*.nwk: WoL-v1 and WoL-v2 backbone trees with 10,575 and 15,953 references, respectively
- ladderized_tree-*.nwk: A caterpillar tree for WoL-v2 and WoL-v1 references
- random_tree-*.nwk: A tree simulated with a dual-birth model for WoL references
microbiome_metadata/
- hmi_retained_samples.tsv: Sample IDs used in human microbiome analysis
- metadata_emp-*.tsv: Metadata (sample ID and environment label at a specific EMPO level) of Earth's Microbiome Project according to EMPO (version 1 and 2)
- qiime2_hmi_metadata.tsv: Metadata of human microbiome samples (body sites and subject genders)
queries-WoLv1_16S_comparison/
- all_mindist.tsv: Branch lengths to the closest reference (scaled by 100)
- dist_to_closest.txt: 1-ANI to the closest reference for each query (with IDs)
- download.list: Download links to all WoL-v1 reference genomes
- errors16S-EPAng.txt: Edge errors for each 16S read placement across different query genomes
- metadata.tsv: Metadata for WoL-v1 references
- query16S/queries16S_all.txt: Query genomes together with distances (branch length) to the closest reference
- query16S/query_list.txt: All IDs of query genomes
- query16S/simulate_reads.sh: Commands used to simulate reads
- ranks.tid.tsv: Taxonomic ID labels of WoL-v1 reference genomes
- ranks.tsv: Taxonomic labels of WoL-v1 reference genomes
- read_id_mapping-queries16S.tsv.xz: Genome-wide read ID to query genome mapping for queries
- read_id_mapping-reads16S.tsv.xz: 16S read ID to query genome mapping for queries
- reads16S.tar.gz: 16S reads for all query genomes
- reference_list.txt: The list of references from WoL-v1
queries-WoLv2_placement_comparison/
- all.map: Genome ID to accession ID mapping
- all_seq-WoLv2.map: Genome ID to accession ID mapping
- assembly_info-WoLv2.tsv: Metadata for WoL-v2 genomes
- downloads.txt: URLs to download WoL-v2 genomes
- query_selection/cluster_varying_threshold.sh: Script used to cluster references for selection
- query_selection/get_common_singletons.sh: Script used to find the intersection of singleton genomes in clusters
- query_selection/read_id_mapping.tsv.xz: Genome ID to read ID mapping
- query_selection/selected_queries-mindist.tsv: Selected queries and their minimum distance to the closest reference
- query_selection/selected_queries.txt: The list of queries selected
- query_selection/selected_queries_levels.tsv: Selected queries and the TreeCluster threshold that they correspond to
- query_selection/simulate_reads.sh: Script used to simulate reads
- query_selection/singletons-*.tsv: List of singleton references at each clustering threshold
- query_selection/tree_clustering-t*.tsv: Result of TreeCluster at each threshold
- reference_list.txt: List of all reference genomes used in WoL-v2
- taxonomy_WoLv2.tsv: Taxonomic labels of genomes from WoL-v2
query_reference_pairs-WoLv2-ge20p/
- all_distances-mash.tsv: All pairwise 1-ANI values according to Mash
- all_distances-orthoANI.tsv: All pairwise 1-ANI values according to orthoANI
- all_distances-skani.tsv: All pairwise 1-ANI values according to skani
- all_genomes.tsv: All references with a sufficiently high number of mapped reads
- all_pairs.tsv: All pairs with sufficient mapping rate
- common_distances-mash.tsv: Distances for common pairs with >=20% mapping rate according to Mash
- common_distances-orthoANI.tsv: Distances for common pairs with >=20% mapping rate according to orthoANI
- common_distances-skani.tsv: Distances for common pairs with >=20% mapping rate according to skani
- common_pairs.tsv: Pairs that have >=20% reads mapped for both bowtie2 and krepp
- logs.tar.gz: Log files for ANI estimation
- mash_estimate.sh: Command for estimating ANI using Mash
- orthoANI_distances.tar.gz: Output of orthoANI for different query/reference pairs
- orthoANI_estimate.sh: Command for estimating ANI using orthoANI
- skani_distances.tar.gz: Output of skani for different query/reference pairs
- skani_estimate.sh: Command for estimating ANI using skani
expt-simulated_genomes/
- build_libraries.sh: Commands used to build a krepp index for each reference genome
- estimate_distances.sh: Commands used to estimate distances from mutated genomes to the reference genomes
- evaluate_distances.sh: Commands used to evaluate the accuracy of estimates
- genomes_fasta_paths.txt: Paths of reference genomes used to generate mutated genomes
- mutated_fasta_paths.txt: Paths of mutated genomes with their corresponding distance configuration
- postprocess_for_eval.py: Script used to evaluate the accuracy of krepp estimates
- simulated_reads_paths.txt: Paths of reads simulated from mutated genomes
- distance_estimates/{$GENOME_ID}_a{$ALPHA}_gnd{$DISTANCE}-distances.tsv: Resulting distance estimates for reads simulated from a genome mutated from $GENOME_ID with $DISTANCE and $ALPHA
- distance_evaluation/{$GENOME_ID}_a{$ALPHA}_gnd{$DISTANCE}-results.tsv: Accuracy of distance estimates for reads simulated from a genome mutated from $GENOME_ID with $DISTANCE and $ALPHA
- distance_evaluation/all_simulations-results.tsv: All results concatenated
- simulated_genomes/${GENOME_ID}/${GENOME_ID}_contigs.fasta: Contigs (nucleic acids) of the original assembled reference genome
- simulated_genomes/${GENOME_ID}/${GENOME_ID}_contigs_genes.faa: Genes (amino acids) of the original assembled reference genome
- simulated_genomes/${GENOME_ID}/g_weights_a22.npy: Weights used to drop mutations (alpha=22)
- simulated_genomes/${GENOME_ID}/g_weights_a5.npy: Weights used to drop mutations (alpha=5)
- simulated_genomes/${GENOME_ID}/input_map.tsv: The path to the reference to index
- simulated_genomes/${GENOME_ID}/mutated_BLOSUM62_{$GENOME_ID}_a{$ALPHA}_gnd{$DISTANCE}/mutated_genes.faa: Genes for the mutated genome with $DISTANCE and $ALPHA
- simulated_genomes/${GENOME_ID}/mutated_BLOSUM62_{$GENOME_ID}_a{$ALPHA}_gnd{$DISTANCE}/mutated.fasta: Sequences for the mutated genome with $DISTANCE and $ALPHA
- simulated_genomes/${GENOME_ID}/mutated_BLOSUM62_{$GENOME_ID}_gnd{$DISTANCE}/gnd_aad.txt: The true distance between the reference and the mutated genome
- simulated_reads/${GENOME_ID}_a{$ALPHA}_gnd{$DISTANCE}_errFree.fasta: Reads simulated from the mutated genome
- simulated_reads/${GENOME_ID}_a{$ALPHA}_gnd{$DISTANCE}.aln: Coordinates of the simulated reads
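The simulation outputs encode the genome ID, alpha, and distance in the filename. A minimal Python sketch for recovering these fields from a `-distances.tsv` or `-results.tsv` filename is given below. The regular expression is an assumption based only on the naming pattern described above (in particular, that alpha and distance contain no underscores), the function and class names are hypothetical, and the example filename is made up for illustration. Values are kept as strings because the exact numeric formatting is not specified here.

```python
import re
from typing import NamedTuple


class SimulationConfig(NamedTuple):
    genome_id: str
    alpha: str
    distance: str


# Assumed pattern: <GENOME_ID>_a<ALPHA>_gnd<DISTANCE>-(distances|results).tsv
_PATTERN = re.compile(
    r"^(?P<genome_id>.+)_a(?P<alpha>[^_]+)_gnd(?P<distance>[^_]+?)"
    r"-(?:distances|results)\.tsv$"
)


def parse_simulation_filename(name: str) -> SimulationConfig:
    """Split a simulation output filename into its three components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized simulation filename: {name!r}")
    return SimulationConfig(
        match["genome_id"], match["alpha"], match["distance"]
    )


if __name__ == "__main__":
    # Hypothetical filename, used only to exercise the parser.
    print(parse_simulation_filename("G000000001_a5_gnd0.05-distances.tsv"))
```

Files that do not follow the pattern, such as all_simulations-results.tsv, raise a `ValueError` and can be skipped by the caller.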
hmi_reads/
- SR${SAMPLE_ID}.${PAIR_END}.fq: 1M subsampled paired-end ($PAIR_END) short reads for sample $SAMPLE_ID
match_stats-krepp_dth4-all.tsv.gz
Results for the proportions of reads with at least one match up to the Hamming distance threshold (three columns: genome ID, read ID, and the minimum match HD).
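A minimal Python sketch for summarizing this table is shown below. It assumes a tab-separated file whose columns are, in order, genome ID, read ID, and minimum match HD, with one row per read; whether the file has a header row is not stated here, so rows whose third column is not an integer are skipped. The function names and the default threshold of 4 (suggested by "dth4" in the filename) are assumptions. The fractions are relative to the rows present in the file, not to the total number of simulated reads.

```python
import csv
import gzip
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def read_match_stats(path: str) -> Iterator[List[str]]:
    """Yield rows of a gzipped TSV (genome ID, read ID, minimum match HD)."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def cumulative_match_fractions(
    rows: Iterable[List[str]], max_hd: int = 4
) -> Dict[int, float]:
    """Fraction of listed reads whose minimum match HD is <= each threshold."""
    counts: Counter = Counter()
    total = 0
    for row in rows:
        if len(row) < 3:
            continue
        try:
            hd = int(row[2])
        except ValueError:
            continue  # e.g. a header line, if one exists
        counts[hd] += 1
        total += 1
    fractions: Dict[int, float] = {}
    running = 0
    for threshold in range(max_hd + 1):
        running += counts[threshold]
        fractions[threshold] = running / total if total else 0.0
    return fractions


if __name__ == "__main__":
    rows = read_match_stats("match_stats-krepp_dth4-all.tsv.gz")
    for threshold, fraction in cumulative_match_fractions(rows).items():
        print(threshold, fraction)
```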
expt-alignment_comparison
- extra_queries_genomes/${GENOME_ID}.fna: Selected genomes from which reads were simulated for distance benchmarking
- alignment_reports-WoLv2/${GENOME_ID}.sam: Resulting SAM files obtained using bowtie2
- alignment_reports_postprocessed-WoLv2/{GENOME_ID}.tsv: Postprocessed SAM files with alignment distances for aligned reads
- distance_reports-WoLv2/${GENOME_ID}.tsv: Distance estimates of krepp
- distance_reports_postprocessed-WoLv2/{GENOME_ID}.tsv: Postprocessed version of krepp distance estimates used to compute evaluation metrics
- summary_reports-WoLv2/count_summary-{GENOME_ID}.csv: Evaluation metrics for distances per query genome summarizing all reads
- summary_reports-WoLv2/read_summary-{GENOME_ID}.csv: Evaluation metrics for distances per query genome for each read
- BOWTIE_VERSION: Bowtie2 version information used in the benchmarking
- mash_estimate.sh: Script for pairwise ANI estimations using Mash
- build_library.sh: Script for bowtie2 indexing of WoL databases
- align_queries.sh: Script for bowtie2 alignment
- estimate_queries.sh: Script used for krepp (alpha-version) distance estimation
- postprocess_results.sh: Script used for postprocessing of results for computing metrics (see shared.krepp for each Python script)
- logs-WoLv2/{JOB_NAME}.out: Log files for stdout
- logs-WoLv2/{JOB_NAME}.err: Log files for stderr
- novel_queries_sampreads/{GENOME_ID}.fq: Reads simulated using ART for novel query genomes
- selected_queries_sampreads/{GENOME_ID}.fq: Reads simulated using ART for selected query genomes
resource_usage_benchmarking
- backbone-WoLv1*.nwk: Trees in Newick format for the WoLv1 subsamples (with 2000 and 5000 references).
- benchmark-WoL*.sh: Scripts used for benchmarking across indexes with varying sizes.
- build_library*.sh: Scripts used to create Bowtie2 and krepp indexes for reference subsets.
- concat-WoLv*: Concatenated FASTA file for reference genome subsets.
- db-WoLv*: Bowtie2 indexes.
- input_map-WoL*: The list of references.
- library-WoLv1_sampled*: krepp indexes for WoLv1 subsets.
- logs/bowtie2*: Logs for Bowtie2 benchmarking.
- logs/krepp*: Logs for krepp benchmarking.
- logs/resource_benchmarking*: Logs for resource usage comparison.
- ref_ids-WoLv1*: IDs of references.
- resource_benchmarking.tsv: Benchmarking results (running time and memory) for both querying and indexing.
- selected_queries_concat*: Simulated input reads used for benchmarking.
All results and metrics combined
all_results_combined/expt-hmi/
- hmi-woltka-WoLv2: Human microbiome results for Woltka OGUs using WoL-v2 reference.
- hmi-bracken-WoLv2: Human microbiome results for Bracken profiles using WoL-v2 reference.
- hmi-ogu-RefSeqCIIdup: Human microbiome results for krepp OGUs using duplicated uDance tree as reference.
- hmi-pp-RefSeqCIIdup: Human microbiome results for krepp placements using duplicated uDance tree as reference.
- hmi-ogu-WoLv2: Human microbiome results for krepp OGUs using duplicated WoL-v2 reference.
- microbiome_metadata.tsv: Metadata of samples.
- hmi-pp-RefSeqCII: Human microbiome results for krepp placements using deduplicated uDance tree as reference.
- hmi_separation_summary.csv: Summary for separation statistics (pseudo-F) across all methods/references.
- hmi-ogu-RefSeqCII: Human microbiome results for krepp OGUs using deduplicated uDance tree as reference.
- hmi-pp-WoLv2: Human microbiome results for krepp placements using duplicated WoL-v2 reference.
- hmi-woltka-WoLv2-v0: Human microbiome results for Woltka OGU using WoL-v2 as reference.
all_results_combined/alignment_comparison/
- count_summary-all-WoLv2-alignment_comparison.csv: Number of matches per query read.
- all_results_combined/alignment_comparison/read_summary-WoLv2-alignment_comparison-1M.csv: Distance benchmarking against alignment, metrics per read (1M subsample) for all queries.
- read_summary-WoLv2-alignment_comparison.csv: Distance benchmarking against alignment, metrics per read for all queries.
- resource_benchmarking.tsv: Running time and memory use (bowtie2 and krepp).
- dist_to_closest-final.tsv: Novelty values for queries, measured by Mash.
- reference_summary-all-WoLv2-alignment_comparison.csv: Distance benchmarking against alignment summarizing reads per query.
all_results_combined/expt-appspam_comparison/
- expt-Bartonella_50/*: App-SpaM comparison on Bartonella.
- expt-Mycobacterium_40/*: App-SpaM comparison on Mycobacterium.
- expt-Rhizobiaceae_50/*: App-SpaM comparison on Rhizobiaceae.
- expt-Piscirickettsiaceae_40/*: App-SpaM comparison on Piscirickettsiaceae.
- expt-Moraxella_40/*: App-SpaM comparison on Moraxella.
- expt-Bacteroides_40/*: App-SpaM comparison on Bacteroides.
- appspam_comparison.tsv: App-SpaM comparison combining all results.
- **/appspam_eval_metrics*.tsv: App-SpaM results across reads (with and without cp; the default is without).
- **/krepp_eval_metrics*.tsv: krepp results across all reads for different configurations, in particular the HD threshold.
all_results_combined/expt-emp/
- emp_separation_summary-v*.tsv: Separation results across EMPO levels (v1 and v2).
- emp-pp-WoL*: krepp placement results and BIOM tables on Earth's microbiome for WoL-v1/v2.
- emp-ogu-WoL*: krepp OGUs results and BIOM tables on Earth's microbiome for WoL-v1/v2.
- metadata_emp-v*.tsv: Sample labels across EMPO levels (v1 and v2).
- In the subfolders, files with *.qza and *.qzv extensions are QIIME 2 zipped artifacts. Please refer to the QIIME 2 documentation for details and for how to export them to other data formats.
- In these filenames, empo_${LEVEL} stands for the level of the EMP ontology for which the categories are considered for beta significance calculations.
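Because QIIME 2 artifacts are zip archives, their contents can be inspected without a QIIME 2 installation. The Python sketch below lists the payload files of a .qza or .qzv file using only the standard library; it assumes the usual artifact layout in which a single top-level directory (named by the artifact UUID) contains a data/ subdirectory. The function name and the example path are hypothetical. For actual export, the `qiime tools export` command described in the QIIME 2 documentation remains the supported route.

```python
import zipfile
from typing import IO, List, Union


def list_qza_data_files(qza: Union[str, IO[bytes]]) -> List[str]:
    """Return paths of files stored under <uuid>/data/ in a QIIME 2 archive."""
    with zipfile.ZipFile(qza) as archive:
        names = archive.namelist()
    data_files = []
    for name in names:
        parts = name.split("/")
        # Assumed layout: <uuid>/data/<relative path of the payload file>
        if len(parts) >= 3 and parts[1] == "data" and parts[-1]:
            data_files.append("/".join(parts[2:]))
    return sorted(data_files)


if __name__ == "__main__":
    # Hypothetical path; substitute any .qza or .qzv file from the subfolders.
    for path in list_qza_data_files("example_artifact.qza"):
        print(path)
```

To extract a listed file, `zipfile.ZipFile.extract` can be called with the full member name, including the UUID prefix.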
all_results_combined/algorithmic_evaluation/
- multitree_heights_info-WoLv2.tsv: Heights of the nodes of the multitree.
- out_degrees.tsv: Out-degrees of the nodes of the multitree.
- match_stats-krepp_dth4-all.tsv: The number of matches per read at each HD threshold.
- all_simulations-results.tsv: Evaluation metrics for distance benchmarking with simulated genomes.
- clade_versus_multitree_sizes.tsv: Information for colors corresponding to tree nodes/clades.
- color_graph_stats.tsv: Statistics for the color multitree (including degrees) and maximal clades.
- index_info-WoLv2.tsv: Details of the color multitree (including the number of k-mers).
- postorder_sizes.tsv: Size of the color multitree during the post-order traversal of the tree.
all_results_combined/cami-ii/
See amber_strain_madness_contigs-min_dist and amber_marine_contigs-min_dist for results used in the paper (computed using AMBER).
Please refer to the AMBER documentation for descriptions of these files and the metrics reported.
all_results_combined/placement_comparison
- ppmetrics-heuristic_comparison.tsv: Placement metrics (edge errors) on WoLv2 comparing krepp-closest, krepp-LCA, krepp, and bowtie2-closest.
- ppmetrics-bowtie-closestpp.tsv: Placement metrics (edge errors) only for bowtie2-closest.
- ppmetrics_queries16S-all.tsv: Placement metrics (edge errors) for queries used in 16S marker placement benchmarking.
- ppmetrics_reads16S-all.tsv: Placement metrics (edge errors) for 16S reads placed using krepp (not used in the paper).
Files that are missing in the shared.krepp GitHub repository:
misc_data/
- data-WoLv2_placement/query_selection/read_id_mapping.tsv.xz: Mapping from read IDs to genomes.
- data-WoLv1_placement/read_id_mapping-reads16S.tsv.xz: Read IDs for simulated genome-wide short reads for 16S comparison.
- data-WoLv1_placement/distances_WoLv1.csv.xz: Pairwise tree distances on the WoLv1 tree.
- data-WoLv1_placement/read_id_mapping-queries16S.tsv.xz: Read IDs for the actual 16S reads.
results/
- expt-hmi/hmi-ogu-RefSeqCIIdup/feature_tables_merged*: Feature tables in HMI, using krepp OGUs with respect to the duplicated RefSeq snapshot, in .qza/BIOM format; -f for 0.0001 filtered.
- alignment_comparison/*-alignment_comparison.csv.xz: Comparison of hit counts and read distances across references in WoLv2; krepp versus bowtie2.
- expt-emp/emp-ogu-WoLv2/feature_tables_merged*: Feature tables in EMP, using krepp OGUs with respect to WoLv2, in .qza/BIOM format; -f for 0.0001 filtered.
- expt-emp/emp-ogu-WoLv1/feature_tables_merged*: Feature tables in EMP, using krepp OGUs with respect to WoLv1, in .qza/BIOM format; -f for 0.0001 filtered.
- algorithmic_evaluation/multitree_heights_info-WoLv2.tsv.xz: Multitree heights of krepp's color index in WoLv2.
- all_simulations-results.tsv.xz: Combined results for all simulations.
