Data from: Memory-bound k-mer selection for large and evolutionary diverse reference libraries

Şapcı, Ali Osman Berk; Mirarab, Siavash

Research facility: University of California, San Diego

Published Sep 10, 2024 on Dryad. https://doi.org/10.5061/dryad.0000000c2

Data files

Sep 10, 2024 version files (110.19 GB total):

human_pangenome-lib_rand_free-k29_w34_h13_b16_s8.tar.gz: 2.94 GB
README.md: 8.79 KB
refseq_cami2-lib_rand_free-k30_w35_h14_b16_s9.tar.gz: 24.24 GB
refseq_cami2-lib_reps_adpt-k30_w35_h14_b16_s9.tar.gz: 24.69 GB
shared-files.tar: 250.44 MB
shared-files.tar.gz: 48.03 MB
wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz: 6.46 GB
wol_v1-lib_rand_free-k30_w35_h14_b16_s9.tar.gz: 22.51 GB
wol_v1-lib_reps_adpt-k29_w35_h13_b16_s8.tar.gz: 6.57 GB
wol_v1-lib_reps_adpt-k30_w35_h14_b16_s9.tar.gz: 22.48 GB

Abstract

Finding sequence matches with k-mers is increasingly common in bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. Reference k-mers are kept in memory at query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of the k-mers present in an ultra-large dataset to include in a library such that the classification of reads suffers the least. Our experiments demonstrate the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called KRANK (K-mer RANKer) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive-hashing classifier CONSULT-II to build a taxonomic classification and profiling method. On several benchmarks, KRANK k-mer selection dramatically reduces memory consumption with minimal loss in classification accuracy. We show in extensive analyses based on CAMI benchmarks that KRANK outperforms k-mer-based alternatives in taxonomic profiling and comes close to the best marker-based methods in accuracy.

Data belonging to the following papers:

Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound k-mer selection for large and evolutionary diverse reference libraries. Genome Research.
Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound and taxonomy-aware k-mer selection for Ultra-large reference libraries. In J. Ma (Ed.), Research in Computational Molecular Biology (pp. 340–343). Springer Nature Switzerland. https://doi.org/10.1007/978-1-0716-3989-4_26

See https://ter-trees.ucsd.edu/data/krank/ for a catalog of libraries and the query reads that we simulated for benchmarking. Descriptions of the libraries and a tutorial can be found in the main GitHub repository.

KRANK libraries

wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz: lightweight WoL-v1 library (6.25 GB) with random selection (fast-mode)
wol_v1-lib_rand_free-k30_w35_h14_b16_s9.tar.gz: high-sensitivity WoL-v1 library (25 GB) with random selection (fast-mode)
wol_v1-lib_reps_adpt-k29_w35_h13_b16_s8.tar.gz: lightweight WoL-v1 library (6.25 GB) with ranking-based selection (selective-mode)
wol_v1-lib_reps_adpt-k30_w35_h14_b16_s9.tar.gz: high-sensitivity WoL-v1 library (25 GB) with ranking-based selection (selective-mode)
human_pangenome-lib_rand_free-k29_w34_h13_b16_s8.tar.gz: human pangenome library (3 GB) with random selection (fast-mode)
refseq_cami2-lib_rand_free-k30_w35_h14_b16_s9.tar.gz: high-sensitivity CAMI-II library (25 GB) with random selection (fast-mode)
refseq_cami2-lib_reps_adpt-k30_w35_h14_b16_s9.tar.gz: high-sensitivity CAMI-II library (25 GB) with ranking-based selection (selective-mode)

To reproduce the CAMI-II profiling results presented in the paper, query against both the fast-mode and selective-mode libraries at the same time. The human pangenome library was not used in the paper and is mainly intended for contamination removal.
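The archive names above follow a regular pattern, so their components can be pulled apart programmatically when scripting downloads or bookkeeping. The sketch below is only an illustration: the pattern is inferred from the listed file names, the function name `parse_library_name` and the field labels `reference`, `selection`, and `size_constraint` are hypothetical, and the meaning of the numeric parameters (k, w, h, b, s) is not defined here and should be taken from the KRANK repository.

```python
import re
from typing import Dict, Union

# Pattern inferred from the archive names in this dataset (an assumption,
# not an official naming specification).
_LIBRARY_NAME = re.compile(
    r"^(?P<reference>.+)-lib_(?P<selection>[a-z]+)_(?P<size_constraint>[a-z]+)"
    r"-k(?P<k>\d+)_w(?P<w>\d+)_h(?P<h>\d+)_b(?P<b>\d+)_s(?P<s>\d+)\.tar\.gz$"
)


def parse_library_name(filename: str) -> Dict[str, Union[str, int]]:
    """Split a library archive name into its labeled components.

    The labels 'selection' and 'size_constraint' are illustrative names for
    the two tokens after 'lib_' (e.g. 'rand'/'free' or 'reps'/'adpt').
    """
    match = _LIBRARY_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized library archive name: {filename}")
    fields: Dict[str, Union[str, int]] = dict(match.groupdict())
    # Convert the numeric parameters from strings to integers.
    for key in ("k", "w", "h", "b", "s"):
        fields[key] = int(fields[key])
    return fields
```

For example, `parse_library_name("wol_v1-lib_reps_adpt-k29_w35_h13_b16_s8.tar.gz")` yields the reference name `wol_v1`, the tokens `reps` and `adpt`, and the integers 29, 35, 13, 16, and 8.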

KRANK-0.5.1.tar.gz

Description: This archive includes KRANK software v0.5.1, together with scripts for testing and a tutorial.

shared-files.tar.gz

Description: This archive includes all scripts used to evaluate read classification and profiling accuracy, the resulting metrics and figures, and the scripts used to generate them. It also provides taxonomy information, download links for query genomes, and other auxiliary data such as query-reference distance information.

Results and evaluation metrics

results/cscores-10kSpecies_-_combined.csv: taxonomic classification results for all tools on WoL queries
results/all_tools-profiling_evaluation-CAMI1_hc.tsv: abundance profiling metrics of all tools on CAMI-I high-complexity dataset
results/cscores-10kSpecies_-_with_sizes.csv: taxonomic classification results for all tools on WoL queries with library sizes of each tool
results/resultsCAMI2-marine.tsv: abundance profiling metrics of all tools on CAMI-II marine dataset
results/cscores-10kSpecies_-_KRANK-candidates.csv: comparison of different sizes and parameters for KRANK on WoL taxonomic classification
results/cscores-10kSpecies_-_Kraken-II_4Gb.csv: taxonomic classification results of Kraken-II on WoL using 4GB
results/resultsCAMI2-strain_madness.tsv: abundance profiling metrics of all tools on CAMI-II strain-madness dataset
results/cscores-10kSpecies_-_CLARK.csv: taxonomic classification results of CLARK on WoL
results/cscores-10kSpecies_-_Kraken-II_16Gb.csv: taxonomic classification results of Kraken-II on WoL using 16GB
results/cscores-10kSpecies_-_CONSULT-II.csv: taxonomic classification results of CONSULT-II on WoL using the default configuration
results/running_times-query.tsv: query and library construction running times
results/cscores-10kSpecies_-_KRANK-rankingkmers_comparison.csv: comparison of different heuristics for KRANK’s selection on WoL taxonomic classification
results/cscores-10kSpecies_-_Kraken-II_default.csv: taxonomic classification results of Kraken-II on WoL using the default parameters
results/cscores-10kSpecies_-_KRANK-sizeconst_comparison.csv: comparison of different heuristics for KRANK’s size constraint on WoL taxonomic classification

Auxiliary data, dataset descriptions and taxonomy information

data/ReferenceTaxonomy-nodes.dmp.gz: WoL-v1 taxonomy nodes
data/ref_taxa_counts.txt: genome counts for each taxon in WoL-v1 with rank information
data/10kBacteria-metadata.tsv: WoL-v1 metadata including download links and additional information for reference genomes
data/query_genomes_list.txt: IDs of genomes used in read classification on WoL (simulated reads are available from the catalog at https://ter-trees.ucsd.edu/data/krank/)
data/ref_genome_counts: genome counts for each taxon in WoL-v1
data/ReferenceTaxonomyRWoL-nodes.dmp.gz: WoL-v1 taxonomy nodes reduced to species set
data/taxonomy_lookup: taxonomy lookup table used by CONSULT-II, parent list of each taxon
data/dist_wrt_lastcommonrank.csv: Jaccard similarity between randomly sampled genomes and their corresponding groups
data/query_ranks.tsv: taxonomy information for query genomes, ground truth for evaluation of taxonomic classification
data/reference_genomes_list: reference genomes and corresponding species
data/uDance-ranks_tid.tsv: WoL-v2 taxonomic ranks, some queries were retrieved from here
data/10kBacteria-ranks_tid.tsv: WoL-v1 taxonomic ranks, all genomes in the reference library
data/dist_to_closest.txt: closest reference genome of each query genome and their genomic distance as estimated by Mash
data/download-links/all_download.txt: all download links for WoL-v2 used in uDance
data/download-links/download_final_extra_queries.txt: download links for genomes that are not used in CONSULT-II paper
data/download-links/genomes_uniq_uDance.txt
data/download-links/downloads-uDance_exc10k.txt
data/auxiliary/sampleg_dists.txt
data/auxiliary/dist-extra-to-closest.txt
data/auxiliary/uDance_exc10k-ranks_tid.tsv
data/auxiliary/uDance-genera_list.txt
data/auxiliary/uDance-species_list.txt
data/auxiliary/uDance_oneperfamily-ranks_tid.tsv
data/auxiliary/uDance_exc10k-order_info
data/auxiliary/closest_taxon_wrank.txt
data/auxiliary/dist-bacteria-to-closest.txt
data/auxiliary/uDance_exc10k-ranks_tid-downloadable.tsv
data/auxiliary/dist-to-closest.txt
data/auxiliary/dist-archaea-to-closest.txt
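The taxonomy files above describe a tree through parent pointers, and a lookup table of the kind mentioned for data/taxonomy_lookup lists the ancestors of each taxon. As a minimal sketch of that idea, assuming the nodes files follow the standard NCBI taxdump `nodes.dmp` layout (fields separated by `|`, with the taxon ID first and the parent ID second), the code below reads the parent map and walks it up to the root. The function names `read_parents` and `lineage` are hypothetical, and the output is not claimed to match the format CONSULT-II expects.

```python
from typing import Dict, List, TextIO


def read_parents(handle: TextIO) -> Dict[str, str]:
    """Map each taxon ID to its parent ID from an NCBI-style nodes.dmp stream."""
    parents: Dict[str, str] = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        # Fields are separated by '|' and padded with tabs.
        fields = [field.strip() for field in line.split("|")]
        parents[fields[0]] = fields[1]
    return parents


def lineage(taxon: str, parents: Dict[str, str]) -> List[str]:
    """Return [taxon, parent, grandparent, ..., root].

    In NCBI dumps the root is its own parent, so the walk stops when a
    parent is missing or has already been visited.
    """
    path = [taxon]
    seen = {taxon}
    while True:
        parent = parents.get(path[-1])
        if parent is None or parent in seen:
            break
        path.append(parent)
        seen.add(parent)
    return path
```

To try this on the gzipped nodes file, one could open it in text mode with `gzip.open("data/ReferenceTaxonomy-nodes.dmp.gz", "rt")` and pass the handle to `read_parents`.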

Scripts used to do the empirical evaluation and create figures

scripts/construct_taxonomy_lookup.py: constructs the taxonomy lookup table for CONSULT-II from a taxonomy nodes file
scripts/shrink_taxdump.py: given taxonomy nodes and names files and a set of species, reduces taxonomy to the set of species of interest
scripts/evaluate_CLARK.py: custom script to evaluate the read classification output of CLARK, computes TP/FP/TN/FN for each rank and each read
scripts/evaluate_KRANK.py: custom script to evaluate the read classification output of KRANK, computes TP/FP/TN/FN for each rank and each read
scripts/evaluate_CONSULTII.py: custom script to evaluate the read classification output of CONSULT-II, computes TP/FP/TN/FN for each rank and each read
scripts/evaluate_KrakenII.py: custom script to evaluate the read classification output of Kraken-II, computes TP/FP/TN/FN for each rank and each read
scripts/summarize_evaluations.py: summarizes TP/FP/TN/FN counts across ranks and genomes; should be used with the output of the above evaluate_*.py scripts
scripts/prepprocess_methods_psummary.py: uses distances in data/dist_to_closest.txt to compute F1/precision/recall for different distance levels
scripts/prepprocess_methods_summary.py: uses distances in data/dist_to_closest.txt to compute F1/precision/recall across different novelty bins
scripts/prepprocess_methods_csummary.py: uses distances in data/dist_to_closest.txt to compute F1/precision/recall across taxon sizes
scripts/match_closest_taxon.py
scripts/dist_wrt_lastcommonrank.py
scripts/get_taxa_count.py
scripts/count_taxa.sh
scripts/find_closest_taxon.sh
scripts/resource_benchmarking.R
scripts/shared_kmers_analysis.R
scripts/profiling_cami2_analysis.R
scripts/profiling_tool_comparision.R
scripts/comparison_-_withCONSULT-II.R
scripts/size_const_comparison-10kSpecies.R
scripts/kmer_ranking_comparison-10kSpecies.R
scripts/classification_comparison-10kSpecies.R
scripts/numgenomes_per_taxon-violinplot-10kSpecies.R
scripts/summary_analysis_cami2.R
scripts/weight_dist_simulations.R
scripts/query_info.R
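
The evaluation scripts above report TP/FP/TN/FN counts, which are then summarized as precision, recall, and F1. For readers who want to recompute these from the counts, the sketch below uses the standard definitions. It is not taken from the provided scripts: the function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the scripts may handle differently.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from confusion counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F1        = 2 * precision * recall / (precision + recall)

    A zero denominator yields 0.0 (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

True negatives do not enter any of the three metrics, which is why the function takes only TP, FP, and FN.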