Data from: Memory-bound k-mer selection for large and evolutionary diverse reference libraries
Data files
Sep 10, 2024 version, files: 110.19 GB
- human_pangenome-lib_rand_free-k29_w34_h13_b16_s8.tar.gz (2.94 GB)
- README.md (8.79 KB)
- refseq_cami2-lib_rand_free-k30_w35_h14_b16_s9.tar.gz (24.24 GB)
- refseq_cami2-lib_reps_adpt-k30_w35_h14_b16_s9.tar.gz (24.69 GB)
- shared-files.tar (250.44 MB)
- shared-files.tar.gz (48.03 MB)
- wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz (6.46 GB)
- wol_v1-lib_rand_free-k30_w35_h14_b16_s9.tar.gz (22.51 GB)
- wol_v1-lib_reps_adpt-k29_w35_h13_b16_s8.tar.gz (6.57 GB)
- wol_v1-lib_reps_adpt-k30_w35_h14_b16_s9.tar.gz (22.48 GB)
Abstract
Using k-mers to find sequence matches is increasingly common in bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. Reference k-mers are kept in memory at query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of k-mers present in an ultra-large dataset to include in a library such that the classification of reads suffers the least. Our experiments demonstrate the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called KRANK (K-mer RANKer) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive-hashing classifier CONSULT-II to build a taxonomic classification and profiling method. On several benchmarks, KRANK k-mer selection dramatically reduces memory consumption with minimal loss in classification accuracy. We show in extensive analyses based on CAMI benchmarks that KRANK outperforms k-mer-based alternatives in terms of taxonomic profiling and comes close to the best marker-based methods in terms of accuracy.
Data belonging to the following paper:
- Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound k-mer selection for large and evolutionary diverse reference libraries. Genome Research.
- Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound and taxonomy-aware k-mer selection for Ultra-large reference libraries. In J. Ma (Ed.), Research in Computational Molecular Biology (pp. 340–343). Springer Nature Switzerland. https://doi.org/10.1007/978-1-0716-3989-4_26
See https://ter-trees.ucsd.edu/data/krank/ for a catalog of libraries and the query reads that we simulated for benchmarking. Descriptions of the libraries and a tutorial can be found in the main GitHub repository.
KRANK libraries
- wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz: WoL-v1 library, lightweight (6.25 GB), with random selection (fast-mode)
- wol_v1-lib_rand_free-k30_w35_h14_b16_s9.tar.gz: WoL-v1 library, high-sensitivity (25 GB), with random selection (fast-mode)
- wol_v1-lib_reps_adpt-k29_w35_h13_b16_s8.tar.gz: WoL-v1 library, lightweight (6.25 GB), with ranking (selective-mode)
- wol_v1-lib_reps_adpt-k30_w35_h14_b16_s9.tar.gz: WoL-v1 library, high-sensitivity (25 GB), with ranking (selective-mode)
- human_pangenome-lib_rand_free-k29_w34_h13_b16_s8.tar.gz: a human pangenome library (3 GB), with random selection (fast-mode)
- refseq_cami2-lib_rand_free-k30_w33_h14_b16_s9.tar.gz: CAMI-II library, high-sensitivity (25 GB), with random selection (fast-mode)
- refseq_cami2-lib_reps_adpt-k30_w33_h14_b16_s9.tar.gz: CAMI-II library, high-sensitivity (25 GB), with ranking (selective-mode)
To reproduce the CAMI-II profiling results presented in the paper, query against both the fast-mode and selective-mode libraries at the same time. The human pangenome library was not used in the paper and is intended mainly for contamination removal.
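Because reproducing the CAMI-II results requires pairing a fast-mode library with its selective-mode counterpart, it can help to parse the archive names programmatically. The following is a minimal sketch, assuming the naming pattern that can be inferred from the archive names above (`<dataset>-lib_<mode tag>-<letter><number>_...`). The pattern, the `LibraryName` type, and the helper functions are illustrative assumptions and not part of KRANK. The single-letter parameter tags are kept as opaque keys, since their meaning is described in the KRANK repository rather than here.

```python
import re
from typing import Dict, NamedTuple

# Pattern inferred from the archive names listed above. This is an assumption,
# not a documented KRANK naming specification.
_NAME_RE = re.compile(
    r"^(?P<dataset>.+)-lib_(?P<mode>[a-z]+_[a-z]+)-"
    r"(?P<params>[a-z]\d+(?:_[a-z]\d+)*)(?:\.tar\.gz)?$"
)

# Mode labels as given in the library descriptions above.
_MODE_LABELS = {"rand_free": "fast-mode", "reps_adpt": "selective-mode"}


class LibraryName(NamedTuple):
    dataset: str            # e.g. "wol_v1"
    mode: str               # "fast-mode", "selective-mode", or the raw tag
    params: Dict[str, int]  # single-letter parameter tags and their values


def parse_library_name(name: str) -> LibraryName:
    """Split an archive name into dataset, mode, and parameter values."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized library name: {name!r}")
    params = {tok[0]: int(tok[1:]) for tok in match["params"].split("_")}
    mode = _MODE_LABELS.get(match["mode"], match["mode"])
    return LibraryName(match["dataset"], mode, params)


def is_matching_pair(a: LibraryName, b: LibraryName) -> bool:
    """True if a and b share dataset and parameters but differ in mode."""
    return a.dataset == b.dataset and a.params == b.params and a.mode != b.mode


fast = parse_library_name("refseq_cami2-lib_rand_free-k30_w33_h14_b16_s9.tar.gz")
selective = parse_library_name("refseq_cami2-lib_reps_adpt-k30_w33_h14_b16_s9.tar.gz")
print(fast.mode, selective.mode, is_matching_pair(fast, selective))
```

A check like `is_matching_pair` only confirms that two archives were built with the same parameters; how to pass both libraries to a query is covered by the tutorial in the GitHub repository.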
KRANK-0.5.1.tar.gz
Description: This archive includes KRANK software version v0.5.1, together with scripts for testing and a tutorial.
shared-files.zip
Description: This archive includes all scripts used to evaluate read classification and profiling accuracy, and the resulting metrics and figures together with the scripts that were used to generate them. It also provides taxonomy information, download links for query genomes, and other auxiliary data such as query-reference distance information.
Results and evaluation metrics
- results/cscores-10kSpecies_-_combined.csv: taxonomic classification results for all tools on WoL queries
- results/all_tools-profiling_evaluation-CAMI1_hc.tsv: abundance profiling metrics of all tools on the CAMI-I high-complexity dataset
- results/cscores-10kSpecies_-_with_sizes.csv: taxonomic classification results for all tools on WoL queries, with the library size of each tool
- results/resultsCAMI2-marine.tsv: abundance profiling metrics of all tools on the CAMI-II marine dataset
- results/cscores-10kSpecies_-_KRANK-candidates.csv: comparison of different sizes and parameters for KRANK on WoL taxonomic classification
- results/cscores-10kSpecies_-_Kraken-II_4Gb.csv: taxonomic classification results of Kraken-II on WoL using 4 GB
- results/resultsCAMI2-strain_madness.tsv: abundance profiling metrics of all tools on the CAMI-II strain-madness dataset
- results/cscores-10kSpecies_-_CLARK.csv: taxonomic classification results of CLARK on WoL
- results/cscores-10kSpecies_-_Kraken-II_16Gb.csv: taxonomic classification results of Kraken-II on WoL using 16 GB
- results/cscores-10kSpecies_-_CONSULT-II.csv: taxonomic classification results of CONSULT-II on WoL using the default configuration
- results/running_times-query.tsv: query and library construction running times
- results/cscores-10kSpecies_-_KRANK-rankingkmers_comparison.csv: comparison of different heuristics for KRANK's selection on WoL taxonomic classification
- results/cscores-10kSpecies_-_Kraken-II_default.csv: taxonomic classification results of Kraken-II on WoL using the default parameters
- results/cscores-10kSpecies_-_KRANK-sizeconst_comparison.csv: comparison of different heuristics for KRANK's size constraint on WoL taxonomic classification
Auxiliary data, dataset descriptions and taxonomy information
- data/ReferenceTaxonomy-nodes.dmp.gz: WoL-v1 taxonomy nodes
- data/ref_taxa_counts.txt: genome counts for each taxon in WoL-v1, with rank information
- data/10kBacteria-metadata.tsv: WoL-v1 metadata, including download links and additional information for reference genomes
- data/query_genomes_list.txt: IDs of genomes used in read classification on WoL (simulated reads are available via the catalog at https://ter-trees.ucsd.edu/data/krank/)
- data/ref_genome_counts: genome counts for each taxon in WoL-v1
- data/ReferenceTaxonomyRWoL-nodes.dmp.gz: WoL-v1 taxonomy nodes reduced to the species set
- data/taxonomy_lookup: taxonomy lookup table used by CONSULT-II; parent list of each taxon
- data/dist_wrt_lastcommonrank.csv: Jaccard similarity between randomly sampled genomes and their corresponding groups
- data/query_ranks.tsv: taxonomy information for query genomes; ground truth for the evaluation of taxonomic classification
- data/reference_genomes_list: reference genomes and corresponding species
- data/uDance-ranks_tid.tsv: WoL-v2 taxonomic ranks; some queries were retrieved from here
- data/10kBacteria-ranks_tid.tsv: WoL-v1 taxonomic ranks for all genomes in the reference library
- data/dist_to_closest.txt: closest reference genome of each query genome and their genomic distance as estimated by Mash
- data/download-links/all_download.txt: all download links for WoL-v2 used in uDance
- data/download-links/download_final_extra_queries.txt: download links for genomes that are not used in the CONSULT-II paper
- data/download-links/genomes_uniq_uDance.txt
- data/download-links/downloads-uDance_exc10k.txt
- data/auxiliary/sampleg_dists.txt
- data/auxiliary/dist-extra-to-closest.txt
- data/auxiliary/uDance_exc10k-ranks_tid.tsv
- data/auxiliary/uDance-genera_list.txt
- data/auxiliary/uDance-species_list.txt
- data/auxiliary/uDance_oneperfamily-ranks_tid.tsv
- data/auxiliary/uDance_exc10k-order_info
- data/auxiliary/closest_taxon_wrank.txt
- data/auxiliary/dist-bacteria-to-closest.txt
- data/auxiliary/uDance_exc10k-ranks_tid-downloadable.tsv
- data/auxiliary/dist-to-closest.txt
- data/auxiliary/dist-archaea-to-closest.txt
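The taxonomy nodes files and the parent list in `data/taxonomy_lookup` both encode child-to-parent relationships. As an illustration of how such a file can be turned into a lineage for a taxon, here is a minimal sketch. It assumes that the `*-nodes.dmp.gz` files follow the usual NCBI taxdump layout (fields separated by `|`, with taxon ID, parent ID, and rank as the first three fields); that layout is an assumption about these files, and `read_nodes` and `lineage` are hypothetical helpers, not part of the provided scripts. The sample records are toy data.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_nodes(handle: TextIO) -> Dict[str, Tuple[str, str]]:
    """Map each taxon ID to (parent ID, rank) from NCBI-style nodes.dmp lines."""
    nodes: Dict[str, Tuple[str, str]] = {}
    for line in handle:
        fields = [field.strip() for field in line.rstrip("\n").split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        nodes[fields[0]] = (fields[1], fields[2])
    return nodes


def lineage(taxon: str, nodes: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return the path from a taxon up to the root (root is its own parent)."""
    path = [taxon]
    seen = {taxon}
    while True:
        parent = nodes[path[-1]][0]
        if parent in seen or parent not in nodes:
            break  # reached the root, a missing parent, or a cycle
        path.append(parent)
        seen.add(parent)
    return path


# Toy records in the assumed layout. For a real file one would instead use
#   with gzip.open("data/ReferenceTaxonomy-nodes.dmp.gz", "rt") as handle: ...
sample = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "2\t|\t1\t|\tsuperkingdom\t|\n"
    "1224\t|\t2\t|\tphylum\t|\n"
)
nodes = read_nodes(sample)
print(lineage("1224", nodes))
```

Keeping the IDs as strings avoids assumptions about whether every identifier in the files is numeric.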
Scripts used to do the empirical evaluation and create figures
- scripts/construct_taxonomy_lookup.py: constructs the taxonomy lookup table for CONSULT-II from a taxonomy nodes file
- scripts/shrink_taxdump.py: given taxonomy nodes and names files and a set of species, reduces the taxonomy to the species of interest
- scripts/evaluate_CLARK.py: custom script to evaluate the read classification output of CLARK; computes TP/FP/TN/FN for each rank and each read
- scripts/evaluate_KRANK.py: custom script to evaluate the read classification output of KRANK; computes TP/FP/TN/FN for each rank and each read
- scripts/evaluate_CONSULTII.py: custom script to evaluate the read classification output of CONSULT-II; computes TP/FP/TN/FN for each rank and each read
- scripts/evaluate_KrakenII.py: custom script to evaluate the read classification output of Kraken-II; computes TP/FP/TN/FN for each rank and each read
- scripts/summarize_evaluations.py: summarizes TP/FP/TN/FN counts across ranks and genomes; should be used with the output of the evaluate_*.py scripts above
- scripts/prepprocess_methods_psummary.py: uses distances in dist/dist_to_closest.txt to compute F1/precision/recall for different distance levels
- scripts/prepprocess_methods_summary.py: uses distances in dist/dist_to_closest.txt to compute F1/precision/recall across different novelty bins
- scripts/prepprocess_methods_csummary.py: uses distances in dist/dist_to_closest.txt to compute F1/precision/recall across taxon sizes
- scripts/match_closest_taxon.py
- scripts/dist_wrt_lastcommonrank.py
- scripts/get_taxa_count.py
- scripts/count_taxa.sh
- scripts/find_closest_taxon.sh
- scripts/resource_benchmarking.R
- scripts/shared_kmers_analysis.R
- scripts/profiling_cami2_analysis.R
- scripts/profiling_tool_comparision.R
- scripts/comparison_-_withCONSULT-II.R
- scripts/size_const_comparison-10kSpecies.R
- scripts/kmer_ranking_comparison-10kSpecies.R
- scripts/classification_comparison-10kSpecies.R
- scripts/numgenomes_per_taxon-violinplot-10kSpecies.R
- scripts/summary_analysis_cami2.R
- scripts/weight_dist_simulations.R
- scripts/query_info.R
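The evaluation scripts listed above report TP/FP/TN/FN counts, and the summary scripts derive F1, precision, and recall from them. For readers who want to sanity-check numbers in the results files, the sketch below shows the textbook definitions of these metrics. It is an assumption that the provided scripts use exactly these definitions (for example, in how unclassified reads are counted), and the `Counts` type and the example numbers are made up for illustration.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def precision(c: Counts) -> float:
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for a single rank, for illustration only.
example = Counts(tp=8, fp=2, tn=5, fn=8)
print(precision(example), recall(example), f1(example))
```

Note that TN does not enter any of the three metrics, which is why they are commonly reported when negatives are abundant or loosely defined.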