Code and data from: Multiscale analysis and optimal glioma therapeutic candidate discovery using the CANDO platform
Data files
Jun 02, 2026 version files 1.96 GB
- Benchmarking_Glioma.csv (750 B)
- Benchmarking_Glioma.xlsx (21.84 KB)
- Bottom_24_predictions_filtered_n_geq4.xlsx (57.65 KB)
- canbenchmark.py (930 B)
- cando.py (336.98 KB)
- canpredict_compounds_and_top_targets.py (1.09 KB)
- drugbank-v2.9-approved.tsv (110.37 KB)
- drugbank-v2.9.tsv (763.57 KB)
- drugbank2ctd-v2.9.tsv (924.17 KB)
- Expanded_Benchmarking_results.xlsx (243.16 KB)
- Expanded_Benchmarking_results.zip (9.02 KB)
- filtered-alphafold-CxP-rd_ecfp4-v2.9-all.tsv (1.64 GB)
- Gold_standard_from_Genecard.xlsx (10.94 KB)
- Gold_standard_from_Table_2.xlsx (9.09 KB)
- Gold_standard_from_Uniprot.xlsx (10.48 KB)
- histograms.py (6.08 KB)
- JC_comparison_data_rank_leq10.xlsx (85.75 KB)
- JC_comparison_graph.py (2.90 KB)
- jc_line_graph_ranks.py (2.99 KB)
- overlap_frequency_ranks.py (3 KB)
- random_control.py (604 B)
- rd_ecfp4-int-dice-alphafold-homo_sapien-coach-c0.0-p0.0-CxP-approved.tsv (298.36 MB)
- README.md (17.54 KB)
- summary.py (7.02 KB)
- summary.xlsx (20.55 KB)
- Tanimoto.zip (24.38 MB)
- Targets_from_random_24_predictions.xlsx (61.34 KB)
- Targets_from_top_24_predictions.xlsx (58.38 KB)
Abstract
Glioma is a highly malignant brain tumor with limited treatment options. This dataset accompanies our study and contains the computed resources used with the Computational Analysis of Novel Drug Opportunities (CANDO) platform for multiscale therapeutic discovery to predict new glioma therapies. It includes the compound–protein interaction scores used to generate proteome-scale interaction signatures; ranked similarity and consensus lists (generated by the accompanying scripts) that model compound behavior across targets; and all benchmarking outputs measuring recovery of approved drugs at multiple cutoffs with several metrics and random-control comparisons across all indications. Compounds ranked highly by consensus but not previously associated with glioma are provided as new predictions together with literature corroboration tags, yielding 23 potential treatments (e.g., vitamin D, taxanes, vinca alkaloids, topoisomerase inhibitors, folic acid; investigational compounds include ginsenosides, chrysin, resiniferatoxin, cryptotanshinone). We also provide top-target tables and functional-annotation summaries highlighting proteins with the strongest predicted interactions to these compounds, including vitamin D3 receptor, thyroid hormone receptor, acetylcholinesterase, cyclin-dependent kinase 2, tubulin alpha chain, dihydrofolate reductase, and thymidylate synthase. Collectively, the files enable reproduction of our rankings and benchmarks, extension to alternative libraries or indications, and reuse for target/pathway interrogation using the same multitarget, multiscale framework that supported candidate identification in the paper.
Our Dryad repository provides the Python scripts and data files required to reproduce the results of our study.
Experimental data for this study were generated by running CANDO’s benchmarking and prediction functions on existing datasets derived from DrugBank, CTD, and AlphaFold mapping files. Specifically, we used the approved compound list drugbank-v2.9-approved.tsv and the matrix file rd_ecfp4-int-dice-alphafold-homo_sapien-coach-c0.0-p0.0-CxP-approved.tsv for canbenchmark.py, and the full compound list drugbank-v2.9.tsv together with the matrix file filtered-alphafold-CxP-rd_ecfp4-v2.9-all.tsv for canpredict_compounds_and_top_targets.py.
System requirements
All scripts were tested on Windows 11 (version 10.0.26100) with Python 3.12.8. They should run on most modern computers with a Python 3 installation.
Installation
Download cando.py and follow the installation instructions in the CANDO tutorial: https://github.com/ram-compbio/CANDO/blob/master/CANDO_tutorial.ipynb. The tutorial also documents the specific functions mentioned in this README and serves as a useful reference.
Change the value of the variables cmpd_map, ind_map, and matrix_file in each supporting Python script (canbenchmark.py and canpredict_compounds_and_top_targets.py) to match the location of your data files on your machine.
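The path setup described above might look like the following minimal sketch, which assumes all downloaded files sit in one folder. `DATA_DIR`, `BENCHMARK_INPUTS`, and `resolve_inputs` are hypothetical names introduced for illustration; only the three variable names and the file names come from this README. The mapping shown is for canbenchmark.py.

```python
from pathlib import Path

# Hypothetical location of the downloaded Dryad files; adjust to your machine.
DATA_DIR = Path("data")

# File names taken from this README (inputs for canbenchmark.py).
BENCHMARK_INPUTS = {
    "cmpd_map": "drugbank-v2.9-approved.tsv",
    "ind_map": "drugbank2ctd-v2.9.tsv",
    "matrix_file": "rd_ecfp4-int-dice-alphafold-homo_sapien-coach-c0.0-p0.0-CxP-approved.tsv",
}


def resolve_inputs(data_dir, names=BENCHMARK_INPUTS):
    """Return an absolute path for each input and fail early if one is missing."""
    resolved = {}
    for variable, file_name in names.items():
        path = Path(data_dir) / file_name
        if not path.is_file():
            raise FileNotFoundError(f"{variable}: {path} not found")
        resolved[variable] = str(path.resolve())
    return resolved
```

Calling `resolve_inputs(DATA_DIR)` before launching a long benchmarking run surfaces a mistyped path immediately rather than partway through the run.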
Core CANDO scripts:
canbenchmark.py: Implements the benchmarking procedure described in Methods 2.5 (“Benchmarking”).
This script evaluates the accuracy of the CANDO platform by comparing ranked similarity and consensus lists against known drug-indication associations, computing AIA, nAIA, IA, nIA, NDCG, and nNDCG metrics. It reproduces the benchmarking results shown in Figure 2 and Benchmarking_Glioma.xlsx and Benchmarking_Glioma.csv by reading the specified compound-proteome matrix (matrix_file), compound-indication mappings (cmpd_map, ind_map), and running canbenchmark, canbenchmark_ndcg, and canbenchmark_new.
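To illustrate one of the metrics named above, here is a generic sketch of NDCG at a rank cutoff with binary relevance (a compound counts as relevant if it shares the indication). This is the textbook formulation, not code taken from cando.py; `ndcg_at_k` is a hypothetical helper, and the implementation behind canbenchmark_ndcg may differ in its details.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Textbook NDCG@k with binary relevance (1 if the compound is relevant)."""
    relevant = set(relevant_ids)
    # Discounted cumulative gain of the observed ranking (ranks start at 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, cid in enumerate(ranked_ids[:k], start=1)
        if cid in relevant
    )
    # Ideal DCG: every relevant compound placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

A score of 1.0 means all relevant compounds occupy the top positions of the ranked list; hits further down the list contribute progressively less.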
canpredict_compounds_and_top_targets.py: Reproduces the drug-prediction and top-target analyses detailed in Methods 2.6–2.7 (“Generating drug predictions” and “Analyzing top targets and associated pathways”).
The script applies the CANDO platform to the glioma indication (MeSH:D005910), ranks compounds based on interaction-signature similarity, generates the top 100 putative glioma drug candidates, and outputs the top 100 predicted protein targets for each of the 24 high-corroboration compounds used in downstream overlap and Jaccard analyses (Figures 3–5, Tables 1–2).
Set cmpd_map, ind_map, and matrix_file in canbenchmark.py and canpredict_compounds_and_top_targets.py to match local data paths.
random_control.py: Implements the random-matrix generation procedure referenced in Methods 2.5 (“Benchmarking”).
It shuffles the compound–protein interaction matrix to create randomized controls for benchmarking, reproducing the random-control distributions compared against experimental results in Figure 2. To obtain the control AIA, IA, NDCG, and related values, generate three randomized matrices, run canbenchmark.py on each, and average the results.
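The randomization idea can be sketched as follows, assuming the matrix is held as a NumPy array with one row per compound and that values are permuted independently within each row. Both helper names are hypothetical, and random_control.py itself may read, shuffle, and write the matrix differently.

```python
import numpy as np


def shuffle_within_rows(matrix, seed):
    """Return a copy in which each row's values are permuted independently."""
    rng = np.random.default_rng(seed)
    return rng.permuted(np.asarray(matrix, dtype=float), axis=1)


def average_controls(metric_values):
    """Average one benchmarking metric over the randomized replicates."""
    return float(np.mean(metric_values))
```

Using a different seed for each of the three replicates keeps the controls independent while leaving each run reproducible.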
cando.py: Core module providing all CANDO platform functions described in Methods 2.3–2.5 (“Scoring compound-protein interactions,” “Calculating ranked compound similarity lists,” and “Benchmarking”).
It includes routines for reading compound-proteome matrices, computing all-against-all interaction-signature similarities using cosine distance, ranking compounds, generating consensus lists, and computing benchmarking metrics used by the above driver scripts.
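As a rough illustration of the similarity step, the sketch below computes all-against-all cosine distances between interaction signatures and ranks the neighbours of one compound. It assumes one signature per row; the function names are hypothetical and the code is not taken from cando.py.

```python
import numpy as np


def cosine_distance_matrix(signatures):
    """All-against-all cosine distance between row vectors (one row per compound)."""
    x = np.asarray(signatures, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero signatures
    unit = x / norms
    return 1.0 - unit @ unit.T


def rank_neighbours(distances, query_index):
    """Indices of all other compounds, most similar (smallest distance) first."""
    order = np.argsort(distances[query_index], kind="stable")
    return [int(i) for i in order if i != query_index]
```

Because interaction scores are non-negative, the resulting distances fall between 0 (identical direction) and 1 (no shared signal).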
Core CANDO data files:
filtered-alphafold-CxP-rd_ecfp4-v2.9-all.tsv: Comprehensive compound–proteome interaction matrix generated using the BANDOCK protocol (Methods 2.3), containing interaction scores between all 13,457 DrugBank-derived compounds and 20,295 AlphaFold2-predicted Homo sapiens proteins.
- Columns: Each column corresponds to a protein target (UniProt identifier).
- Rows: Each row represents a compound (DrugBank identifier).
- Values: BANDOCK interaction scores (0–1) computed as the product of COACH binding-site confidence and Sørensen–Dice chemical similarity.
Used as the default full-library input for canpredict_compounds_and_top_targets.py to generate drug-prediction and top-target analyses.
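The scoring rule stated above (confidence multiplied by Sørensen–Dice similarity) can be illustrated with a simplified sketch. It assumes fingerprints are given as sets of "on" bit indices and that the similarity is taken between the query compound and a ligand associated with the predicted binding site; both are assumptions made for illustration, and the function names are hypothetical. The deposited matrices use integer ECFP4 vectors, so the actual BANDOCK computation may differ.

```python
def dice_similarity(bits_a, bits_b):
    """Sørensen–Dice coefficient between two sets of fingerprint 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def interaction_score(site_confidence, bits_compound, bits_ligand):
    """Binding-site confidence multiplied by compound-to-ligand Dice similarity."""
    return site_confidence * dice_similarity(bits_compound, bits_ligand)
```

Since both factors lie in the 0–1 range, the product does too, which matches the value range documented for the matrix.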
rd_ecfp4-int-dice-alphafold-homo_sapien-coach-c0.0-p0.0-CxP-approved.tsv: Subset of the above interaction matrix restricted to 2,449 approved DrugBank compounds and the same AlphaFold2 Homo sapiens protein library.
- Structure: Identical row/column format and scoring metric as the full matrix.
- Purpose: Used by canbenchmark.py for benchmarking performance of the CANDO pipeline against known drug-indication associations (Figures 2A–2D).
drugbank-v2.9-approved.tsv: Mapping file of all approved DrugBank compounds included in the CANDO benchmarking dataset (Methods 2.2).
Columns:
- CANDO_ID — Internal CANDO compound identifier.
- DRUGBANK_ID — DrugBank accession for the compound.
- GENERIC_NAME — Non-proprietary compound name.
- DRUG_GROUPS — Drug status/group labels (includes “approved”).
- Purpose: Imported by canbenchmark.py to define the approved compound subset for benchmarking analyses.
drugbank-v2.9.tsv: Comprehensive compound metadata file containing all approved, experimental, and investigational DrugBank entries used in the full compound-proteome matrix (Methods 2.2).
- Columns: Same as drugbank-v2.9-approved.tsv.
- Purpose: Input for canpredict_compounds_and_top_targets.py to identify and rank candidate drugs from the full compound library.
drugbank2ctd-v2.9.tsv
Drug-indication mapping file linking DrugBank compound identifiers to Comparative Toxicogenomics Database (CTD) disease indications (Methods 2.2, 2.5).
Columns:
- CANDO_ID — Internal CANDO compound identifier.
- INDICATION_NAME — Disease/indication name from CTD.
- MESH_ID — MeSH identifier for the indication.
- INDICATION_ID — CTD indication identifier.
- Purpose: Used by both canbenchmark.py and canpredict_compounds_and_top_targets.py to associate compounds with disease indications (e.g., glioma = MESH:D005910) for benchmarking and prediction tasks.
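A small sketch of how the mapping could be queried for one indication, using only the standard library. It assumes the file is tab-delimited with a header row carrying the column names listed above and that MESH_ID values include the `MESH:` prefix; the rows in `example` are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def compounds_for_indication(tsv_text, mesh_id):
    """Return the CANDO_IDs mapped to one MeSH indication, in file order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["CANDO_ID"] for row in reader if row["MESH_ID"] == mesh_id]


# Tiny made-up example in the documented column layout (not real data).
example = (
    "CANDO_ID\tINDICATION_NAME\tMESH_ID\tINDICATION_ID\n"
    "0\tGlioma\tMESH:D005910\t1\n"
    "1\tAsthma\tMESH:D001249\t2\n"
    "2\tGlioma\tMESH:D005910\t1\n"
)
print(compounds_for_indication(example, "MESH:D005910"))  # ['0', '2']
```

To run it on the real file, read the file contents with `open(path, encoding="utf-8").read()` and pass the text in place of `example`.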
Overlap analyses scripts and data:
These scripts were used to perform overlap analyses evaluating the alignment between predicted protein targets from the CANDO platform and curated glioma “gold standard” protein libraries (from UniProt, GeneCards, and literature sources compiled in Table 2 of the manuscript). Together, they quantify the overlap, frequency distributions, and Jaccard coefficients between predicted and known glioma-associated proteins across top, random, and bottom-ranked drug predictions, and visualize these results through summary tables and publication-ready figures.
summary.py: Generates the master summary table (summary.xlsx) containing overlap statistics between each gold standard set and predicted targets from the top 24, random 24, and bottom 24 drug predictions. For each comparison, it calculates the number of targets in each set, intersection and union sizes, overlap frequency, and Jaccard coefficients across rank and score thresholds. This summary file serves as the input for all subsequent visualization scripts.
overlap_frequency_ranks.py: Produces line graphs showing the frequency of overlap (percentage of gold standard proteins captured) as a function of rank cutoff (≤10–100) for each gold standard dataset. This allows visualization of how overlap increases with ranking depth across top, random, and bottom prediction sets.
jc_line_graph_ranks.py: Computes and plots the Jaccard coefficient (intersection / union) as a function of rank cutoff for each gold standard reference. This assesses the relative agreement between predicted and known targets across rank thresholds, and exports the results to jc_vs_rank_cutoff.xlsx.
histograms.py: Generates histogram-based bar plots showing the relative frequency of gold standard proteins occurring within rank bins (1–20, 21–40, …, 81–100) across the top, random, and bottom prediction groups. This provides a distributional view of target enrichment at different ranking levels and exports supporting data to histograms_raw_data.xlsx.
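The binning behind such a histogram can be sketched as below. The sketch assumes relative frequency means the share of gold-standard hits (within ranks 1–100) that fall into each bin; histograms.py may normalize differently, and `rank_bin_frequencies` is a hypothetical name.

```python
from collections import Counter


def rank_bin_frequencies(gold_ranks, bin_width=20, max_rank=100):
    """Relative frequency of gold-standard hits per rank bin (1-20, 21-40, ...)."""
    n_bins = max_rank // bin_width
    labels = [f"{i * bin_width + 1}-{(i + 1) * bin_width}" for i in range(n_bins)]
    # Ignore ranks outside the plotted range.
    in_range = [r for r in gold_ranks if 1 <= r <= max_rank]
    counts = Counter((r - 1) // bin_width for r in in_range)
    total = len(in_range)
    return {
        label: (counts.get(i, 0) / total if total else 0.0)
        for i, label in enumerate(labels)
    }
```

The input is the list of ranks at which gold-standard proteins appear in one prediction group (top, random, or bottom).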
JC_comparison_graph.py: Compares Jaccard coefficients across multiple disease indications using aggregated overlap data (JC_comparison_data_rank_leq10.xlsx). It creates side-by-side bar plots comparing JC values between top 24 and top 100 predicted targets for glioma and other disease indications.
random_control.py: Generates randomized control matrices by shuffling compound–protein interaction scores within each compound entry. These randomized matrices serve as negative controls, so that benchmarking and overlap results can be compared against non-biological random expectations.
summary.xlsx
This file compiles the computed overlap statistics between predicted and gold-standard protein sets, generated by summary.py and subsequently used by the visualization scripts (overlap_frequency_ranks.py, jc_line_graph_ranks.py).
Columns:
- Experiment: Text label describing the comparison (e.g., “GS from UniProt vs. targets from top24 (rank≤50)”).
- Targets in GS: Number of proteins in the gold standard dataset.
- Targets in actual / control: Number of predicted proteins in the evaluated drug set (top, random, or bottom).
- Overlap (intersection): Number of shared proteins between the predicted set and the gold standard.
- Overlap (frequency): Proportion of gold-standard proteins captured by the predictions.
- Union: Total number of unique proteins across both sets.
- JC: Jaccard coefficient (intersection / union), quantifying overlap similarity.
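The columns above follow directly from set operations on protein identifiers. The sketch below computes one row from two identifier collections; `overlap_row` is a hypothetical helper written for illustration and is not taken from summary.py.

```python
def overlap_row(gold_ids, predicted_ids):
    """Compute the summary statistics for one gold-standard vs. prediction comparison."""
    gold, predicted = set(gold_ids), set(predicted_ids)
    intersection = gold & predicted
    union = gold | predicted
    return {
        "Targets in GS": len(gold),
        "Targets in actual / control": len(predicted),
        "Overlap (intersection)": len(intersection),
        # Share of gold-standard proteins recovered by the predictions.
        "Overlap (frequency)": len(intersection) / len(gold) if gold else 0.0,
        "Union": len(union),
        # Jaccard coefficient: intersection size divided by union size.
        "JC": len(intersection) / len(union) if union else 0.0,
    }
```

For a rank-cutoff comparison such as rank≤50, filter the predicted targets to that cutoff before passing their identifiers in.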
Predicted target files:
These files share identical structure and were used to compare CANDO-predicted protein targets across top, random, and bottom-ranked drug prediction sets.
- Targets_from_top_24_predictions.xlsx
- Targets_from_random_24_predictions.xlsx
- Bottom_24_predictions_filtered_n_geq4.xlsx
Columns:
- rank: Integer rank assigned to each predicted protein target (1 = highest).
- score: Similarity score between the compound and protein target.
- id: UniProt identifier of the predicted protein target.
Gold standard reference files:
These files contain curated lists of glioma-associated proteins used as “gold standards” for overlap benchmarking.
- Gold_standard_from_Table_2.xlsx
- Gold_standard_from_Uniprot.xlsx
- Gold_standard_from_Genecard.xlsx
Columns:
- id: UniProt identifier of the known glioma-associated protein.
Jaccard comparison file:
Contains precomputed overlap and union values used to calculate Jaccard coefficients across disease indications.
- JC_comparison_data_rank_leq10.xlsx
Columns:
- Top 24 Overlap / Top 24 Union: Counts of overlapping and total proteins for the top 24 predictions per indication.
- Top 100 Overlap / Top 100 Union: Counts of overlapping and total proteins for the top 100 predictions per indication.
Tanimoto.zip:
This Dryad package contains the scripts and outputs used to benchmark a 2D ligand-similarity pipeline within the CANDO platform. In this workflow, drug–drug similarity scores are computed using the Tanimoto coefficient (based on ECFP4 circular fingerprints) to produce a compound–compound similarity matrix, which is then used as input for CANDO benchmarking across all indications and for glioma-specific analyses. The deposited code also includes the variant used to suppress near-duplicate analog relationships and the fusion implementation that combines ligand- and proteomic-signature–based similarities. Collectively, these files reproduce the ligand, filtered-ligand, and fusion benchmarking results reported and discussed in our manuscript, and enable comparison against the proteomic pipeline under the same benchmarking framework.
ECFP4_Tanimoto_matrix.py
Python script to generate an NxN ECFP4/Tanimoto similarity matrix for the approved-drug library. 1) Loads RDKit molecules from MOL2 (Chem.MolFromMol2File(..., sanitize=True, removeHs=True)). 2) Computes ECFP4 as a Morgan fingerprint (radius=2) bit vector (2048 bits). 3) Computes row-wise Tanimoto similarities with DataStructs.BulkTanimotoSimilarity.
Inputs:
- Approved drug list (drugbank-v2.9-approved.tsv).
- MOL2 structure directory (files named by CANDO_ID, e.g., 1234.mol2).
Outputs:
- ecfp4_tanimoto_matrix.tsv (tab-delimited, no header; diagonal set to 1.0).
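The matrix construction can be illustrated without RDKit by treating each fingerprint as a set of "on" bit indices. This pure-Python sketch is a stand-in for the RDKit calls named above, not the deposited script; the function names and the four-decimal output format are assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either set."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tanimoto_matrix(fingerprints):
    """Symmetric NxN similarity matrix with the diagonal set to 1.0."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = tanimoto(fingerprints[i], fingerprints[j])
    return matrix


def to_tsv(matrix):
    """Serialize as tab-delimited text with no header row."""
    return "\n".join("\t".join(f"{v:.4f}" for v in row) for row in matrix)
```

The order of `fingerprints` determines the row and column order, so it should follow the order of the approved-drug list.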
ecfp4_tanimoto_matrix.tsv
The unfiltered ECFP4/Tanimoto similarity matrix generated by ECFP4_Tanimoto_matrix.py.
Structure:
- Square NxN matrix of Tanimoto similarity values (diagonal = 1.0).
- Row/column ordering matches the approved-drug list order used during matrix generation (drugbank-v2.9-approved.tsv).
Usage:
- Input to CANDO as a read_dists file to benchmark the unfiltered ligand-similarity pipeline (loaded with similarity=True).
ECFP4_Tanimoto_matrix_unique_98.py
Python script to generate a “uniqueness-filtered” ECFP4/Tanimoto similarity matrix where the most extreme similarities (top 2% per row) are suppressed (set to 0). Computes the unfiltered similarity row with BulkTanimotoSimilarity. For each query compound i, computes the 98th percentile of similarities to all other compounds (np.quantile(..., 0.98), excluding self), and sets similarities ≥ that cutoff to 0.0 (while keeping self-similarity at 1.0).
Inputs: Same as the unfiltered generator (approved TSV + MOL2 directory).
Outputs:
- ecfp4_tanimoto_matrix_unique_98.tsv (same format as the unfiltered matrix).
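The per-row filter described above can be sketched on an in-memory similarity matrix as follows. The sketch assumes a square NumPy array with at least two compounds and uses np.quantile with its default linear interpolation; `suppress_top_similarities` is a hypothetical name and the deposited script may handle ties or I/O differently.

```python
import numpy as np


def suppress_top_similarities(similarity, quantile=0.98):
    """Zero out, per row, similarities at or above that row's quantile cutoff."""
    sim = np.asarray(similarity, dtype=float)
    filtered = sim.copy()
    for i in range(sim.shape[0]):
        others = np.delete(sim[i], i)       # exclude self-similarity
        cutoff = np.quantile(others, quantile)
        mask = sim[i] >= cutoff
        mask[i] = False                     # never suppress the diagonal
        filtered[i, mask] = 0.0
        filtered[i, i] = 1.0                # keep self-similarity at 1.0
    return filtered
```

Because each row gets its own cutoff, the filtered matrix is not guaranteed to stay symmetric.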
ecfp4_tanimoto_matrix_unique_98.tsv
The filtered ECFP4/Tanimoto similarity matrix generated by ECFP4_Tanimoto_matrix_unique_98.py, with per-row similarities in the ≥98th percentile set to 0.0.
Usage:
- Input to CANDO as read_dists for the Tan_filter ligand-similarity benchmark (the “near-duplicate analog suppression” condition).
ECFP4_Tanimoto_benchmark.py
Python script to run CANDO benchmarking using an ECFP4/Tanimoto read_dists matrix (unfiltered or filtered). Initializes a CANDO object with compute_distance=False, loads the matrix via read_dists=..., and treats values as similarities (similarity=True).
Usage:
- Produces similarity-list benchmarking metrics via CANDO’s benchmarking functions (e.g., canbenchmark, canbenchmark_ndcg) and can be paired with consensus benchmarking via canbenchmark_new in cando.py.
data_fusion_MUL.py
Python script implementing the fusion pipeline used in the paper (product of proteomic- and ligand-derived similarity signals). It builds separate CANDO objects for (1) Tanimoto, (2) proteomic-signature distances, and (3) a fused object seeded from the Tanimoto read_dists. Distances are converted to similarities (s = 1 - d), fused by multiplication (s_fused = s_tan * s_prot), and converted back to distances (d_fused = 1 - s_fused).
Inputs:
- Ligand read_dists matrix (unfiltered ecfp4_tanimoto_matrix.tsv).
- Proteomic compound–protein interaction signature matrix (CANDO matrix input, see above).
- Drug and indication mapping files (C_MAP, I_MAP).
Output: A fused similarity list per compound (used for downstream benchmarking in the same framework as the constituent pipelines).
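The fusion arithmetic (s = 1 - d, multiply, convert back) can be sketched on two plain distance matrices. This assumes both matrices share the same compound ordering and hold distances in the 0–1 range; `fuse_distances` is a hypothetical helper that operates on arrays rather than on CANDO objects as the deposited script does.

```python
import numpy as np


def fuse_distances(d_tanimoto, d_proteomic):
    """Fuse two distance matrices via the product of their similarities."""
    d_tan = np.asarray(d_tanimoto, dtype=float)
    d_prot = np.asarray(d_proteomic, dtype=float)
    if d_tan.shape != d_prot.shape:
        raise ValueError("both matrices must cover the same compounds")
    s_fused = (1.0 - d_tan) * (1.0 - d_prot)  # s_fused = s_tan * s_prot
    return 1.0 - s_fused                      # d_fused = 1 - s_fused
```

With a product, a compound pair ranks as close only when both the ligand and the proteomic pipeline consider it similar; a zero similarity in either one yields the maximum fused distance of 1.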
Tanimoto_Distributions.xlsx
A summary table of binned pairwise ECFP4/Tanimoto similarities for three populations: all-vs-all, glioma-vs-all, and glioma-vs-glioma. For each similarity bin (e.g., 0.0–0.1, 0.1–0.2, …), the sheet reports the proportion and count of compound pairs in that bin for each population. These distributions support interpretation of glioma benchmarking behavior by showing whether glioma drugs are relatively more structurally diverse than the broader library (i.e., high-similarity “analog-heavy” pairs are depleted in the glioma–glioma pool).
Expanded_Benchmarking_results.xlsx
This file contains benchmarking results comparing experimental CANDO performance against randomized control distributions across multiple recovery metrics and ranking cutoffs. In response to reviewer feedback, we expanded benchmarking analyses to include ECFP4/Tanimoto ligand-similarity benchmarking, a filtered Tanimoto pipeline suppressing near-duplicate analog relationships, and a fusion approach integrating ligand- and proteomic-signature similarities. Together, these analyses provide benchmarking context beyond random controls and enable comparison of proteomic, ligand-based, and combined approaches within the same framework. The Expanded_Benchmarking_results.zip archive contains the same results as unformatted CSV files.
License
This project is licensed under the CC0 1.0 Universal (Public Domain Dedication).
You may freely use, modify, and distribute this work without restriction.
