Data and code from: The shared selection landscape of dog and human cancers
Data files
Nov 24, 2025 version; files total 17.92 GB
- data_processing.tgz (58.95 MB)
- machine_learning_models.tgz (17.86 GB)
- mutational_signatures.tgz (6.44 MB)
- README.md (15.22 KB)
Abstract
Genomics-guided therapies have transformed clinical outcomes for some cancer types, but development of new treatments remains slow. Pet dogs are an underutilized and potentially powerful model for therapeutic innovation. Here, we systematically evaluate genomic similarity between dog and human tumors using the largest comparative cancer dataset assembled to date: 15,315 orthologous genes in 429 dog tumors and 14,966 human tumors across 39 different cancer types. We find that cancers in dogs and humans are genomically almost indistinguishable, with shared mutational signatures and recurrent mutations in the same genes. Some cancer driver genes are more frequently mutated in dogs than in humans, creating an opportunity to recruit large cohorts for clinical trials. Tumors from dogs and humans are, on average, as genomically similar as tumors from different human cancer types, and tumors do not separate by species in unsupervised clustering. Supervised machine learning identified shared genomic features between dog and human tumors. Even so, consistent with the genomic heterogeneity of both dog and human cancers, there was no dog cancer type for which all tumors were classified to a single human type. Our findings establish dogs as a compelling system for the development of targeted cancer therapies and illustrate the potential for using machine learning to discover models for human cancers in dogs and other species.
Dataset DOI: 10.5061/dryad.vhhmgqp68
Description of the data and file structure
Contact information
Elinor Karlsson, PhD
UMass Chan Medical School & Broad Institute of MIT and Harvard
elinor@broadinstitute.org
Diane Genereux, PhD
Broad Institute of MIT and Harvard
genereux@broadinstitute.org
Dataset overview
This dataset contains all data necessary to recreate the analyses presented in Genereux & Megquier, et al., "Genomic matching to optimize dog cancer as a model for targeted therapeutics" (in review).
Included are:
- Data processing: Feature values for the somatic mutations in each sample and the orthologous regions defined between CanFam3, hg38, and hg19.
- Mutational signatures: Trinucleotide contexts for each sample as well as the trinucleotide opportunities for each genome.
- Machine learning models: Output of each of the machine learning models trained on human or canine data.
Files and variables
To extract downloaded files, use the Linux command:
tar -xvzf filename.tgz
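On systems without a `tar` command, the same extraction can be sketched with Python's standard `tarfile` module. This is a minimal sketch, not part of the authors' pipeline: the archive names come from the file list above, while the `extracted` output directory and the `extract_archive` helper are hypothetical names chosen for illustration.

```python
import tarfile
from pathlib import Path

# Archive names as listed for this dataset.
ARCHIVES = [
    "data_processing.tgz",
    "mutational_signatures.tgz",
    "machine_learning_models.tgz",
]


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names


if __name__ == "__main__":
    for name in ARCHIVES:
        if Path(name).exists():
            # "extracted" is a hypothetical output directory.
            members = extract_archive(name, dest_dir="extracted")
            print(f"{name}: extracted {len(members)} entries")
        else:
            print(f"{name}: not found in the current directory, skipping")
```

Note that machine_learning_models.tgz is 17.86 GB compressed, so extraction needs at least that much free disk space.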
File: data_processing.tgz
Description: This directory contains the following files relating to feature values, somatic mutations, and orthologous regions.
- Master feature file
  - all_feature_values.txt.gz: This file contains the feature values for all features in all samples considered in our analyses, in tab-delimited format.
- Canine mutation calls
  - canine_mutations_merged_orthologous.vcf.gz: A VCF file containing somatic mutation calls in the orthologous regions for all canine samples.
- Orthologous regions
  - Coding regions defined as orthologous between the canine genome (CanFam3) and human genomes (hg19 and hg38).
    - canFam3CDSOrthologousRegions.bed
    - hg19CDSOrthologousRegions.bed
    - hg38CDSOrthologousRegions.bed
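As a starting point for inspecting these files, the sketch below uses only the Python standard library to preview a gzipped, tab-delimited file and to total the bases covered by a BED file. The column layout of all_feature_values.txt.gz is not described here, so the preview makes no assumption about it beyond tab delimiters. The function names are hypothetical, and the BED summary assumes the standard BED convention of 0-based, half-open intervals in the first three columns.

```python
import csv
import gzip
from collections import defaultdict


def peek_feature_table(path, n_rows=5):
    """Return the first n_rows rows of a gzipped, tab-delimited text file."""
    rows = []
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            rows.append(row)
    return rows


def bed_bases_per_chrom(path):
    """Sum interval lengths per chromosome in a BED file.

    BED coordinates are 0-based and half-open, so length = end - start.
    Overlapping intervals, if any, are counted more than once.
    """
    totals = defaultdict(int)
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            totals[chrom] += int(end) - int(start)
    return dict(totals)
```

For example, `peek_feature_table("all_feature_values.txt.gz")` shows whatever header and first rows the file contains, and `bed_bases_per_chrom("hg38CDSOrthologousRegions.bed")` gives a quick per-chromosome size check of the orthologous coding regions.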
File: mutational_signatures.tgz
Description: This directory contains files related to mutational signature calling.
- trinucleotideFiles/: Subdirectory containing the trinucleotide contexts of the somatic mutations in each sample.
- Trinucleotide frequencies in CanFam3.1, hg19, and hg38.
  - trinuc_freqs_canFam3.1_orthoRegions.RData
  - trinuc_freqs_hg19_orthoRegions.RData
  - trinuc_freqs_hg38_orthoRegions.RData
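For readers unfamiliar with trinucleotide contexts, the sketch below illustrates the standard 96-channel single base substitution (SBS) convention used by COSMIC signatures, in which each substitution is reported relative to the pyrimidine (C or T) of the mutated base pair, flanked by its 5' and 3' neighbors. It is an illustration of that general convention only; the exact layout of the files in trinucleotideFiles/ is not described here, and the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def sbs96_channels():
    """List the 96 channels as strings such as 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product("ACGT", repeat=2)
    ]


def classify_substitution(trinucleotide, alt):
    """Map a reference trinucleotide and alternate base to its SBS channel."""
    tri, alt = trinucleotide.upper(), alt.upper()
    if len(tri) != 3 or len(alt) != 1 or set(tri + alt) - set("ACGT"):
        raise ValueError("expected a 3-base context and a 1-base alt (ACGT)")
    if tri[1] == alt:
        raise ValueError("alternate base equals the reference base")
    # Purine reference bases are reported on the opposite strand.
    if tri[1] in "AG":
        tri = reverse_complement(tri)
        alt = alt.translate(COMPLEMENT)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"
```

For instance, a G>A change in the context TGA is the same event as a C>T change in the context TCA on the opposite strand, so both are counted in the channel `T[C>T]A`.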
File: machine_learning_models.tgz
Description: Output files from each of the machine learning models.
- Human-trained
  - All cancer types
    - Full model
    - Binary drivers only
    - Signatures only
  - Low mutation rate cancer types
    - Full model
    - Binary drivers only
    - Signatures only
- Dog-trained
  - Full model
  - Binary drivers only
  - Signatures only
For each machine learning model, we provide the following files:
- xgb_model_fold_(0-4) json and pickle files (10 files per model)
- xgb_params_fold_(0-4) (5 files per model)
- rfc_params_fold_(0-4) pickle files (5 files per model)
- importantFeatures_fold__(0-4) pickle files (5 files per model)
- fold_assignments.txt
- {model}_featureValues.txt
- {model}_probs.txt
- {model}_accuracies.txt
- shap_vals.pickle
- {type}_avgShapley.txt (32, 17, or 7 files depending on the number of classification choices in the model)
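A quick way to check that a model directory was extracted completely is to group its files by cross-validation fold. The sketch below is a minimal, standard-library example under stated assumptions: the directory path passed in is whatever the archive unpacks to (not specified here), file extensions are not assumed, and the regular expression simply looks for `fold_` followed by a digit, tolerating the double underscore in the importantFeatures file names.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches "fold_0", "fold__3", etc., but not "fold_assignments".
FOLD_PATTERN = re.compile(r"fold_+(\d+)")


def group_files_by_fold(model_dir):
    """Return ({fold: [file names]}, [file names without a fold index])."""
    per_fold = defaultdict(list)
    shared = []
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        match = FOLD_PATTERN.search(path.name)
        if match:
            per_fold[int(match.group(1))].append(path.name)
        else:
            shared.append(path.name)
    return dict(per_fold), shared
```

Based on the counts listed above (10 + 5 + 5 + 5 fold-specific files per model), each of folds 0 to 4 should have five files. The JSON model files can likely be loaded with `xgboost.Booster().load_model(path)`, and the pickle files with Python's `pickle` module, provided the libraries that created them are installed; as with any pickle, load only files from a source you trust.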
Code/software
Code written for the analyses presented in Genereux & Megquier, et al. is available on GitHub at:
https://github.com/diane-p-genereux/dog-genomic-cancer-model/
- Pysam v.0.15.3; Heger, Marshall, Jacobs, and contributors [1–3]; pysam.readthedocs.io
- Gffutils Python package; github.com/daler/gffutils
- Extreme Gradient Boosting (XGBoost); scikit-learn; scikit-learn.org/stable/install.html
- Shap (SHapley Additive exPlanations) package; Lundberg, et al. [4]; shap.readthedocs.io
- Fasttreeshap package; Jilei Yang [5]; github.com/linkedin/FastTreeSHAP
- SigFit v.2.2; Gori and Baez-Ortega, et al. [6]; github.com/kgori/sigfit
- SnpEff v.4.2; Cingolani, et al. [7]; pcingola.github.io/SnpEff
- Ensembl Variant Effect Predictor (VEP); McLaren, et al. [8]; ensembl.org/vep
- Python v.3.7; Python Software Foundation; https://www.python.org
- R 4.4.3; The R Core Team [9]; www.r-project.org
- Tidyverse R package 2.0.0; Wickham, et al. [10]; tidyverse.org
- Cowplot R package 1.1.3; Wilke [11]; wilkelab.org/cowplot
- Rstatix R package 0.7.2; rpkgs.datanovia.com/rstatix/
- googlesheets4 R package 1.1.1; Bryan [12]; googlesheets4.tidyverse.org
- ggpubr R package 0.6.0; Kassambara [13]; rpkgs.datanovia.com/ggpubr/
- ggplot2 R package 3.5.1; Wickham [14]; ggplot2.tidyverse.org
- vroom R package 1.6.5; Hester, Wickham and Bryan [15]; vroom.r-lib.org
- ggrepel R package 0.9.6; Slowikowski et al. [16]; github.com/slowkow/ggrepel
1. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
2. Bonfield, J. K. et al. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience 10, (2021).
3. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, (2021).
4. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).
5. Yang, J. Fast TreeSHAP: Accelerating SHAP value computation for trees. arXiv [cs.LG] (2021).
6. Gori, K. & Baez-Ortega, A. sigfit: flexible Bayesian inference of mutational signatures. bioRxiv (2018).
7. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012).
8. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
9. R Core Team. R: A language and environment for statistical computing. (2023).
10. Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4, 1686 (2019).
11. Wilke, C. O. Cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. (2024).
12. Bryan, J. googlesheets4: Access Google Sheets Using the Sheets API V4. (2023).
13. Kassambara, A. Ggpubr: ‘ggplot2’ Based Publication Ready Plots. (2025).
14. Wickham, H. ggplot2: Elegant Graphics for Data Analysis. Preprint at https://ggplot2.tidyverse.org (2016).
15. Hester, J., Wickham, H. & Bryan, J. Vroom: Read and Write Rectangular Text Data Quickly. (2023).
16. Slowikowski, K. ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. (2024).
Access information
- Data was derived from the following sources:
- Human reference genome NCBI build 37, GRCh37; NCBI: GCA_000001405.14
- Human gene annotations GRCh37; Ensembl, version 87; ensembl.org
- Human reference genome NCBI build 38, GRCh38; NCBI: GCA_000001405.28
- Human gene annotations GRCh38; Ensembl, version 104; ensembl.org
- Canine reference genome CanFam3.1
- Canine gene annotations; Ensembl, version 104; ensembl.org
- Ensembl human/canine gene orthologs; Ensembl BioMart; ensembl.org/biomart/
- NCBI human/canine gene orthologs; Gene [1]; ncbi.nlm.nih.gov/gene
- Human somatic mutation data (TCGA); The Cancer Genome Atlas; portal.gdc.cancer.gov
- Human somatic mutation data (ICGC); International Cancer Genome Consortium; docs.icgc-argo.org/docs/data-access/icgc-25k-data#open-release-data---object-bucket-details
- Angiosarcoma data; The Angiosarcoma Project [2]; cBioPortal: angs_painter_2020
- Columbia University pediatric pan-cancer data; Oberg, et al. [3]; cBioPortal: mixed_pipseq_2017
- German Cancer Research Center (DKFZ) pediatric pan-cancer data; Gröbner, et al. [4]; cBioPortal: pediatric_dkfz_2017
- Broad Institute diffuse large B-cell lymphoma data; Lohr, et al. [5]; cBioPortal: dlbc_broad_2012
- Dana-Farber Cancer Institute (DFCI) diffuse large B-cell lymphoma data; Chapuy, et al. [6]; cBioPortal: dlbcl_dfci_2018
- Canine B- and T-cell lymphoma; Elvers, et al. [7]; BioProject: PRJNA247493
- Canine B-cell lymphoma; White, et al. [8]; BioProject: PRJNA695534
- Canine mammary tumors; Arendt, et al. [9]; ENA Project: PRJEB53653
- Canine glioma; Amin, et al. [10]; BioProject: PRJNA579792; canineglioma.verhaaklab.com
- Canine hemangiosarcoma; Megquier, et al. [11]; BioProject: PRJNA552034
- Canine osteosarcoma; Sakthikumar, et al. [12]; BioProject: PRJNA391455
- Canine osteosarcoma; Gardner, et al. [13]; BioProject: PRJNA525883
- Canine melanoma; Prouteau, et al. [14]; BioProject: PRJNA786469
- Cancer driver genes; Catalogue Of Somatic Mutations In Cancer (COSMIC) Gene Census [15,16]; cancer.sanger.ac.uk/cosmic
- Druggable genes (OncoKB); OncoKB [17,18]; oncokb.org
- Hotspot mutations; Hess, et al. [19]; N/A
- Hotspot mutations; Database of Curated Mutations (DoCM); github.com/griffithlab/docm
- COSMIC single base substitution mutational signatures; Catalogue Of Somatic Mutations In Cancer (COSMIC); cancer.sanger.ac.uk/cosmic
- Pathways; Molecular Signatures Database (MSigDB) [20]; www.gsea-msigdb.org/gsea/msigdb/index.jsp
1. Brown, G.R., Hem, V., Katz, K.S., Ovetsky, M., Wallin, C., Ermolaeva, O., Tolstoy, I., Tatusova, T., Pruitt, K.D., Maglott, D.R., et al. (2015). Gene: a gene-centered information resource at NCBI. Nucleic Acids Res. 43, D36–D42.
2. Painter, C.A., Jain, E., Tomson, B.N., Dunphy, M., Stoddard, R.E., Thomas, B.S., Damon, A.L., Shah, S., Kim, D., Gómez Tejeda Zañudo, J., et al. (2020). The Angiosarcoma Project: enabling genomic and clinical discoveries in a rare cancer through patient-partnered research. Nat. Med. 26, 181–187.
3. Oberg, J.A., Glade Bender, J.L., Sulis, M.L., Pendrick, D., Sireci, A.N., Hsiao, S.J., Turk, A.T., Dela Cruz, F.S., Hibshoosh, H., Remotti, H., et al. (2016). Implementation of next generation sequencing into pediatric hematology-oncology practice: moving beyond actionable alterations. Genome Med. 8, 133.
4. Gröbner, S.N., Worst, B.C., Weischenfeldt, J., Buchhalter, I., Kleinheinz, K., Rudneva, V.A., Johann, P.D., Balasubramanian, G.P., Segura-Wang, M., Brabetz, S., et al. (2018). The landscape of genomic alterations across childhood cancers. Nature 555, 321–327.
5. Lohr, J.G., Stojanov, P., Lawrence, M.S., Auclair, D., Chapuy, B., Sougnez, C., Cruz-Gordillo, P., Knoechel, B., Asmann, Y.W., Slager, S.L., et al. (2012). Discovery and prioritization of somatic mutations in diffuse large B-cell lymphoma (DLBCL) by whole-exome sequencing. Proc. Natl. Acad. Sci. U. S. A. 109, 3879–3884.
6. Chapuy, B., Stewart, C., Dunford, A.J., Kim, J., Kamburov, A., Redd, R.A., Lawrence, M.S., Roemer, M.G.M., Li, A.J., Ziepert, M., et al. (2018). Molecular subtypes of diffuse large B cell lymphoma are associated with distinct pathogenic mechanisms and outcomes. Nat. Med. 24, 679–690.
7. Elvers, I., Turner-Maier, J., Swofford, R., Koltookian, M., Johnson, J., Stewart, C., Zhang, C.-Z., Schumacher, S.E., Beroukhim, R., Rosenberg, M., et al. (2015). Exome sequencing of lymphomas from three dog breeds reveals somatic mutation patterns reflecting genetic background. Genome Res. 25, 1634–1645.
8. White, M.E., Hayward, J.J., Hertafeld, S.R., Castelhano, M.G., Leung, W., Dave, S.S., Bhinder, B.H., Elemento, O.L., Boyko, A.R., Richards, K.L., et al. (2020). Consensus-based somatic variant-calling method correlates FBXW7 mutations with poor prognosis in canine B-cell lymphoma. bioRxiv, 2020.08.16.250100. https://doi.org/10.1101/2020.08.16.250100.
9. Arendt, M.L., Sakthikumar, S., Melin, M., Elvers, I., Rivera, P., Larsen, M., Sällström, S., Lingaas, F., Rönnberg, H., and Lindblad-Toh, K. (2022). PIK3CA is recurrently mutated in canine mammary tumors, similarly to in human mammary neoplasia. https://doi.org/10.21203/rs.3.rs-1801127/v1.
10. Amin, S.B., Anderson, K.J., Boudreau, C.E., Martinez-Ledesma, E., Kocakavuk, E., Johnson, K.C., Barthel, F.P., Varn, F.S., Kassab, C., Ling, X., et al. (2020). Comparative Molecular Life History of Spontaneous Canine and Human Gliomas. Cancer Cell 37, 243–257.e7.
11. Megquier, K., Turner-Maier, J., Swofford, R., Kim, J.-H., Sarver, A.L., Wang, C., Sakthikumar, S., Johnson, J., Koltookian, M., Lewellen, M., et al. (2019). Comparative genomics reveals shared mutational landscape in canine hemangiosarcoma and human angiosarcoma. Mol. Cancer Res. https://doi.org/10.1158/1541-7786.MCR-19-0221.
12. Sakthikumar, S., Elvers, I., Kim, J., Arendt, M.L., Thomas, R., Turner-Maier, J., Swofford, R., Johnson, J., Schumacher, S.E., Alföldi, J., et al. (2018). SETD2 Is Recurrently Mutated in Whole-Exome Sequenced Canine Osteosarcoma. Cancer Res. 78, 3421–3431.
13. Gardner, H.L., Sivaprakasam, K., Briones, N., Zismann, V., Perdigones, N., Drenner, K., Facista, S., Richholt, R., Liang, W., Aldrich, J., et al. (2019). Canine osteosarcoma genome sequencing identifies recurrent mutations in DMD and the histone methyltransferase gene SETD2. Commun. Biol. 2, 266.
14. Prouteau, A., Mottier, S., Primot, A., Cadieu, E., Bachelot, L., Botherel, N., Cabillic, F., Houel, A., Cornevin, L., Kergal, C., et al. (2022). Canine Oral Melanoma Genomic and Transcriptomic Study Defines Two Molecular Subgroups with Different Therapeutical Targets. Cancers 14. https://doi.org/10.3390/cancers14020276.
15. Sondka, Z., Bamford, S., Cole, C.G., Ward, S.A., Dunham, I., and Forbes, S.A. (2018). The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat. Rev. Cancer 18, 696–705.
16. Tate, J.G., Bamford, S., Jubb, H.C., Sondka, Z., Beare, D.M., Bindal, N., Boutselakis, H., Cole, C.G., Creatore, C., Dawson, E., et al. (2019). COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, D941–D947.
17. Chakravarty, D., Gao, J., Phillips, S.M., Kundra, R., Zhang, H., Wang, J., Rudolph, J.E., Yaeger, R., Soumerai, T., Nissan, M.H., et al. (2017). OncoKB: A Precision Oncology Knowledge Base. JCO Precis Oncol 2017. https://doi.org/10.1200/PO.17.00011.
18. Suehnholz, S.P., Kundra, R., Zhang, H., Nissan, M.H., Lu, C., Dhaneshwar, A., Fernandez, N., Nandakumar, S., Arcila, M.E., Ladanyi, M., et al. (2024). Tracking the FDA precision oncology drug approval landscape in OncoKB. J. Clin. Oncol. 42, e13507–e13507.
19. Hess, J.M., Bernards, A., Kim, J., Miller, M., Taylor-Weiner, A., Haradhvala, N.J., Lawrence, M.S., and Getz, G. (2019). Passenger hotspot mutations in cancer. Cancer Cell 36, 288–301.e14.
20. Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., and Mesirov, J.P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740.
