Supplementary data: The molecular evolution of cancer associated genes in mammals
Data files
Jun 26, 2025 version files (7.66 MB total):
Data_monkey_results.zip (6.56 MB)
Gene_alignments.zip (1.06 MB)
Phylogenetic_trees.zip (34.35 KB)
README.md (4.76 KB)
Abstract
Cancer is a disease that many multicellular organisms have faced for millions of years, and species have evolved various tumour suppression mechanisms to control oncogenesis. Although cancer occurs across the tree of life, cancer related mortality risks vary across mammalian orders, with Carnivorans particularly affected. Evolutionary theory predicts different selection pressures on genes associated with cancer progression and suppression, including oncogenes, tumour suppressor genes and immune genes. Therefore, we investigated the evolutionary history of cancer associated gene sequences across 384 mammalian taxa, to detect signatures of selection across categories of oncogenes (GRB2, FGL2 and CDC42), tumour suppressors (LITAF, Casp8 and BRCA2) and immune genes (IL2, CD274 and B2M). This approach allowed us to conduct a fine scale analysis of gene wide and site-specific signatures of selection across mammalian lineages under the lens of cancer susceptibility. Phylogenetic analyses revealed that for most species the evolution of cancer associated genes follows the species’ evolution. The gene wide selection analyses revealed oncogenes being the most conserved, tumour suppressor and immune genes having similar amounts of episodic diversifying selection. Despite BRCA2’s status as a key caretaker gene, episodic diversifying selection was detected across mammals. The site-specific selection analyses revealed that the two apoptosis associated domains of the Casp8 gene of bats (Chiroptera) are under opposing forces of selection (positive and negative respectively), highlighting the importance of site-specific selection analyses to understand the evolution of highly complex gene families. Our results highlighted the need to critically assess different types of selection pressure on cancer associated genes when investigating evolutionary adaptations to cancer across the tree of life. 
This study provides an extensive assessment of cancer associated genes in mammals, with a highly representative and substantially large sample size for a comparative genomic analysis in the field, and identifies various avenues for future research into the mechanisms of cancer resistance and susceptibility in mammals.
[Access this dataset on Dryad](Dataset DOI link)
The dataset contains supplementary files for the comparative genomic analysis conducted in our study. Analysis was conducted on 9 cancer associated genes across 868 mammalian species. Data files include sequence alignments, phylogenetic trees and result tables from two analyses run on the Datamonkey webserver (https://www.datamonkey.org/): MEME and FUBAR.
Genes: Nine genes were selected for in-depth analysis based on their roles in cancer development and progression: three oncogenes, Growth factor receptor-bound protein 2 (GRB2), Fibrinogen like protein 2 (FGL2) and Cell Division Control protein 42 homolog (CDC42); three tumour suppressors, Lipopolysaccharide Induced Tumour Necrosis Factor (LITAF), Caspase 8 (Casp8) and Breast Cancer gene 2 (BRCA2); and three immune genes, Interleukin-2 (IL2), Cluster of Differentiation 274 (CD274) and Beta-2-Microglobulin (B2M).
Gene Sequence Alignments
Sequences were obtained from the NCBI genomic database and manually aligned. The alignments were used as input to generate the phylogenetic trees and as input to the MEME and FUBAR analyses. Files are saved in FASTA format. They can be viewed in alignment editing tools such as MEGA, BioEdit, Geneious, etc. FASTA files can also be viewed in text editors, e.g. Notepad or nano.
Files:
B2M_Alignment.fas
BRCA2_Alignment.fas
Casp8_Alignment.fas
CD274_Alignment.fas
CDC42_Alignment.fas
FGL2_Alignment.fas
GRB2_Alignment.fas
IL2_Alignment.fas
LITAF_Alignment.fas
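As an alternative to opening the alignments in an editor, the FASTA files can be read programmatically. The sketch below uses only the Python standard library and checks that every sequence in a file has the same length, as expected for an alignment. The function names `parse_fasta` and `alignment_summary` are illustrative, and the sketch assumes that each header line is unique within a file.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines.

    Assumption: headers are unique; a repeated header would overwrite the earlier record.
    """
    records: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def alignment_summary(records: Dict[str, str]) -> dict:
    """Report sequence count, the distinct lengths found, and two sanity checks."""
    lengths = {len(seq) for seq in records.values()}
    return {
        "n_sequences": len(records),
        "lengths": sorted(lengths),
        # An alignment should have exactly one sequence length.
        "is_aligned": len(lengths) <= 1,
        # Codon-based analyses expect a length divisible by three.
        "codon_complete": all(n % 3 == 0 for n in lengths),
    }


# Example usage (assumes Gene_alignments.zip has been extracted to the working directory):
# with open("B2M_Alignment.fas") as handle:
#     print(alignment_summary(parse_fasta(handle)))
```

The `codon_complete` check is included because MEME and FUBAR operate on codon alignments; whether the deposited files pass it unchanged has not been verified here.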
Phylogenetic trees
Phylogenetic trees were generated for each gene using MEGA X. The input for each tree is the corresponding sequence alignment FASTA file found in the "Gene Sequence Alignments" folder. Trees are stored in Newick format, and file names follow the pattern Gene_Tree_Netwick. The trees can be viewed using tools such as iTOL (https://itol.embl.de/) or FigTree (https://tree.bio.ed.ac.uk/software/figtree/).
Files:
B2M_Tree_Tree_Netwick
BRCA2_Tree_Netwick
Casp8_Tree_Netwick
CD274_Tree_Netwick
FGL2_Tree_Netwick
GRB2_Tree_Netwick
IL2_Tree_netwick
LITAF_Tree_Netwick
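For a quick check without a tree viewer, the tip (taxon) labels can be pulled out of a Newick string with a regular expression. This is a minimal sketch, not a full Newick parser: it assumes unquoted labels and no bracketed comments, and the function name `tip_labels` is illustrative.

```python
import re
from typing import List

# A tip label starts right after "(" or "," and runs until ":", ",", ")" or ";".
# Internal node labels and support values follow ")" and are therefore skipped.
_TIP = re.compile(r"(?<=[(,])\s*([^\s(),:;][^(),:;]*)")


def tip_labels(newick: str) -> List[str]:
    """Return tip labels in the order they appear in the Newick string.

    Assumptions: labels are unquoted and the string has no [bracketed] comments.
    """
    return [match.group(1).strip() for match in _TIP.finditer(newick)]


# Example usage (assumes Phylogenetic_trees.zip has been extracted to the working directory):
# with open("BRCA2_Tree_Netwick") as handle:
#     tips = tip_labels(handle.read())
#     print(len(tips), "tips")
```

Comparing the number of tips with the number of sequences in the matching alignment is a simple way to confirm that a tree and an alignment belong together.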
Data monkey
This folder contains results from two analyses conducted using the Datamonkey webserver. Each Excel file follows the naming format "Gene_FUBAR/MEME_Results".
The descriptions of the data columns are taken directly from the Datamonkey webserver.
MEME (Mixed Effects Model of Evolution)
Site: The site in the gene alignment that the row describes (MEME is a codon-based method, so each site corresponds to a codon)
Partition
α : Synonymous substitution rate at a site
β- : Non-synonymous substitution rate at a site for the negative/neutral evolution component
p- : Mixture distribution weight allocated to β-; loosely -- the proportion of the tree evolving neutrally or under negative selection
β+ : Non-synonymous substitution rate at a site for the positive/neutral evolution component
p+ : Mixture distribution weight allocated to β+; loosely -- the proportion of the tree evolving neutrally or under positive selection
LRT : Likelihood ratio test statistic for episodic diversification, i.e., p+ > 0 and β+ > α
p-value : Asymptotic p-value for episodic diversification, i.e., p+ > 0 and β+ > α
branches under selection : A rough estimate of the number of branches (here a branch is either a single species or a group of species) that have been under selection at this site, i.e. had an empirical Bayes factor of 100 or more for the β+ rate
Files:
B2M_MEME_Results.xlsx
BRCA2_MEME_Results.xlsx
Casp8_MEME_Results.xlsx
CD274_MEME_Results.xlsx
CDC42_MEME_Results.xlsx
FGL2_MEME_Results.xlsx
GRB2_MEME_Results.xlsx
IL2_MEME_Results.xlsx
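The MEME tables can be screened for sites with evidence of episodic diversifying selection by filtering on the p-value column. The sketch below is one possible way to do this. The column headers `Site` and `p-value` are assumed to match the spreadsheet headers and may need adjusting, the threshold of 0.1 is illustrative rather than the cutoff used in the study, and `load_rows` relies on pandas (with openpyxl), which is not part of the standard library.

```python
from typing import Iterable, List, Tuple

# Assumed column headers; adjust if the spreadsheet uses different labels.
SITE_COL = "Site"
P_COL = "p-value"


def load_rows(path: str) -> List[dict]:
    """Load the first sheet of a results file as a list of row dictionaries."""
    import pandas as pd  # imported lazily so the filter below works without pandas

    return pd.read_excel(path).to_dict("records")


def significant_sites(rows: Iterable[dict], threshold: float = 0.1) -> List[Tuple[int, float]]:
    """Return (site, p-value) pairs for rows with p-value <= threshold."""
    hits = []
    for row in rows:
        p_value = float(row[P_COL])
        if p_value <= threshold:
            hits.append((int(row[SITE_COL]), p_value))
    return hits


# Example usage (assumes Data_monkey_results.zip has been extracted to the working directory):
# for site, p in significant_sites(load_rows("BRCA2_MEME_Results.xlsx")):
#     print(site, p)
```

Keeping the filter separate from the loader means the same function can be applied to rows obtained in any other way, for example from a CSV export.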
FUBAR (Fast, Unconstrained Bayesian AppRoximation)
Site: The site in the gene alignment that the row describes (FUBAR is a codon-based method, so each site corresponds to a codon)
Partition:
α : Mean posterior synonymous substitution rate at a site
β : Mean posterior non-synonymous substitution rate at a site
β-α : Mean posterior beta - alpha
Prob[α>β] : Posterior probability of negative selection at a site
Prob[α<β] : Posterior probability of positive selection at a site
BayesFactor[α<β] : Empirical Bayes Factor for positive selection at a site
Files:
B2M_FUBAR_Results.xlsx
BRCA2_FUBAR_Results.xlsx
CASP8_FUBAR_Results.xlsx
CD274_FUBAR_Results.xlsx
CDC42_FUBAR_Results.xlsx
FGL2_FUBAR_Results.xlsx
GRB2_FUBAR_Results.xlsx
IL2_FUBAR_Results.xlsx
LITAF_FUBAR_Results.xlsx
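The two posterior probability columns in the FUBAR tables can be used to label each site as showing evidence of positive selection, negative selection, or neither. The sketch below assumes that rows are plain dictionaries keyed by the column names listed above (for example as produced by pandas `DataFrame.to_dict("records")`). The headers `Prob[α>β]` and `Prob[α<β]` may be spelled differently in the spreadsheets, and the 0.9 threshold is illustrative rather than the cutoff used in the study.

```python
from collections import Counter
from typing import Iterable

# Assumed column headers; adjust if the spreadsheet uses different labels.
NEG_COL = "Prob[α>β]"  # posterior probability of negative selection
POS_COL = "Prob[α<β]"  # posterior probability of positive selection


def classify_site(row: dict, threshold: float = 0.9) -> str:
    """Label one site by which posterior probability reaches the threshold."""
    if float(row[POS_COL]) >= threshold:
        return "positive"
    if float(row[NEG_COL]) >= threshold:
        return "negative"
    return "unclassified"


def tally_sites(rows: Iterable[dict], threshold: float = 0.9) -> Counter:
    """Count how many sites fall into each class."""
    return Counter(classify_site(row, threshold) for row in rows)
```

A site labelled "unclassified" simply does not reach the chosen threshold in either direction; it should not be read as evidence of neutral evolution.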
For information on sampling and analysis see: https://doi.org/10.1038/s41598-024-62425-0
