A major goal in translational cancer research is to identify biological signatures driving cancer progression and metastasis. A common technique applied in genomics research is to cluster patients using gene expression data from a candidate prognostic gene set and, if the resulting clusters show statistically significant outcome stratification, to associate the gene set with prognosis, suggesting its biological and clinical importance. Recent work has questioned the validity of this approach by showing in several breast cancer data sets that "random" gene sets tend to cluster patients into prognostically variable subgroups. This work suggests that new rigorous statistical methods are needed to identify biologically informative prognostic gene sets. To address this problem, we developed Significance Analysis of Prognostic Signatures (SAPS), which integrates standard prognostic tests with a new prognostic significance test based on stratifying patients into prognostic subtypes with random gene sets. SAPS ensures that a significant gene set is not only able to stratify patients into prognostically variable groups, but is also enriched for genes showing strong univariate associations with patient prognosis, and performs significantly better than random gene sets. We use SAPS to perform a large meta-analysis (the largest completed to date) of prognostic pathways in breast and ovarian cancer and their molecular subtypes. Our analyses show that only a small subset of the gene sets found statistically significant using standard measures achieve significance by SAPS. We identify new prognostic signatures in breast and ovarian cancer and their corresponding molecular subtypes, and we show that prognostic signatures in ER negative breast cancer are more similar to prognostic signatures in ovarian cancer than to prognostic signatures in ER positive breast cancer.
SAPS is a powerful new method for deriving robust prognostic biological signatures from clinically annotated genomic datasets.
Breast Cancer Data
Breast cancer data. This R-workspace contains the objects: dat, dat.st, event, st, and time.
Breast.zip
Ovary_NonAngio_GSEA_Results
Results from GSEA Analysis in Non-Angiogenic subtype of ovarian cancer
Ovary_NonAngio.zip
Breast_Global_GSEA_Results
Results from GSEA Analysis in Global Breast Cancer Analysis
Breast_Global.zip
Breast_Her2_GSEA_Results
Results from GSEA Analysis in HER2+ subtype of breast cancer
Breast_Her2.zip
Ovary_Angio_GSEA_Results
Results from GSEA Analysis in Angiogenic subtype of ovarian cancer
Ovary_Angio.zip
Breast_ERHigh_GSEA_Results
Results from GSEA Analysis in ER+ high proliferation subtype of breast cancer
Breast_ERHigh.zip
Breast_ERNegHer2Neg_GSEA_Results
Results from GSEA Analysis in ER Neg HER2 Neg subtype of breast cancer
Breast_ERNegHer2Neg.zip
Ovary_Global_GSEA_Results
Results from GSEA Analysis in Global ovarian cancer analysis
Ovary_Global.zip
Breast_ERLow_GSEA_Results
Results from GSEA Analysis in ER+ low proliferation subtype of breast cancer
Breast_ERLow.zip
Ovarian Cancer Data
Ovarian cancer data. This R-workspace contains the objects: dat, dat.st, event, st, and time.
Ovary.zip
Breast.Ps.OnPermutedData.RData
Breast.Ps.OnPermutedData.RData contains the results of performing SAPS using permuted gene sets on the breast data. P_enrich, p_pure, and p_rand are each 8 x 10000 x 6 arrays holding the P_enrich, P_pure, and P_random p values from permuted gene sets.
Ovary.Ps.OnPermutedData.RData
Ovary.Ps.OnPermutedData.RData contains the results of performing SAPS using permuted gene sets on the ovarian data. P_enrich, p_pure, and p_rand are arrays holding the P_enrich, P_pure, and P_random p values from permuted gene sets.
FinalOutput_Breast
FinalOutput_Breast.RData contains the results from the subtype-specific analysis in breast cancer, including the results of the permutation-based procedure to compute p values and q values for the SAPS scores.
FinalOutput_Ovary
FinalOutput_Ovary.RData contains the results from the traditional scaled data set in ovarian cancer, including the results of the permutation-based procedure to compute p values and q values for the SAPS scores.
molsigdb.v3.0.entrezForR.txt
molsigdb.v3.0.entrezForR.txt contains the Molecular Signatures Database (MSigDB) v3.0 gene sets, downloaded from the Broad Institute. The file is used to read these gene sets into R.
BreastOutput_TradScaled
BreastOutput_TradScaled.RData is an R-workspace containing the objects: allPs, allPs.adj, sumTable. These were generated by applying the SAPS method to the breast cancer meta-data set, which was scaled by transforming each feature into its Z score across all patients in a data-set prior to merging across data-sets.
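The scaling step described above (Z-scoring each feature within a data-set before merging) can be illustrated with a minimal sketch. This is not the authors' R code: the function names, the toy matrices, and the choice of the sample standard deviation (the same convention as R's scale()) are assumptions made for illustration only.

```python
from statistics import mean, stdev

def zscore_within_dataset(matrix):
    """Scale each feature (row) to mean 0 and SD 1 across the patients
    (columns) of a single data-set."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sd = stdev(row)  # sample SD (n - 1 denominator); an assumption
        # A constant feature has no spread; map it to zeros instead of dividing by zero.
        scaled.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return scaled

def merge_datasets(datasets):
    """Scale every data-set separately, then concatenate patients feature by feature."""
    scaled_sets = [zscore_within_dataset(d) for d in datasets]
    n_features = len(scaled_sets[0])
    return [sum((s[g] for s in scaled_sets), []) for g in range(n_features)]

# Hypothetical toy data: 2 genes measured in two studies on very different scales.
study_a = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
study_b = [[100.0, 300.0], [5.0, 7.0]]
merged = merge_datasets([study_a, study_b])
print(merged[0])  # gene 1 across all 5 patients, now on a common scale
```

Scaling before merging removes study-specific offsets, so that a gene's values are comparable across data-sets. The subtype-specific variant would apply the same function to the patients of one subtype within each data-set.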
BreastOutput_SubScaled
BreastOutput_SubScaled.RData is an R-workspace containing the objects: allPs, allPs.adj, sumTable. These were generated by applying the SAPS method to the breast cancer meta-data set, which was scaled by transforming each feature into its Z score across all patients within a breast cancer subtype in a data-set prior to merging across data-sets.
BreastSubtypeSpecScaleRankDir
BreastSubtypeSpecScaleRankDir contains the ranked gene lists of concordance indices used to perform the GSEA in breast cancer.
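As an illustration of how a ranked gene list of concordance indices might be built, the sketch below computes Harrell's concordance index for each gene and sorts the genes by it. It is a simplified assumption-laden example, not the code used to produce the files: the gene names and toy values are hypothetical, the variables time and event only mirror the object names listed for the R-workspaces, and the direction convention (higher expression paired with shorter survival counts as concordant) is an assumption.

```python
def concordance_index(expr, time, event):
    """Harrell's C: fraction of comparable patient pairs in which the patient
    with higher expression has the shorter survival time (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(expr)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's last follow-up time.
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if expr[i] > expr[j]:
                    concordant += 1.0
                elif expr[i] == expr[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def rank_genes(expr_by_gene, time, event):
    """Return (gene, C) pairs sorted from highest to lowest concordance index."""
    scores = {g: concordance_index(x, time, event) for g, x in expr_by_gene.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy cohort of four patients (event = 1 means the event was observed).
time = [2.0, 4.0, 6.0, 8.0]
event = [1, 1, 0, 1]
expr_by_gene = {
    "GENE_A": [9.0, 7.0, 5.0, 3.0],  # high expression, short survival
    "GENE_B": [1.0, 2.0, 3.0, 4.0],  # high expression, long survival
    "GENE_C": [5.0, 5.0, 5.0, 5.0],  # uninformative
}
ranked = rank_genes(expr_by_gene, time, event)
print(ranked)
```

A list ranked this way places genes associated with poor prognosis at one end and genes associated with good prognosis at the other, which is the form of input a pre-ranked GSEA expects.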
OvaryOutput_TradScaled
OvaryOutput_TradScaled.RData is an R-workspace containing the objects: allPs, allPs.adj, sumTable. These were generated by applying the SAPS method to the ovarian cancer meta-data set, which was scaled by transforming each feature into its Z score across all patients in a data-set prior to merging across data-sets.
OvaryOutput_SubScaled
OvaryOutput_SubScaled.RData is an R-workspace containing the objects: allPs, allPs.adj, sumTable. These were generated by applying the SAPS method to the ovarian cancer meta-data set, which was scaled by transforming each feature into its Z score across all patients within an ovarian cancer subtype in a data-set prior to merging across data-sets.
OvaryTradScaleRankDir
OvaryTradScaleRankDir contains the ranked gene lists of concordance indices used to perform the GSEA in ovarian cancer.
BreastOvary_HCv2
BreastOvary_HCv2.zip – This zip directory contains files to generate Figure 10 (Hierarchical clustering of breast and ovarian cancer subtypes based on SAPS scores) using JavaTreeView (http://jtreeview.sourceforge.net/)
runSAPSonPermutedData
runSAPSonPermutedData.R – This R script generates the P_pure, P_random, and P_enrichment p values on random gene sets. This "biologically null" set of SAPS scores is used to compute the SAPS_q_values on the msigdb gene sets.
saps
saps.R – This R script provides R commands for loading data, applying the SAPS method, and generating the SAPS p values. The script is interactive: the user must specify the working directory and whether the analysis is on the ovarian or breast data.
sapsFigures
sapsFigures.R – This R script generates the figures, the tables, and the file used for clustering.
computeSAPS.Permute.PValue.R
computeSAPS.Permute.PValue.R – This script generates permutation-based p and q values for the SAPS scores obtained in breast and ovarian cancer.
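A permutation-based p and q value computation of this general kind can be sketched as follows. This is a generic illustration under stated assumptions, not a transcription of computeSAPS.Permute.PValue.R: the add-one empirical p value estimator, the Benjamini-Hochberg adjustment, the convention that larger scores are more extreme, and all names and numbers below are hypothetical.

```python
def empirical_p(observed, null_scores):
    """Share of null scores at least as large as the observed score,
    with an add-one correction so the p value is never exactly zero."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values (q values), returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical null scores from permuted gene sets and observed scores for three gene sets.
null_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.6, 0.2, 0.7]
observed = {"SET_1": 1.5, "SET_2": 0.55, "SET_3": 0.15}

p = {name: empirical_p(score, null_scores) for name, score in observed.items()}
q = dict(zip(p, benjamini_hochberg(list(p.values()))))
print(p)
print(q)
```

With only nine null scores the smallest attainable p value is 0.1; a large null distribution, such as one built from many permuted gene sets, gives correspondingly finer resolution.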