\(*\) Correspondence should be addressed to Huining Kang (HuKang@salud.unm.edu).
We recently developed a new statistical method for analyzing differentially expressed/spliced transcript isoforms, which appropriately account for the dependence between isoforms and multiple testing corrections for the multi-dimensional structure of at both the gene-and isoform-level. We proposed a linear mixed effects model-based approach for analyzing the complex alternative RNA splicing regulation patterns detected by whole-transcriptome RNA-Sequencing technologies. This approach thoroughly characterizes and differentiates three types of genes related to alternative RNA splicing events with distinct differential expression/splicing patterns. We applied the concept of appropriately controlling for the gene-level overall false discovery rate (OFDR) in this multi-dimensional alternative RNA splicing analysis utilizing a two-step hierarchical hypothesis testing framework. In the initial screening test we identify genes that have differentially expressed or spliced isoforms; in the subsequent confirmatory testing stage we examine only the isoforms of genes that have passed the screening tests. The methodology is described in detail in the manuscript titled Two-step mixed model approach to analyzing differential alternative RNA splicing submitted to PLOS ONE. We have provided an R program file (Tools.R) that included all the functions for performing the analysis with proposed method. The R program file can be downloaded here. We also wrote a Tutorial to show step-by-step how to use these functions in R to perform the analyses that we have proposed in our manuscript.
In the manuscript we have reported an application of our method to an RNA-sequencing study of Pediatric Acute Myeloid Leukemia (AML). We applied our method to identify differentially expressed/spliced isoforms that are associated survival outcome in a large cohort of pediatric AML from the NCI/COG TARGET-AML initiative (Farrar, et. al. 2016). The RNA-seq data at exon, isoform and gene levels as well as the clinical data for the AML patients are downloaded from NCI/COG TARGET website at https://ocg.cancer.gov/programs/target. We analyzed a subset of 234 patients who have clinical outcomes with 81 cases (relapsed within 3 years) and 153 controls (CCR for at least 3 years). After removing those with low abundances and those that are not in the coding region, we analyzed a total of 35397 isoforms representing 8058 genes. We converted the data set to an R list object named data in the format described in the Tutorial. This data set is available per request. We adhere to the principles of transparency in order to assure that our results are reproducible and present in this document the detailed steps and R codes for the analysis. The major steps of the analysis for this application are the same as we presented in our Tutorial except that the dataset is different. We strongly encourage readers to read the Tutorial before reading this document.
As suggested in our Tutorial we created a folder called AML as our working directory and under this folder we created two subfolders data and output. Aforementioned R object data is saved in a file named acc.RData in the subfolder data.
In the following (our primary analysis for AML data) we assumed the compound symmetry with unequal variances as the covariance structure and we set the statistical significance at OFDR = 0.05.
Let’s take a look at the data.
# load data to the memory
load(file="data/AML_isoform_outcome.RData")
# list the elements of object data
str(data)
## List of 6
## $ x : num [1:35397, 1:234] 3.808 2.794 1.887 3.566 0.253 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:35397] "ENST00000399788" "ENST00000540838" "ENST00000544777" "ENST00000536014" ...
## .. ..$ : chr [1:234] "PABHET-03A-02R" "PABHKY-03A-02R" "PABLDZ-09A-04R" "PACDZR-09A-02R" ...
## $ y : num [1:234] 1 1 2 2 2 2 2 2 2 1 ...
## $ gene : chr [1:35397] "KDM5A" "KDM5A" "KDM5A" "KDM5A" ...
## $ isoform : chr [1:35397] "ENST00000399788" "ENST00000540838" "ENST00000544777" "ENST00000536014" ...
## $ isoform_name: chr [1:35397] "KDM5A-001" "KDM5A-003" "KDM5A-008" "KDM5A-010" ...
## $ y.lab : chr [1:2] "Control" "Case"
The total number of isoforms is 35397 and the total number of genes is as follows
length(unique(data$gene))
## [1] 8058
The function find.nL()
defined in file Tools.R
can be used to calculate the number of isoforms for each gene. It returns a numeric vector with each element being the number of isoforms of one gene and the names of the elements being gene names.
source("Tools.R")
nL <- find.nL(data=data)
The numbers of isoforms for the first 5 genes are
head(nL, n=5)
## KDM5A NINJ2 RAD52 DCUN1D2 LINS
## 5 2 2 2 3
The numbers of isoforms for the last 5 genes are
tail(nL, n=5)
## PRIM2 SCAMP1 VPS11 EMG1 ALDH3B1
## 2 4 5 4 4
We can summarize all the numbers as follows:
sum.nL <- table(nL)
sum.nL
## nL
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## 2350 1684 1233 837 581 411 300 214 120 99 62 40 32 19 10 15 10 7 10 5
## 22 23 25 27 28 31 33 34 35 44
## 4 3 2 4 1 1 1 1 1 1
The result indicates that there are 2350 genes that have 2 isoforms for each, 1684 genes that have 3 isoforms for each, and so on so forth.
As described in our Tutorial, the first step is to run the screening test, the likelihood ratio test (LRT) by calling function lrt.test.mclapply()
. We proceed this step by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", nCore=NULL, outfile="output/AML_outcome_uneqvar_type1.RData", screen.type=1)
This was the most time-consuming part. It returned a matrix object res
which was saved in the file AML_outcome_uneqvar_type1.RData
under the subfolder output. Now we load the object from the file and take a look.
load(file="output/AML_outcome_uneqvar_type1.RData")
The dimension of the object is
dim(res)
## [1] 8058 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## KDM5A 6.948314 5 5 0.224512694 0.43585457 1.8104832
## NINJ2 6.885362 2 2 0.031978841 0.14273506 0.6409948
## RAD52 0.391820 2 2 0.822086207 0.90745762 0.5917048
## DCUN1D2 3.563903 2 2 0.168309379 0.36916044 0.6450491
## LINS 15.884635 3 3 0.001197441 0.02049538 0.8307495
## TMEM86B 3.572787 3 3 0.311443836 0.52378277 0.8714731
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## SHANK3 0.4641871 3 3 0.92668589 0.96262112 0.3201759
## PRIM2 0.0439027 2 2 0.97828783 0.98826907 0.3466947
## SCAMP1 12.3450460 4 4 0.01496164 0.09332533 0.9187062
## VPS11 3.8718428 5 5 0.56801122 0.74272356 1.4226265
## EMG1 2.6121073 4 4 0.62467988 0.78342889 0.7937691
## ALDH3B1 0.7047276 4 4 0.95074464 0.97582004 0.4763246
Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt
), degree of freedom (df
), Number of isoforms (nIso
), p-values of LRT (p.value
), FDR (FDR
) and time used for fitting the linear mixed effects model in second (Time
). Note that there might be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.
p <- which(is.na(res[, 1]))
length(p)
## [1] 236
We see that there are 236 (2.93%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "HDLBP" "GNL1" "PPP1R10" "GATAD2B" "PAXBP1" "ZMYM6" "SRP68" "H3F3B"
## [9] "ZC3HC1" "S100A6" "WBP2" "RGPD6" "CXorf65" "TPM3" "HK3" "CDK2AP1"
## [17] "EEF1D" "EPRS" "PLAGL1" "IDH2" "SDHB" "UBE2D3" "PI4KB" "MCTP1"
## [25] "CLINT1" "RPLP0" "NDUFB1" "GBA" "CLK2" "IER3" "DDX39B" "BAG6"
## [33] "DDAH2" "CLIC1" "EHMT2" "MRPS25" "ATF6B" "TAP2" "TAP1" "HLA-DMA"
## [41] "HLA-DPA1" "AES" "TAPBP" "USP34" "HNRNPH1" "C19orf70" "TCF20" "ADAM8"
## [49] "AKT2" "HLA-B" "BCAP31" "DNMT1" "CDKN2D" "GNB2L1" "ACIN1" "KDM5C"
## [57] "ARHGAP4" "NGLY1" "AGPAT1" "GPSM3" "CEP152" "ZNF468" "CUTA" "NDUFA6"
## [65] "FLNA" "VPS52" "AKT1" "MS4A4E" "LUZP1" "CHFR" "YWHAE" "SETD4"
## [73] "C4orf36" "DDX5" "RASGRP2" "B4GALT6" "RBM39" "TPT1" "ETFA" "NREP"
## [81] "TSPAN3" "REEP5" "CTSH" "RPL17" "CPVL" "NAP1L1" "LDHB" "ATP5G2"
## [89] "TMEM66" "FIBP" "C12orf50" "ATP6V0D1" "RSL1D1" "BET1" "FEZ2" "HSP90AA1"
## [97] "MCFD2" "JUP" "CRBN" "SMG1" "NDUFB11" "TLE3" "TLN1" "PKM"
## [105] "TUBA1A" "SOS1" "AIFM1" "IKBIP" "CPSF3L" "PRKRA" "CCNL2" "ATP5I"
## [113] "P4HB" "ACSL1" "CCT3" "KMT2C" "MEF2A" "CDC16" "RARS" "IGFBP2"
## [121] "HLA-A" "NUDT16" "HLA-E" "ABCF1" "TUBB" "EIF4A1" "CD44" "SAP30BP"
## [129] "LDHA" "UBR7" "TSPAN17" "ARL6IP4" "MATR3" "PTPRA" "DNM1" "RWDD1"
## [137] "PSMD9" "MRPS18C" "IL1RN" "MYO15B" "RNF10" "KPNB1" "RPS12" "TNF"
## [145] "AIF1" "PRRC2A" "HMHA1" "CSNK2B" "CIRBP" "MSH5" "MYB" "C6orf48"
## [153] "PPT2" "ZNF185" "PSMB9" "BRD2" "HLA-DPB1" "RING1" "RPS18" "IFNGR2"
## [161] "ADD1" "LST1" "SH3TC1" "STIM2" "SKIV2L" "SERF2" "RNF5" "TCN2"
## [169] "PHF1" "CTNND1" "ABCC10" "MPL" "MTHFD2" "PLD4" "ILF3" "SMARCA4"
## [177] "RAD23A" "ZSWIM4" "RRM1" "TPM4" "CDC42" "URI1" "ZNF507" "ILK"
## [185] "SRRM1" "SPINT2" "RAD54L2" "TEX264" "CLK3" "NOL11" "SNX33" "HNRNPH3"
## [193] "RCN2" "ANP32B" "ATXN2L" "CALCOCO2" "TAX1BP1" "PRR13" "MIA3" "ZDHHC17"
## [201] "POLI" "PSMD7" "INPP5D" "UBN1" "EIF4A2" "LGALS9" "RBM48" "MYCBPAP"
## [209] "BANF1" "MYL12A" "CNTLN" "LRRK2" "CREM" "GNAS" "PPM1A" "ZYX"
## [217] "MRPL48" "FUS" "HIF1A" "EIF5" "NAPG" "ARMC5" "C1orf162" "OGFOD1"
## [225] "NPIPB5" "HERPUD1" "TRIM56" "NEO1" "SETD5" "APEX1" "PIGG" "SNRPC"
## [233] "ATAD3A" "CHPT1" "SSFA2" "G3BP1"
Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res1
sig.res1 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res1)
## [1] 782 6
We see that there are total 782 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,
head(sig.res1[order(sig.res1[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## TARDBP 80.35772 8 8 4.141857e-14 3.239761e-10 5.152162
## SP100 86.48541 14 14 1.742530e-12 6.815035e-09 22.984989
## MYCBP2 63.25134 7 7 3.374746e-11 8.799088e-08 3.416999
## NCK1 57.59970 7 7 4.542234e-10 8.882339e-07 1.909047
## SPAG9 66.10922 11 11 6.661781e-10 1.042169e-06 12.294924
## UBXN1 69.51061 13 13 9.876279e-10 1.287538e-06 18.988426
tail(sig.res1[order(sig.res1[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## MICB 10.62510 2 2 0.004929351 0.04952175 0.2858193
## SPN 14.89070 4 4 0.004933361 0.04952175 1.0132439
## CYB5R3 18.58010 6 6 0.004934811 0.04952175 3.0487463
## HMGXB4 14.88846 4 4 0.004938246 0.04952175 1.1805918
## EIF5A 21.98428 8 8 0.004945071 0.04952669 5.0797310
## PRRC1 14.86645 4 4 0.004986382 0.04987657 1.2450681
We saved the results in a worksheet named “Sig. G - Type 1” in a Excel file AML_uneqvar.xlsx using the following commands
require(xlsx)
## Loading required package: xlsx
write.xlsx(sig.res1[order(sig.res1[, "p.value"]), ], file="AML_uneqvar.xlsx", sheetName = "Sig. G - Type 1", row.names = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 782 genes that are differentially expressed using function step2.Ftest()
as follows
final.res1 <- step2.Ftest(data=data, file.step1="output/AML_outcome_uneqvar_type1.RData", alpha=0.05, method="holm")
## Number of significant genes: 782
## Cutoff for significant isoform is 0.004998722
Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the options for controlling the family-wise error rate required by our two step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Holm option throughout this document. The function returns the results of the confirmatory test for all the isoforms of the above 782 genes in a data.frame object called final.res1
. Let’s take a look at the results.
dim(final.res1)
## [1] 4792 7
head(final.res1)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 LINS ENST00000314742 -1.059541 1.846141e-01 1.846141e-01 0.004998722 FALSE
## 2 LINS ENST00000559169 1.211968 4.744510e-03 1.423353e-02 0.001666241 FALSE
## 3 LINS ENST00000561073 1.208533 3.586204e-02 7.172409e-02 0.002499361 FALSE
## 4 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 0.002499361 TRUE
## 5 UBE2S ENST00000264552 -1.049046 6.913375e-01 6.913375e-01 0.004998722 FALSE
## 6 TOPBP1 ENST00000505804 1.535638 1.030649e-04 3.091948e-04 0.001666241 TRUE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and decison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res1.I.t <- final.res1[final.res1$decision, ]
dim(sig.res1.I.t)
## [1] 265 7
head(sig.res1.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 4 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 0.002499361 TRUE
## 6 TOPBP1 ENST00000505804 1.535638 1.030649e-04 3.091948e-04 0.001666241 TRUE
## 17 PDCD10 ENST00000492396 1.444897 7.527957e-04 3.011183e-03 0.001249680 TRUE
## 55 INO80 ENST00000558357 -1.532673 4.038737e-04 1.211621e-03 0.001666241 TRUE
## 97 CAMSAP1 ENST00000409386 -1.350315 1.149588e-03 3.448764e-03 0.001666241 TRUE
## 101 PHF8 ENST00000338946 1.412894 9.481201e-06 3.792480e-05 0.001249680 TRUE
tail(sig.res1.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 4619 BZW2 ENST00000452975 -1.757608 0.0004382129 0.0043821293 0.0004998722 TRUE
## 4645 MST4 ENST00000394335 1.520857 0.0001331171 0.0006655855 0.0009997443 TRUE
## 4646 NAA30 ENST00000554703 -1.374659 0.0013486235 0.0040458705 0.0016662405 TRUE
## 4686 METTL17 ENST00000553389 1.479945 0.0002133583 0.0006400749 0.0016662405 TRUE
## 4767 PTPN12 ENST00000248594 1.375450 0.0006907204 0.0034536022 0.0009997443 TRUE
## 4773 PHTF2 ENST00000307305 1.261118 0.0001464911 0.0005859646 0.0012496804 TRUE
We see that with the standard t-test option we identified 265 isoforms that are called significant. These isoforms belong to 214 genes, which are
unique(unique(sig.res1.I.t[, 1]))
## [1] "UBE2S" "TOPBP1" "PDCD10" "INO80" "CAMSAP1"
## [6] "PHF8" "SYNJ1" "PHF21A" "SHPRH" "RPGR"
## [11] "TOMM20" "MAP3K7" "SRSF2" "NHSL1" "CDKN1C"
## [16] "IFIH1" "NPM3" "EIF4G3" "DICER1" "PLSCR1"
## [21] "TRIT1" "TMX3" "MEF2C" "FNBP4" "RASSF2"
## [26] "MAN1A1" "ORMDL1" "EOGT" "FAM188A" "PRPF40A"
## [31] "VPS54" "TMEM128" "LPCAT1" "SH3BP5" "SIRT6"
## [36] "SATB1" "CCND3" "AZI2" "MAP3K5" "CYB5R1"
## [41] "TMEM87A" "TOP2B" "DEK" "TNKS1BP1" "SEPP1"
## [46] "BOD1L1" "SEL1L3" "MOB1A" "IBTK" "GFM2"
## [51] "AQR" "TPD52" "PPP1CC" "NUP133" "NDRG1"
## [56] "PEPD" "EAPP" "PPP2R3C" "CHMP1A" "FNBP1"
## [61] "GLRX" "POLG2" "PITPNB" "TRAPPC8" "SLC39A11"
## [66] "PARL" "TCEA1" "DSN1" "CYCS" "CSAD"
## [71] "RPUSD1" "IER3IP1" "BLCAP" "TRIP12" "ACRBP"
## [76] "SPPL2A" "DMXL2" "MRE11A" "CCPG1" "RGS10"
## [81] "ARHGAP12" "CCDC82" "BNIP2" "ASUN" "MYCBP2"
## [86] "ZDHHC6" "CCDC59" "TMEM123" "DCTN3" "RBM26"
## [91] "THOC1" "PTPN22" "ZRANB2" "ARHGAP21" "SPINK2"
## [96] "KARS" "PCYT1A" "UBXN1" "MICU2" "CENPC"
## [101] "DLG1" "CEBPZ" "PIGK" "LBR" "SLCO5A1"
## [106] "EDEM3" "GTF2H2" "DPM1" "MTMR6" "TPM2"
## [111] "IQCB1" "C9orf72" "ASH1L" "SLC2A5" "NOL8"
## [116] "LAMA5" "ROCK1" "ANKRA2" "LEMD2" "SNX13"
## [121] "FAM120AOS" "MAX" "POLR2J2" "TMED5" "GSAP"
## [126] "GRIPAP1" "CWC15" "NCK1" "CHD3" "S100PBP"
## [131] "GMNN" "MIS12" "VPS16" "TSPAN7" "EIF3M"
## [136] "RASA1" "METTL23" "SCOC" "GCA" "HMGN2"
## [141] "RPS6KA1" "TPP2" "C7orf55-LUC7L2" "TNFSF13B" "OXR1"
## [146] "FPGS" "RHOH" "TRH" "KNTC1" "CDC14A"
## [151] "DMTF1" "CCNJ" "EEF1B2" "MARCH7" "TBC1D23"
## [156] "MTRR" "RPL15" "VMA21" "PTGER4" "MARCH6"
## [161] "NPR3" "SERPING1" "BRIX1" "RUNX2" "MICB"
## [166] "CPNE3" "PNN" "GPR114" "BPTF" "TAF4B"
## [171] "RNF138" "TTC14" "EI24" "RAD51AP1" "TXNRD1"
## [176] "KIAA1033" "SNX10" "TGFBR1" "PPWD1" "CLN8"
## [181] "MED17" "SP100" "AEBP2" "GORAB" "CRAMP1L"
## [186] "KLHDC2" "SLC25A5" "RPS8" "UBAP1" "COMMD3"
## [191] "TLE4" "ATF7IP2" "SLC30A5" "CRCP" "ZMYM2"
## [196] "TGIF1" "PPFIA1" "KDM6A" "AKAP2" "SOCS2"
## [201] "ANKRD12" "ITPR1" "KLHL2" "RARRES3" "HNRNPA3"
## [206] "WRNIP1" "ATF1" "TNRC6A" "BZW2" "MST4"
## [211] "NAA30" "METTL17" "PTPN12" "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 1 t-test” in the same Excel file AML_uneqvar.xlsx using the following commands
write.xlsx(sig.res1.I.t, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 1 t-test", row.names = F, append = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 782 genes that are differentially expressed using function mc.step2.Waldtest
and save the object final.res1.wald
returned by the function in a file called AML_outcome_uneqvar_type1_step2_Wald.RData
under subfolder output
final.res1.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/AML_outcome_uneqvar_type1.RData", alpha=0.05, method="homl")
save(final.res1.wald, file="output/AML_outcome_uneqvar_type1_step2_Wald.RData")
This operation may take a little long time. Let’s load the object final.res1.wald
from the file and then take a look.
load(file="output/AML_outcome_uneqvar_type1_step2_Wald.RData")
dim(final.res1.wald)
## [1] 4792 6
head(final.res1.wald)
## gene isoform step2.p.value adj.p threshold decision
## 1 LINS ENST00000314742 1.894253e-01 1.894253e-01 0.004998722 FALSE
## 2 LINS ENST00000559169 4.279246e-03 1.283774e-02 0.001666241 FALSE
## 3 LINS ENST00000561073 2.984596e-02 5.969193e-02 0.002499361 FALSE
## 4 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 0.002499361 TRUE
## 5 UBE2S ENST00000264552 6.897118e-01 6.897118e-01 0.004998722 FALSE
## 6 TOPBP1 ENST00000505804 1.552521e-05 4.657563e-05 0.001666241 TRUE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res1.I.wald <- final.res1.wald[final.res1.wald$decision, ]
dim(sig.res1.I.wald)
## [1] 291 6
head(sig.res1.I.wald)
## gene isoform step2.p.value adj.p threshold decision
## 4 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 0.002499361 TRUE
## 6 TOPBP1 ENST00000505804 1.552521e-05 4.657563e-05 0.001666241 TRUE
## 17 PDCD10 ENST00000492396 2.562722e-04 1.025089e-03 0.001249680 TRUE
## 55 INO80 ENST00000558357 4.221384e-04 1.266415e-03 0.001666241 TRUE
## 97 CAMSAP1 ENST00000409386 9.455119e-04 2.836536e-03 0.001666241 TRUE
## 101 PHF8 ENST00000338946 2.589063e-04 1.035625e-03 0.001249680 TRUE
We see that with the Wlad-test option we identified 291 isoforms that are called significant. These isoforms belong to 232 genes, which are
unique(unique(sig.res1.I.wald[, 1]))
## [1] "UBE2S" "TOPBP1" "PDCD10" "INO80" "CAMSAP1" "PHF8" "SYNJ1"
## [8] "PHF21A" "SHPRH" "RPGR" "TOMM20" "MAP3K7" "NHSL1" "CDKN1C"
## [15] "IFIH1" "NPM3" "SBF2" "DICER1" "PLSCR1" "TRIT1" "CLK4"
## [22] "TMX3" "MEF2C" "FNBP4" "RASSF2" "MAN1A1" "ORMDL1" "ELOVL5"
## [29] "EOGT" "FAM188A" "EIF4E3" "PRPF40A" "VPS54" "TMEM128" "ACTN1"
## [36] "LPCAT1" "SH3BP5" "CLDND1" "SIRT6" "SATB1" "CCND3" "AZI2"
## [43] "MAP3K5" "CYB5R1" "TMEM87A" "GRIK5" "CCNL1" "TOP2B" "DEK"
## [50] "PCGF1" "TNKS1BP1" "SEPP1" "BOD1L1" "LILRB2" "SEL1L3" "MOB1A"
## [57] "IBTK" "GFM2" "AQR" "TPD52" "PPP1CC" "NUP133" "PEPD"
## [64] "EAPP" "PPP2R3C" "HMGN5" "ABR" "CHMP1A" "FNBP1" "GLRX"
## [71] "PITPNB" "SMYD3" "TRAPPC8" "HERC4" "SLC39A11" "RB1CC1" "PARL"
## [78] "FYN" "TCEA1" "DSN1" "CYCS" "ZFC3H1" "ATP5A1" "CSAD"
## [85] "IER3IP1" "BLCAP" "TAF1D" "ACRBP" "SPPL2A" "DMXL2" "MRE11A"
## [92] "CCPG1" "OSBPL8" "RGS10" "CCDC82" "BNIP2" "ASUN" "MYCBP2"
## [99] "ZDHHC6" "CCDC59" "TMEM123" "DCTN3" "RBM26" "PTPN22" "ZRANB2"
## [106] "ARHGAP21" "IKBKAP" "HOPX" "SPINK2" "KARS" "PCYT1A" "PLIN2"
## [113] "THOC2" "UBXN1" "CENPC" "DLG1" "CEBPZ" "PIGK" "NR2C1"
## [120] "SLCO5A1" "EDEM3" "GTF2H2" "DPM1" "MTMR6" "TPM2" "IQCB1"
## [127] "C9orf72" "TOB1" "SRSF7" "IARS" "VPS29" "SLC2A5" "NOL8"
## [134] "LAMA5" "HSPH1" "ANKRA2" "SNX13" "FAM120AOS" "POLR2J2" "TMED5"
## [141] "GSAP" "CWC15" "U2AF2" "NCK1" "S100PBP" "GMNN" "MIS12"
## [148] "SPNS3" "VPS16" "TSPAN7" "EIF3M" "RASA1" "NUCB2" "METTL23"
## [155] "RASA2" "SCOC" "GCA" "HMGN2" "TPP2" "TNFSF13B" "OXR1"
## [162] "CTNNB1" "RHOH" "MSL3" "TRH" "KNTC1" "NEK6" "CDC14A"
## [169] "CCNJ" "OXA1L" "EEF1B2" "MARCH7" "TBC1D23" "MTRR" "CPSF3"
## [176] "RPL15" "EMD" "VMA21" "PTGER4" "BAX" "NPR3" "SERPING1"
## [183] "BRIX1" "MICB" "CPNE3" "PNN" "GPR114" "TAF4B" "PPIP5K2"
## [190] "RNF138" "TTC14" "EI24" "RAD51AP1" "TXNRD1" "KIAA1033" "SNX10"
## [197] "MGST1" "CLN8" "MED17" "SP100" "AEBP2" "RAB2A" "GORAB"
## [204] "CRAMP1L" "ZEB1" "KLHDC2" "SLC25A5" "ARID4A" "RPS8" "COMMD3"
## [211] "MRPS35" "WRN" "ATF7IP2" "ZNF92" "SLC30A5" "ZMYM2" "TGIF1"
## [218] "AKAP2" "SOCS2" "CDCA7" "ANKRD12" "ITPR1" "LUC7L3" "RARRES3"
## [225] "SEH1L" "ATF1" "BZW2" "MST4" "NAA30" "METTL17" "RBM3"
## [232] "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 1 Wald” in the same Excel file AML_uneqvar.xlsx using the following commands
write.xlsx(sig.res1.I.wald, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 1 Wald", row.names = F, append = T)
The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type
of function lrt.test.mclapply
needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", nCore=NULL, outfile="output/AML_outcome_uneqvar_type2.RData", screen.type=2)
Let’s take a look at the results
load(file="output/AML_outcome_uneqvar_type2.RData")
The dimension of the object is
dim(res)
## [1] 8058 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## KDM5A 6.8602998 4 5 0.1434574952 0.33378175 1.8200545
## NINJ2 1.2890201 1 2 0.2562286206 0.46720240 0.5789759
## RAD52 0.1514221 1 2 0.6971800786 0.82605803 0.6067107
## DCUN1D2 2.4249554 1 2 0.1194166404 0.29900001 0.6117477
## LINS 15.1641608 2 3 0.0005095001 0.01197555 0.8188503
## TMEM86B 3.2041603 2 3 0.2014769827 0.40538883 0.8531349
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## SHANK3 0.193505567 2 3 0.90778040 0.95295026 0.4346292
## PRIM2 0.004453407 1 2 0.94679358 0.97345110 0.5046623
## SCAMP1 10.551362594 3 4 0.01441649 0.08660597 0.9508400
## VPS11 3.821044922 4 5 0.43076617 0.63258336 1.5673914
## EMG1 1.609379097 3 4 0.65726505 0.79773188 0.7255163
## ALDH3B1 0.665525167 3 4 0.88128123 0.93764524 0.5521927
Let’s check how many genes with missing results.
p <- which(is.na(res[, 1]))
length(p)
## [1] 231
We see that there are 231 (2.87%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "HDLBP" "GNL1" "PPP1R10" "GATAD2B" "PAXBP1" "ZMYM6" "SRP68" "H3F3B"
## [9] "ZC3HC1" "S100A6" "WBP2" "RGPD6" "TPM3" "CDK2AP1" "EEF1D" "PLAGL1"
## [17] "SPI1" "IDH2" "SDHB" "UBE2D3" "RAB43" "MCTP1" "CLINT1" "RPLP0"
## [25] "NDUFB1" "PABPC1" "VNN2" "GBA" "CLK2" "EIF4E3" "IER3" "DDX39B"
## [33] "BAG6" "DDAH2" "CLIC1" "EHMT2" "MRPS25" "ATF6B" "TAP2" "TAP1"
## [41] "HLA-DMA" "HLA-DPA1" "AES" "RGL2" "TAPBP" "USP34" "HNRNPH1" "C19orf70"
## [49] "TCF20" "AKT2" "HLA-B" "BCAP31" "DNMT1" "GNB2L1" "BARD1" "ACIN1"
## [57] "KDM5C" "ARHGAP4" "NGLY1" "GPSM3" "ZNF468" "CUTA" "FLNA" "VPS52"
## [65] "MS4A4E" "LUZP1" "CHFR" "YWHAE" "MCAM" "SETD4" "DDX5" "RASGRP2"
## [73] "RBM39" "TPT1" "ETFA" "NREP" "REEP5" "CTSH" "RPL17" "CPVL"
## [81] "NAP1L1" "LDHB" "ATP5G2" "TMEM66" "FIBP" "C12orf50" "ATP6V0D1" "RSL1D1"
## [89] "MPO" "FEZ2" "HSP90AA1" "MCFD2" "SMG1" "NDUFB11" "TLN1" "PKM"
## [97] "TUBA1A" "SOS1" "SYT15" "AIFM1" "IKBIP" "CPSF3L" "PRKRA" "POLR2J"
## [105] "CCNL2" "ATP5I" "P4HB" "ACSL1" "CCT3" "KMT2C" "CDC16" "RARS"
## [113] "IGFBP2" "ATP6V0E1" "HLA-A" "NUDT16" "HLA-E" "ABCF1" "TUBB" "EIF4A1"
## [121] "CD44" "LDHA" "UBR7" "TSPAN17" "ARL6IP4" "MATR3" "PTPRA" "DNM1"
## [129] "TIGD5" "RWDD1" "SEC24B" "SLC16A5" "PSMD9" "MRPS18C" "IL1RN" "MYO15B"
## [137] "RNF10" "KPNB1" "TNF" "AIF1" "PRRC2A" "HMHA1" "CSNK2B" "CIRBP"
## [145] "MSH5" "MYB" "C6orf48" "ZNF185" "PSMB9" "BRD2" "HLA-DPB1" "RING1"
## [153] "RPS18" "NDUFS6" "IFNGR2" "ADD1" "LST1" "SH3TC1" "STIM2" "SKIV2L"
## [161] "SERF2" "TCN2" "PHF1" "CTNND1" "ABCC10" "MTHFD2" "ILF3" "SMARCA4"
## [169] "RAD23A" "RRM1" "TPM4" "CDC42" "URI1" "ZNF507" "SRRM1" "SPINT2"
## [177] "MPZL1" "TEX264" "NOL11" "SNX33" "HNRNPH3" "RCN2" "ANP32B" "ATXN2L"
## [185] "C18orf25" "CALCOCO2" "TAX1BP1" "PTK2B" "PRR13" "PCBP2" "MIA3" "ZDHHC17"
## [193] "POLI" "PSMD7" "PSMD3" "INPP5D" "ACSL5" "UBN1" "EIF4A2" "LGALS9"
## [201] "HM13" "BANF1" "MYL12A" "CNTLN" "LRRK2" "CREM" "GNAS" "PPM1A"
## [209] "ZYX" "CCNY" "MRPL48" "FUS" "HIF1A" "IFIT5" "EIF5" "NAPG"
## [217] "ARMC5" "TGFB1I1" "ACSM2A" "C1orf162" "OGFOD1" "NPIPB5" "TRIM56" "NEO1"
## [225] "APEX1" "PIGG" "SNRPC" "CDC73" "CHPT1" "SSFA2" "G3BP1"
Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res2
sig.res2 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res2)
## [1] 857 6
head(sig.res2)
## lrt df nIso p.value FDR Time
## LINS 15.16416 2 3 5.095001e-04 0.011975548 0.8188503
## UBE2S 17.55095 1 2 2.797109e-05 0.001931850 0.5785339
## GTF3C3 14.99945 4 5 4.702353e-03 0.046752963 1.2817483
## LCP2 18.35829 6 7 5.396489e-03 0.049692143 3.3005738
## MSL2 23.36706 4 5 1.069418e-04 0.004676164 1.8034894
## CLK1 45.14836 12 13 9.718996e-06 0.001056536 15.3574758
We see that there are total 857 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,
head(sig.res2[order(sig.res2[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## TARDBP 79.33817 7 8 1.879383e-14 1.470993e-10 5.962395
## MYCBP2 62.47023 6 7 1.415165e-11 5.538250e-08 3.677209
## SP100 77.93700 13 14 2.691443e-11 7.021975e-08 26.179079
## NCK1 57.09093 6 7 1.751101e-10 3.426467e-07 2.296355
## UBXN1 69.49333 12 13 3.983799e-10 6.091691e-07 20.073373
## SPAG9 64.67847 10 11 4.669752e-10 6.091691e-07 13.680007
tail(sig.res2[order(sig.res2[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## TCEAL1 10.42689 2 3 0.005442885 0.04990939 0.3515098
## ISCU 18.33694 6 7 0.005443097 0.04990939 1.6581669
## DDX11 16.54635 5 6 0.005445589 0.04990939 1.8544092
## NUMB 21.72313 8 9 0.005455429 0.04993720 6.9271486
## RABEP1 12.64634 3 4 0.005467310 0.04993720 1.1467869
## SRPK1 16.53666 5 6 0.005467763 0.04993720 2.4430540
We saved the results in a worksheet named “Sig. G - Type 2” in a Excel file AML_uneqvar.xlsx using the following commands
write.xlsx(sig.res2[order(sig.res2[, "p.value"]), ], file="AML_uneqvar.xlsx", sheetName = "Sig. G - Type 2", row.names = T, append = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 857 genes that are differentially expressed using function step2.Ftest()
as follows
final.res2 <- step2.Ftest(data=data, file.step1="output/AML_outcome_uneqvar_type2.RData", alpha=0.05, method="holm")
## Number of significant genes: 857
## Cutoff for significant isoform is 0.005474639
The function returns the results of the confirmatory test for all the isoforms of the above 857 genes in a data.frame object called final.res2
. Let’s take a look at the results.
dim(final.res2)
## [1] 5185 7
head(final.res2)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 LINS ENST00000314742 -1.059541 1.846141e-01 1.846141e-01 0.005474639 FALSE
## 2 LINS ENST00000559169 1.211968 4.744510e-03 1.423353e-02 0.001824880 FALSE
## 3 LINS ENST00000561073 1.208533 3.586204e-02 7.172409e-02 0.002737320 FALSE
## 4 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 0.002737320 TRUE
## 5 UBE2S ENST00000264552 -1.049046 6.913375e-01 6.913375e-01 0.005474639 FALSE
## 6 GTF3C3 ENST00000481098 1.380074 9.821877e-04 4.910938e-03 0.001094928 TRUE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res2.I.t <- final.res2[final.res2$decision, ]
dim(sig.res2.I.t)
## [1] 201 7
head(sig.res2.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 4 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 0.0027373195 TRUE
## 6 GTF3C3 ENST00000481098 1.380074 9.821877e-04 4.910938e-03 0.0010949278 TRUE
## 64 INO80 ENST00000558357 -1.532673 4.038737e-04 1.211621e-03 0.0018248797 TRUE
## 108 PHF8 ENST00000338946 1.412894 9.481201e-06 3.792480e-05 0.0013686598 TRUE
## 121 PHF21A ENST00000534766 1.404136 6.531195e-04 4.571837e-03 0.0007820913 TRUE
## 154 SHPRH ENST00000367505 1.234300 3.793518e-04 1.138055e-03 0.0018248797 TRUE
tail(sig.res2.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 4998 BZW2 ENST00000452975 -1.757608 0.0004382129 0.0043821293 0.0005474639 TRUE
## 5033 MST4 ENST00000394335 1.520857 0.0001331171 0.0006655855 0.0010949278 TRUE
## 5034 NAA30 ENST00000554703 -1.374659 0.0013486235 0.0040458705 0.0018248797 TRUE
## 5074 METTL17 ENST00000553389 1.479945 0.0002133583 0.0006400749 0.0018248797 TRUE
## 5155 PTPN12 ENST00000248594 1.375450 0.0006907204 0.0034536022 0.0010949278 TRUE
## 5161 PHTF2 ENST00000307305 1.261118 0.0001464911 0.0005859646 0.0013686598 TRUE
We see that with the standard t-test option we identified 201 isoforms that are called significant. These isoforms belong to 171 genes, which are
unique(unique(sig.res2.I.t[, 1]))
## [1] "UBE2S" "GTF3C3" "INO80" "PHF8" "PHF21A"
## [6] "SHPRH" "RPGR" "TOMM20" "MAP3K7" "SRSF2"
## [11] "CDKN1C" "DDX24" "NPM3" "EIF4G3" "DICER1"
## [16] "TRIT1" "FNBP4" "EOGT" "PRPF40A" "VPS54"
## [21] "TMEM128" "LPCAT1" "SATB1" "CCND3" "TMEM87A"
## [26] "TOP2B" "TNKS1BP1" "BOD1L1" "CEP152" "MOB1A"
## [31] "IBTK" "AQR" "PPP1CC" "NUP133" "NDRG1"
## [36] "PEPD" "EAPP" "CHMP1A" "FNBP1" "GLRX"
## [41] "TP53INP1" "POLG2" "PITPNB" "TRAPPC8" "PARL"
## [46] "TCEA1" "DSN1" "CYCS" "CSAD" "RPUSD1"
## [51] "IER3IP1" "BLCAP" "TRIP12" "ACRBP" "DMXL2"
## [56] "CCPG1" "BNIP2" "NEMF" "ASUN" "MYCBP2"
## [61] "ZDHHC6" "CCDC59" "RBM26" "THOC1" "ZRANB2"
## [66] "SPINK2" "KARS" "STARD7" "PCYT1A" "THOC2"
## [71] "UBXN1" "MICU2" "CENPC" "DLG1" "SACS"
## [76] "LBR" "GTF2H2" "DPM1" "MTMR6" "TPM2"
## [81] "IQCB1" "C9orf72" "ASH1L" "VPS29" "SLC2A5"
## [86] "NOL8" "ROCK1" "ANKRA2" "LEMD2" "SNX13"
## [91] "FAM120AOS" "MAX" "POLR2J2" "TMED5" "GRIPAP1"
## [96] "NCK1" "CHD3" "S100PBP" "GMNN" "MIS12"
## [101] "VPS16" "TSPAN7" "EIF3M" "RASA1" "RASA2"
## [106] "SCOC" "GCA" "HMGN2" "TPP2" "C7orf55-LUC7L2"
## [111] "OXR1" "FPGS" "RHOH" "TRH" "KNTC1"
## [116] "CDC14A" "DMTF1" "EEF1B2" "MARCH7" "SAFB"
## [121] "TBC1D23" "MTRR" "RPL15" "VMA21" "PTGER4"
## [126] "MARCH6" "TTLL5" "NPR3" "BRIX1" "RUNX2"
## [131] "UBE4B" "CPNE3" "PNN" "GPR114" "BPTF"
## [136] "RNF138" "TTC14" "TXNRD1" "KDM3A" "KIAA1033"
## [141] "TGFBR1" "PPWD1" "MED17" "SP100" "GORAB"
## [146] "SLC25A5" "RPS8" "COMMD3" "TLE4" "ATF7IP2"
## [151] "ZNF92" "SLC30A5" "CRCP" "ZMYM2" "TGIF1"
## [156] "PPFIA1" "SAP18" "KDM6A" "IKZF4" "SOCS2"
## [161] "ANKRD12" "KLHL2" "HNRNPA3" "WRNIP1" "TNRC6A"
## [166] "BZW2" "MST4" "NAA30" "METTL17" "PTPN12"
## [171] "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 2 t-test” in the same Excel file AML_uneqvar.xlsx using the following commands
write.xlsx(sig.res2.I.t, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 2 t-test", row.names = F, append = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 857 genes that are differentially expressed using function mc.step2.Waldtest
and save the object final.res1.wald
returned by the function in a file called AML_outcome_uneqvar_type2_step2_Wald.RData
under subfolder output
final.res2.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/AML_outcome_uneqvar_type2.RData", alpha=0.05, method="homl")
save(final.res1.wald, file="output/AML_outcome_uneqvar_type2_step2_Wald.RData")
This operation may take a little long time. Let’s load the object final.res2.wald
from the file and then take a look.
load(file="output/AML_outcome_uneqvar_type2_step2_Wald.RData")
dim(final.res2.wald)
## [1] 5185 6
head(final.res2.wald)
## gene isoform step2.p.value adj.p threshold decision
## 1 LINS ENST00000314742 1.894253e-01 1.894253e-01 0.005474639 FALSE
## 2 LINS ENST00000559169 4.279246e-03 1.283774e-02 0.001824880 FALSE
## 3 LINS ENST00000561073 2.984596e-02 5.969193e-02 0.002737320 FALSE
## 4 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 0.002737320 TRUE
## 5 UBE2S ENST00000264552 6.897118e-01 6.897118e-01 0.005474639 FALSE
## 6 GTF3C3 ENST00000481098 5.040540e-04 2.520270e-03 0.001094928 TRUE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res2.I.wald <- final.res2.wald[final.res2.wald$decision, ]
dim(sig.res2.I.wald)
## [1] 221 6
head(sig.res2.I.wald)
## gene isoform step2.p.value adj.p threshold decision
## 4 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 0.0027373195 TRUE
## 6 GTF3C3 ENST00000481098 5.040540e-04 2.520270e-03 0.0010949278 TRUE
## 64 INO80 ENST00000558357 4.221384e-04 1.266415e-03 0.0018248797 TRUE
## 108 PHF8 ENST00000338946 2.589063e-04 1.035625e-03 0.0013686598 TRUE
## 121 PHF21A ENST00000534766 2.837656e-05 1.986359e-04 0.0007820913 TRUE
## 123 PHF21A ENST00000531959 2.668982e-04 1.601389e-03 0.0009124398 TRUE
We see that with the Wlad-test option we identified 221 isoforms that are called significant. These isoforms belong to 182 genes, which are
unique(unique(sig.res2.I.wald[, 1]))
## [1] "UBE2S" "GTF3C3" "INO80" "PHF8" "PHF21A" "SHPRH" "RPGR"
## [8] "MED14" "TOMM20" "MAP3K7" "CDKN1C" "NPM3" "ZNF195" "SBF2"
## [15] "DICER1" "TRIT1" "CLK4" "FNBP4" "FAM208A" "ZNF800" "EOGT"
## [22] "PRPF40A" "VPS54" "TMEM128" "ACTN1" "LPCAT1" "CLDND1" "SATB1"
## [29] "CCND3" "TMEM87A" "CCNL1" "TOP2B" "TNKS1BP1" "BOD1L1" "MOB1A"
## [36] "IBTK" "AQR" "PPP1CC" "NUP133" "ZNF91" "PEPD" "EAPP"
## [43] "CHMP1A" "FNBP1" "GLRX" "PITPNB" "SMYD3" "TRAPPC8" "HERC4"
## [50] "RB1CC1" "PARL" "FYN" "TCEA1" "DSN1" "CYCS" "ZFC3H1"
## [57] "ATP5A1" "CSAD" "RPUSD1" "IER3IP1" "BLCAP" "TAF1D" "ACRBP"
## [64] "TRIM23" "DMXL2" "CCPG1" "OSBPL8" "BNIP2" "ASUN" "MYCBP2"
## [71] "ZDHHC6" "CCDC59" "RBM26" "ZRANB2" "HOPX" "SPINK2" "KARS"
## [78] "STARD7" "PCYT1A" "THOC2" "UBXN1" "MICU2" "CENPC" "DLG1"
## [85] "SIKE1" "NR2C1" "GTF2H2" "DPM1" "MTMR6" "TPM2" "IQCB1"
## [92] "C9orf72" "SRSF7" "IARS" "VPS29" "SLC2A5" "NOL8" "HSPH1"
## [99] "ANKRA2" "BAD" "SNX13" "FAM120AOS" "MAX" "POLR2J2" "TMED5"
## [106] "FRA10AC1" "U2AF2" "PRKCI" "NCK1" "S100PBP" "GMNN" "ZDHHC13"
## [113] "MIS12" "SPNS3" "VPS16" "TSPAN7" "EIF3M" "RASA1" "NUCB2"
## [120] "RASA2" "SCOC" "GCA" "HMGN2" "TPP2" "OXR1" "CTNNB1"
## [127] "RHOH" "MSL3" "TRH" "KNTC1" "NEK6" "CDC14A" "OXA1L"
## [134] "EEF1B2" "MARCH7" "TBC1D23" "MTRR" "CPSF3" "RPL15" "VMA21"
## [141] "PTGER4" "TTLL5" "NPR3" "BRIX1" "CPNE3" "PNN" "GPR114"
## [148] "PPIP5K2" "RNF138" "TTC14" "TXNRD1" "EIF4B" "KIAA1033" "MGST1"
## [155] "MED17" "SP100" "RAB2A" "GORAB" "SLC25A5" "ARID4A" "RPS8"
## [162] "COMMD3" "MRPS35" "WRN" "ATF7IP2" "ZNF92" "SLC30A5" "ZMYM2"
## [169] "TGIF1" "IKZF4" "SOCS2" "TTC7A" "CDCA7" "ANKRD12" "LUC7L3"
## [176] "SEH1L" "BZW2" "MST4" "NAA30" "METTL17" "RBM3" "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 2 Wald” in the same Excel file AML_uneqvar.xlsx using the following commands
write.xlsx(sig.res1.I.wald, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 2 Wald", row.names = F, append = T)
we makes the isoforms plots for two genes EEF1B2 and RUNX2 using function plot.gene()
.
par(mfrow=c(1, 2))
dat<-plot.gene(data=data, gene="EEF1B2", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)
dat<-plot.gene(data=data, gene="RUNX2", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)
Both genes EEF1B2 and RUNX2 passed Types 1 and 2 screening tests. We see an interaction pattern in each plot, where some of the isoforms are up-regulated in cases and down-regulated in Control while other genes are opposite. There might be isoform switching events for these two genes.
The only thing we need to do is to replace uneqvar with unstr for argument VarCov.type in functions lrt.test.mclapply and mc.step2.Waldtest and repeat above analysis
As described in our Tutorial, the first step is to run the screening test, the likelihood ratio test (LRT) by calling function lrt.test.mclapply()
. We proceed this step by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/AML_outcome_unstr_type1.RData", screen.type=1)
This was the most time-consuming part. It returned a matrix object res
which was saved in the file AML_outcome_unstr_type1.RData
under the subfolder output. Now we load the object from the file and take a look.
load(file="data/AML_isoform_outcome.RData")
load(file="output/AML_outcome_unstr_type1.RData")
The dimension of the object is
dim(res)
## [1] 8058 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## KDM5A 6.221037 5 5 0.285301376 0.47868725 12.0417387
## NINJ2 6.885362 2 2 0.031978841 0.17265499 0.5221369
## RAD52 0.391820 2 2 0.822086207 0.89383631 0.5172834
## DCUN1D2 3.563903 2 2 0.168309380 0.36231660 0.5619822
## LINS 12.289044 3 3 0.006455844 0.09648172 0.9797537
## TMEM86B 3.768324 3 3 0.287592145 0.48101316 1.5405085
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## SHANK3 0.55028089 3 3 0.90771392 0.9437296 1.6155670
## PRIM2 0.04645812 2 2 0.97703866 0.9859620 0.1500692
## SCAMP1 11.92651508 4 4 0.01790617 0.1378276 1.5928967
## VPS11 4.65820475 5 5 0.45899911 0.6364122 2.9066205
## EMG1 2.98019908 4 4 0.56114449 0.7144731 2.2928150
## ALDH3B1 1.17007513 4 4 0.88300055 0.9295491 2.1293185
Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt
), degree of freedom (df
), Number of isoforms (nIso
), p-values of LRT (p.value
), FDR (FDR
) and time used for fitting the linear mixed effects model in second (Time
). Note that there might be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.
p <- which(is.na(res[, 1]))
length(p)
## [1] 197
We see that there are 197 (2.44%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "CLK1" "GNL1" "PPP1R10" "GATAD2B" "DHX16" "PPP1R18" "MDC1" "PER1"
## [9] "PAXBP1" "CSF3R" "RPL8" "WBP2" "RPS13" "TPM3" "PRPF8" "EEF1D"
## [17] "EIF4G2" "SEC31A" "NCOR1" "UBE2D3" "UBR4" "RSRC2" "RPLP0" "PABPC1"
## [25] "GBA" "FAM189B" "CLK2" "IER3" "HLA-C" "HBS1L" "DDX39B" "BAG6"
## [33] "DDAH2" "CLIC1" "LSM2" "EHMT2" "ATF6B" "HLA-DQB1" "TAP1" "HLA-DMB"
## [41] "HLA-DMA" "HLA-DPA1" "RGL2" "TAPBP" "CSNK1D" "RBMX" "HNRNPH1" "TCF20"
## [49] "MGAT4B" "HLA-B" "DNMT1" "GNB2L1" "CCNL1" "TAPT1" "ACIN1" "RPL18"
## [57] "AGPAT1" "WDR1" "PBX2" "GPSM3" "CUTA" "SREBF1" "FLNA" "VPS52"
## [65] "FLII" "MS4A6A" "MS4A4E" "C1orf63" "DENND5A" "DDX5" "XPNPEP1" "SF1"
## [73] "RBM39" "TPT1" "FAU" "TRIP12" "TAF1D" "DDB1" "LTBP3" "NAP1L1"
## [81] "RPS2" "RNPS1" "RSL1D1" "RPL4" "HSP90AA1" "WDR26" "CSDE1" "DAGLB"
## [89] "NXF1" "MKNK1" "PKM" "NPIPB4" "CCNL2" "EFTUD2" "TCP1" "P4HB"
## [97] "HNRNPC" "CCT3" "CDC16" "DOCK2" "RPL37A" "ARPC2" "SEPT2" "HLA-A"
## [105] "HLA-E" "ABCF1" "INTS3" "TUBB" "EIF4A1" "CD44" "ANKHD1" "SAP30BP"
## [113] "SSBP1" "LDHA" "ACADVL" "RBM5" "CTNNA1" "MATR3" "MACF1" "APLP2"
## [121] "KNTC1" "MYO15B" "RNF10" "KPNB1" "VIM" "TNF" "AIF1" "PRRC2A"
## [129] "CSNK2B" "CIRBP" "MSH5" "C6orf48" "HLA-DRA" "PSMB9" "BRD2" "HLA-DPB1"
## [137] "SLC39A7" "RING1" "RPS18" "PTPRC" "IFNGR2" "SAFB" "CANX" "SRSF5"
## [145] "KDM3B" "LST1" "ATG4D" "CCNDBP1" "SKIV2L" "FOS" "RNF5" "PHF1"
## [153] "MTHFD2" "MTA1" "LSM14A" "RPL27A" "DENND3" "TCF25" "CLK3" "NOL11"
## [161] "EWSR1" "CSK" "DOCK8" "ATXN2L" "MORF4L1" "PPWD1" "UBE2I" "PSMD2"
## [169] "EIF4G1" "PCBP2" "PTMA" "NDUFV1" "INPP5D" "EIF4A2" "USP3" "SRSF11"
## [177] "AMPD2" "FBRS" "MYL12A" "GNAS" "PPM1A" "SRRM2" "ZYX" "TTC7A"
## [185] "FUS" "HIF1A" "ARMC5" "CDK16" "KLC1" "INPPL1" "PCM1" "SENP6"
## [193] "OTUB1" "ERGIC3" "PIGG" "SSFA2" "LMNA"
Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res1.unstr
sig.res1.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res1.unstr)
## [1] 93 6
We see that there are total 93 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,
head(sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## BNIP2 42.12067 9 9 3.123765e-06 0.02455591 80.0752339
## KDM5B 27.71828 4 4 1.422479e-05 0.02785048 3.4180684
## RPS8 36.06552 8 8 1.708174e-05 0.02785048 63.9374423
## ANKRD10 37.28777 9 9 2.338120e-05 0.02785048 147.6992440
## LPCAT1 21.01381 2 2 2.734694e-05 0.02785048 0.2510765
## RUNX2 30.23851 6 6 3.540942e-05 0.02785048 15.6507425
tail(sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## SLC25A5 19.90253 4 4 0.0005220184 0.04614520 1.701707
## RAD23B 17.63746 3 3 0.0005224428 0.04614520 1.001509
## GORAB 15.05812 2 2 0.0005372435 0.04643790 0.170177
## SPPL2A 17.57728 3 3 0.0005375715 0.04643790 1.283314
## GOLGA4 21.88731 5 5 0.0005500644 0.04700061 6.670376
## TOP2B 23.72432 6 6 0.0005868739 0.04960662 20.298992
We saved the results in a worksheet named “Sig. G - Type 1” in a Excel file AML_unstr.xlsx using the following commands
require(xlsx)
write.xlsx(sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ], file="AML_unstr.xlsx", sheetName = "Sig. G - Type 1", row.names = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 93 genes that are differentially expressed using function step2.Ftest()
as follows
final.res1 <- step2.Ftest(data=data, file.step1="output/AML_outcome_unstr_type1.RData", alpha=0.05, method="holm")
## Number of significant genes: 93
## Cutoff for significant isoform is 0.0005915278
Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the options for controlling the family-wise error rate required by our two step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Holm option throughout this document. The function returns the results of the confirmatory test for all the isoforms of the above 93 genes in a data.frame object called final.res1
. Let’s take a look at the results.
dim(final.res1)
## [1] 456 7
head(final.res1)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 0.0002957639 TRUE
## 2 UBE2S ENST00000264552 -1.049046 6.913375e-01 6.913375e-01 0.0005915278 FALSE
## 3 FBXO3 ENST00000526785 -1.002767 9.715154e-01 9.715154e-01 0.0005915278 FALSE
## 4 FBXO3 ENST00000530401 1.071855 3.345607e-01 6.691213e-01 0.0002957639 FALSE
## 5 FBXO3 ENST00000534136 1.197882 7.787110e-03 2.336133e-02 0.0001971759 FALSE
## 6 FBXO3 ENST00000448981 -1.319633 2.182887e-03 8.731547e-03 0.0001478819 FALSE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res1.I.t <- final.res1[final.res1$decision, ]
dim(sig.res1.I.t)
## [1] 28 7
head(sig.res1.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 0.0002957639 TRUE
## 13 PHF8 ENST00000338946 1.412894 9.481201e-06 3.792480e-05 0.0001478819 TRUE
## 29 CDKN1C ENST00000414822 -1.565790 5.414922e-05 2.165969e-04 0.0001478819 TRUE
## 32 TMX3 ENST00000566887 1.298018 7.545970e-05 1.509194e-04 0.0002957639 TRUE
## 33 TMX3 ENST00000299608 1.270793 1.538746e-04 1.538746e-04 0.0005915278 TRUE
## 62 VPS54 ENST00000409558 1.323597 6.284013e-05 2.513605e-04 0.0001478819 TRUE
tail(sig.res1.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 313 CCNJ ENST00000465148 -1.293933 1.868237e-04 3.736475e-04 2.957639e-04 TRUE
## 326 EEF1B2 ENST00000435123 -3.253894 1.033427e-05 8.267413e-05 7.394097e-05 TRUE
## 382 TXNRD1 ENST00000427956 -1.697822 9.090531e-07 8.181478e-06 6.572531e-05 TRUE
## 433 ATF7IP2 ENST00000570163 1.893709 1.698652e-05 8.493261e-05 1.183056e-04 TRUE
## 439 AKAP2 ENST00000480388 -1.429504 3.507530e-05 7.015059e-05 2.957639e-04 TRUE
## 455 PHTF2 ENST00000307305 1.261118 1.464911e-04 5.859646e-04 1.478819e-04 TRUE
We see that with the standard t-test option we identified 28 isoforms that are called significant. These isoforms belong to 25 genes, which are
unique(unique(sig.res1.I.t[, 1]))
## [1] "UBE2S" "PHF8" "CDKN1C" "TMX3" "VPS54" "LPCAT1" "AZI2" "PEPD" "DSN1"
## [10] "CSAD" "BLCAP" "SPPL2A" "BNIP2" "ZDHHC6" "PTPN22" "CENPC" "POLR2J2" "GMNN"
## [19] "GCA" "CCNJ" "EEF1B2" "TXNRD1" "ATF7IP2" "AKAP2" "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 1 t-test” in the same Excel file AML_unstr.xlsx using the following commands
write.xlsx(sig.res1.I.t, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 1 t-test", row.names = F, append = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 93 genes that are differentially expressed using function mc.step2.Waldtest
and save the object final.res1.wald.unstr
returned by the function in a file called AML_outcome_unstr_type1_step2_Wald.RData
under subfolder output
final.res1.wald.unstr <- mc.step2.Waldtest(data=data, VarCov.type="unstr", nCore = NULL, file.step1="output/AML_outcome_unstr_type1.RData", alpha=0.05, method="homl")
save(final.res1.wald.unstr, file="output/AML_outcome_unstr_type1_step2_Wald.RData")
This operation may take a little long time. Let’s load the object final.res1.wald.unstr
from the file and then take a look.
load(file="output/AML_outcome_unstr_type1_step2_Wald.RData")
dim(final.res1.wald.unstr)
## [1] 456 6
head(final.res1.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 0.0002957639 TRUE
## 2 UBE2S ENST00000264552 6.897118e-01 6.897118e-01 0.0005915278 FALSE
## 3 FBXO3 ENST00000526785 9.691171e-01 9.691171e-01 0.0005915278 FALSE
## 4 FBXO3 ENST00000530401 3.697829e-01 7.395658e-01 0.0002957639 FALSE
## 5 FBXO3 ENST00000534136 6.467948e-03 1.940384e-02 0.0001971759 FALSE
## 6 FBXO3 ENST00000448981 2.644229e-03 1.057692e-02 0.0001478819 FALSE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res1.I.wald.unstr <- final.res1.wald.unstr[final.res1.wald.unstr$decision, ]
dim(sig.res1.I.wald.unstr)
## [1] 34 6
head(sig.res1.I.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 0.0002957639 TRUE
## 32 TMX3 ENST00000566887 5.166484e-05 1.033297e-04 0.0002957639 TRUE
## 33 TMX3 ENST00000299608 1.112203e-04 1.112203e-04 0.0005915278 TRUE
## 60 VPS54 ENST00000416400 4.921435e-05 1.968574e-04 0.0001478819 TRUE
## 67 LPCAT1 ENST00000503252 3.445858e-06 6.891715e-06 0.0002957639 TRUE
## 72 AZI2 ENST00000476174 1.000483e-04 4.001932e-04 0.0001478819 TRUE
We see that with the Wlad-test option we identified 34 isoforms that are called significant. These isoforms belong to 29 genes, which are
unique(unique(sig.res1.I.wald.unstr[, 1]))
## [1] "UBE2S" "TMX3" "VPS54" "LPCAT1" "AZI2" "TOP2B" "PEPD" "FNBP1" "DSN1"
## [10] "CSAD" "BLCAP" "SPPL2A" "BNIP2" "ZDHHC6" "PTPN22" "PCYT1A" "CENPC" "EDEM3"
## [19] "POLR2J2" "GMNN" "GCA" "CCNJ" "EEF1B2" "NPR3" "TXNRD1" "ATF7IP2" "AKAP2"
## [28] "SOCS2" "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 1 Wald” in the same Excel file AML_unstr.xlsx using the following commands
write.xlsx(sig.res1.I.wald.unstr, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 1 Wald", row.names = F, append = T)
The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type
of function lrt.test.mclapply
needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/AML_outcome_unstr_type2.RData", screen.type=2)
Let’s take a look at the results
load(file="output/AML_outcome_unstr_type2.RData")
The dimension of the object is
dim(res)
## [1] 8058 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## KDM5A 6.1870684 4 5 0.185606678 0.38552850 14.2327073
## NINJ2 1.2890201 1 2 0.256228621 0.45805453 0.5930238
## RAD52 0.1514221 1 2 0.697180079 0.81332505 0.6523602
## DCUN1D2 2.4249554 1 2 0.119416641 0.30670736 0.7199266
## LINS 12.0569108 2 3 0.002409212 0.06680779 1.4953113
## TMEM86B 3.5274635 2 3 0.171404036 0.36748128 1.6962082
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## SHANK3 0.224051413 2 3 0.89402127 0.9393367 1.6598437
## PRIM2 0.004453358 1 2 0.94679388 0.9711213 0.3595948
## SCAMP1 10.832604452 3 4 0.01266636 0.1154942 1.7084463
## VPS11 4.406684725 4 5 0.35375609 0.5516434 6.4329040
## EMG1 2.569740698 3 4 0.46281911 0.6431311 3.5995891
## ALDH3B1 0.632919850 3 4 0.88885648 0.9362399 2.0013216
Let’s check how many genes with missing results.
p <- which(is.na(res[, 1]))
length(p)
## [1] 194
We see that there are 194 (2.41%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "CLK1" "RNF44" "GNL1" "PPP1R10" "GATAD2B" "DHX16" "PPP1R18" "MDC1"
## [9] "PER1" "PAXBP1" "CSF3R" "RPL8" "WBP2" "RPS13" "TPM3" "PRPF8"
## [17] "EEF1D" "EIF4G2" "SEC31A" "NCOR1" "UBE2D3" "RSRC2" "RPLP0" "PABPC1"
## [25] "GBA" "FAM189B" "CLK2" "IER3" "HLA-C" "HBS1L" "DDX39B" "BAG6"
## [33] "DDAH2" "CLIC1" "LSM2" "EHMT2" "ATF6B" "HLA-DQB1" "TAP1" "HLA-DMB"
## [41] "HLA-DMA" "HLA-DPA1" "RGL2" "TAPBP" "CSNK1D" "RBMX" "HNRNPH1" "TCF20"
## [49] "MGAT4B" "HLA-B" "DNMT1" "GNB2L1" "CCNL1" "TAPT1" "ACIN1" "RPL18"
## [57] "AGPAT1" "WDR1" "PBX2" "GPSM3" "ZNF468" "CUTA" "SREBF1" "FLNA"
## [65] "VPS52" "FLII" "MS4A6A" "MS4A4E" "C1orf63" "DENND5A" "DDX5" "XPNPEP1"
## [73] "SF1" "RBM39" "TPT1" "FAU" "TRIP12" "TAF1D" "DDB1" "LTBP3"
## [81] "NAP1L1" "RPS2" "RNPS1" "RSL1D1" "RPL4" "HSP90AA1" "CSDE1" "DAGLB"
## [89] "NXF1" "MKNK1" "LMBR1L" "PKM" "NPIPB4" "IKBIP" "CCNL2" "EFTUD2"
## [97] "TCP1" "P4HB" "HNRNPC" "CCT3" "CDC16" "DOCK2" "ARPC2" "SEPT2"
## [105] "HLA-A" "HLA-E" "ABCF1" "INTS3" "TUBB" "EIF4A1" "CD44" "ANKHD1"
## [113] "SAP30BP" "SSBP1" "LDHA" "ACADVL" "RBM5" "CTNNA1" "MATR3" "MACF1"
## [121] "APLP2" "KNTC1" "MYO15B" "RNF10" "KPNB1" "VIM" "TNF" "AIF1"
## [129] "PRRC2A" "CSNK2B" "CIRBP" "MSH5" "C6orf48" "HLA-DRA" "PSMB9" "BRD2"
## [137] "HLA-DPB1" "SLC39A7" "RING1" "RPS18" "PTPRC" "IFNGR2" "SAFB" "CANX"
## [145] "SRSF5" "KDM3B" "LST1" "ATG4D" "CCNDBP1" "SKIV2L" "FOS" "RNF5"
## [153] "PHF1" "MTHFD2" "MTA1" "LSM14A" "RPL27A" "DENND3" "TCF25" "CLK3"
## [161] "NOL11" "EWSR1" "CSK" "DOCK8" "ATXN2L" "GAPDH" "ADRBK1" "UBE2I"
## [169] "PSMD2" "EIF4G1" "PCBP2" "PTMA" "NDUFV1" "INPP5D" "EIF4A2" "USP3"
## [177] "SRSF11" "AMPD2" "FBRS" "MYL12A" "GNAS" "PPM1A" "SRRM2" "ZYX"
## [185] "TTC7A" "HIF1A" "ARMC5" "KLC1" "INPPL1" "PCM1" "ERGIC3" "PIGG"
## [193] "SSFA2" "LMNA"
Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res2.unstr
sig.res2.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res2.unstr)
## [1] 102 6
head(sig.res2.unstr)
## lrt df nIso p.value FDR Time
## UBE2S 17.55095 1 2 2.797109e-05 0.02086329 0.6880627
## INO80 14.83799 2 3 5.997514e-04 0.04879397 1.3273273
## FBXO3 22.53701 3 4 5.042727e-05 0.02643313 3.3897028
## RPGR 15.15625 2 3 5.115181e-04 0.04749958 1.6220014
## MED14 16.12015 2 3 3.159028e-04 0.04516836 1.5281327
## ANTXR2 22.23701 4 5 1.797810e-04 0.03850275 17.0965638
We see that there are total 102 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,
head(sig.res2.unstr[order(sig.res2.unstr[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## BNIP2 38.21052 8 9 6.883128e-06 0.01944874 56.656606
## KDM5B 26.14703 3 4 8.884721e-06 0.01944874 5.739940
## ANKRD10 37.06555 8 9 1.119377e-05 0.01944874 220.359912
## RPS8 34.58682 7 8 1.337205e-05 0.01944874 136.565551
## RPL17-C18orf32 22.24619 2 3 1.476729e-05 0.01944874 1.486135
## RUNX2 29.98654 5 6 1.483882e-05 0.01944874 19.768445
tail(sig.res2.unstr[order(sig.res2.unstr[, "p.value"]), ])
## lrt df nIso p.value FDR Time
## IKZF5 14.83098 2 3 0.0006018584 0.04879397 1.6080716
## GTF2H2 11.72566 1 2 0.0006164424 0.04946636 0.2989671
## ASUN 14.73104 2 3 0.0006326953 0.04997282 0.7070322
## KIAA1109 23.52474 6 7 0.0006384974 0.04997282 22.3730366
## PRKCSH 27.22480 8 9 0.0006464838 0.04997282 216.3108540
## DICER1 23.48910 6 7 0.0006481724 0.04997282 39.3696904
We saved the results in a worksheet named “Sig. G - Type 2” in a Excel file AML_unstr.xlsx using the following commands
write.xlsx(sig.res2.unstr[order(sig.res2.unstr[, "p.value"]), ], file="AML_unstr.xlsx", sheetName = "Sig. G - Type 2", row.names = T, append = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 102 genes that are differentially expressed using function step2.Ftest()
as follows
final.res2 <- step2.Ftest(data=data, file.step1="output/AML_outcome_unstr_type2.RData", alpha=0.05, method="holm")
## Number of significant genes: 102
## Cutoff for significant isoform is 0.0006485249
The function returns the results of the confirmatory test for all the isoforms of the above 102 genes in a data.frame object called final.res2
. Let’s take a look at the results.
dim(final.res2)
## [1] 475 7
head(final.res2)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 0.0003242625 TRUE
## 2 UBE2S ENST00000264552 -1.049046 6.913375e-01 6.913375e-01 0.0006485249 FALSE
## 3 INO80 ENST00000361937 1.005198 9.484275e-01 9.484275e-01 0.0006485249 FALSE
## 4 INO80 ENST00000558357 -1.532673 4.038737e-04 1.211621e-03 0.0002161750 FALSE
## 5 INO80 ENST00000401393 1.095610 2.441340e-01 4.882680e-01 0.0003242625 FALSE
## 6 FBXO3 ENST00000526785 -1.002767 9.715154e-01 9.715154e-01 0.0006485249 FALSE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res2.I.t <- final.res2[final.res2$decision, ]
dim(sig.res2.I.t)
## [1] 14 7
head(sig.res2.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 -2.427634 2.705008e-05 5.410016e-05 3.242625e-04 TRUE
## 25 CDKN1C ENST00000414822 -1.565790 5.414922e-05 2.165969e-04 1.621312e-04 TRUE
## 52 LPCAT1 ENST00000503252 1.641507 6.297745e-06 1.259549e-05 3.242625e-04 TRUE
## 142 DSN1 ENST00000373740 1.487508 4.393664e-05 1.318099e-04 2.161750e-04 TRUE
## 165 BLCAP ENST00000397137 1.798735 1.814551e-05 3.629101e-05 3.242625e-04 TRUE
## 172 BNIP2 ENST00000439052 2.381825 1.504972e-05 1.354475e-04 7.205832e-05 TRUE
tail(sig.res2.I.t)
## gene isoform FC step2.p.value adj.p threshold decision
## 254 POLR2J2 ENST00000591000 -2.085772 2.400970e-05 7.202909e-05 2.161750e-04 TRUE
## 276 GCA ENST00000487445 1.545836 1.254005e-04 3.762016e-04 2.161750e-04 TRUE
## 321 EEF1B2 ENST00000435123 -3.253894 1.033427e-05 8.267413e-05 8.106562e-05 TRUE
## 406 TXNRD1 ENST00000427956 -1.697822 9.090531e-07 8.181478e-06 7.205832e-05 TRUE
## 438 ATF7IP2 ENST00000570163 1.893709 1.698652e-05 8.493261e-05 1.297050e-04 TRUE
## 474 PHTF2 ENST00000307305 1.261118 1.464911e-04 5.859646e-04 1.621312e-04 TRUE
We see that with the standard t-test option we identified 14 isoforms that are called significant. These isoforms belong to 14 genes, which are
unique(unique(sig.res2.I.t[, 1]))
## [1] "UBE2S" "CDKN1C" "LPCAT1" "DSN1" "BLCAP" "BNIP2" "ZDHHC6" "CENPC" "POLR2J2"
## [10] "GCA" "EEF1B2" "TXNRD1" "ATF7IP2" "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 2 t-test” in the same Excel file AML_unstr.xlsx using the following commands
write.xlsx(sig.res2.I.t, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 2 t-test", row.names = F, append = T)
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 102 genes that are differentially expressed using function mc.step2.Waldtest
and save the object final.res2.wald.unstr
returned by the function in a file called AML_outcome_unstr_type2_step2_Wald.RData
under subfolder output
final.res2.wald.unstr <- mc.step2.Waldtest(data=data, VarCov.type="unstr", nCore = NULL, file.step1="output/AML_outcome_unstr_type2.RData", alpha=0.05, method="homl")
save(final.res2.wald.unstr, file="output/AML_outcome_unstr_type2_step2_Wald.RData")
This operation may take a little long time. Let’s load the object final.res2.wald.unstr
from the file and then take a look.
load(file="output/AML_outcome_unstr_type2_step2_Wald.RData")
dim(final.res2.wald.unstr)
## [1] 475 6
head(final.res2.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 0.0003242625 TRUE
## 2 UBE2S ENST00000264552 6.897118e-01 6.897118e-01 0.0006485249 FALSE
## 3 INO80 ENST00000361937 9.481698e-01 9.481698e-01 0.0006485249 FALSE
## 4 INO80 ENST00000558357 4.221384e-04 1.266415e-03 0.0002161750 FALSE
## 5 INO80 ENST00000401393 2.324067e-01 4.648134e-01 0.0003242625 FALSE
## 6 FBXO3 ENST00000526785 9.691171e-01 9.691171e-01 0.0006485249 FALSE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE.
The isoforms that are called significant are as follows
sig.res2.I.wald.unstr <- final.res2.wald.unstr[final.res2.wald.unstr$decision, ]
dim(sig.res2.I.wald.unstr)
## [1] 23 6
head(sig.res2.I.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 1 UBE2S ENST00000589978 1.699557e-05 3.399114e-05 3.242625e-04 TRUE
## 34 DICER1 ENST00000393063 5.474378e-05 3.832065e-04 9.264642e-05 TRUE
## 52 LPCAT1 ENST00000503252 3.445858e-06 6.891715e-06 3.242625e-04 TRUE
## 75 TOP2B ENST00000542520 1.703146e-05 1.021888e-04 1.080875e-04 TRUE
## 119 FNBP1 ENST00000478129 3.703202e-05 2.221921e-04 1.080875e-04 TRUE
## 142 DSN1 ENST00000373740 5.528330e-05 1.658499e-04 2.161750e-04 TRUE
We see that with the Wlad-test option we identified 23 isoforms that are called significant. These isoforms belong to 21 genes, which are
unique(unique(sig.res2.I.wald.unstr[, 1]))
## [1] "UBE2S" "DICER1" "LPCAT1" "TOP2B" "FNBP1" "DSN1" "BLCAP" "BNIP2" "ZDHHC6"
## [10] "ZRANB2" "PCYT1A" "CENPC" "POLR2J2" "GCA" "EEF1B2" "NPR3" "TXNRD1" "ATF7IP2"
## [19] "ZMYM2" "SOCS2" "PHTF2"
We saved the results in a worksheet named “Sig. I - Type 2 Wald” in the same Excel file AML_unstr.xlsx using the following commands
write.xlsx(sig.res1.I.wald.unstr, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 2 Wald", row.names = F, append = T)