\(*\) Correspondence should be addressed to Huining Kang (HuKang@salud.unm.edu).

Introduction

We recently developed a new statistical method for analyzing differentially expressed/spliced transcript isoforms, which appropriately account for the dependence between isoforms and multiple testing corrections for the multi-dimensional structure of at both the gene-and isoform-level. We proposed a linear mixed effects model-based approach for analyzing the complex alternative RNA splicing regulation patterns detected by whole-transcriptome RNA-Sequencing technologies. This approach thoroughly characterizes and differentiates three types of genes related to alternative RNA splicing events with distinct differential expression/splicing patterns. We applied the concept of appropriately controlling for the gene-level overall false discovery rate (OFDR) in this multi-dimensional alternative RNA splicing analysis utilizing a two-step hierarchical hypothesis testing framework. In the initial screening test we identify genes that have differentially expressed or spliced isoforms; in the subsequent confirmatory testing stage we examine only the isoforms of genes that have passed the screening tests. The methodology is described in detail in the manuscript titled Two-step mixed model approach to analyzing differential alternative RNA splicing submitted to PLOS ONE. We have provided an R program file (Tools.R) that included all the functions for performing the analysis with proposed method. The R program file can be downloaded here. We also wrote a Tutorial to show step-by-step how to use these functions in R to perform the analyses that we have proposed in our manuscript.

In the manuscript we have reported an application of our method to an RNA-sequencing study of Pediatric Acute Myeloid Leukemia (AML). We applied our method to identify differentially expressed/spliced isoforms that are associated survival outcome in a large cohort of pediatric AML from the NCI/COG TARGET-AML initiative (Farrar, et. al. 2016). The RNA-seq data at exon, isoform and gene levels as well as the clinical data for the AML patients are downloaded from NCI/COG TARGET website at https://ocg.cancer.gov/programs/target. We analyzed a subset of 234 patients who have clinical outcomes with 81 cases (relapsed within 3 years) and 153 controls (CCR for at least 3 years). After removing those with low abundances and those that are not in the coding region, we analyzed a total of 35397 isoforms representing 8058 genes. We converted the data set to an R list object named data in the format described in the Tutorial. This data set is available per request. We adhere to the principles of transparency in order to assure that our results are reproducible and present in this document the detailed steps and R codes for the analysis. The major steps of the analysis for this application are the same as we presented in our Tutorial except that the dataset is different. We strongly encourage readers to read the Tutorial before reading this document.

Performing Analyses Assuming Compound Symmetry with Unequal Variances for Covariance Structure

As suggested in our Tutorial we created a folder called AML as our working directory and under this folder we created two subfolders data and output. Aforementioned R object data is saved in a file named acc.RData in the subfolder data.

In the following (our primary analysis for AML data) we assumed the compound symmetry with unequal variances as the covariance structure and we set the statistical significance at OFDR = 0.05.

Let’s take a look at the data.

# load data to the memory
load(file="data/AML_isoform_outcome.RData")
# list the elements of object data
str(data)
## List of 6
##  $ x           : num [1:35397, 1:234] 3.808 2.794 1.887 3.566 0.253 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35397] "ENST00000399788" "ENST00000540838" "ENST00000544777" "ENST00000536014" ...
##   .. ..$ : chr [1:234] "PABHET-03A-02R" "PABHKY-03A-02R" "PABLDZ-09A-04R" "PACDZR-09A-02R" ...
##  $ y           : num [1:234] 1 1 2 2 2 2 2 2 2 1 ...
##  $ gene        : chr [1:35397] "KDM5A" "KDM5A" "KDM5A" "KDM5A" ...
##  $ isoform     : chr [1:35397] "ENST00000399788" "ENST00000540838" "ENST00000544777" "ENST00000536014" ...
##  $ isoform_name: chr [1:35397] "KDM5A-001" "KDM5A-003" "KDM5A-008" "KDM5A-010" ...
##  $ y.lab       : chr [1:2] "Control" "Case"

The total number of isoforms is 35397 and the total number of genes is as follows

length(unique(data$gene))
## [1] 8058

The function find.nL() defined in file Tools.R can be used to calculate the number of isoforms for each gene. It returns a numeric vector with each element being the number of isoforms of one gene and the names of the elements being gene names.

source("Tools.R")
nL <- find.nL(data=data)

The numbers of isoforms for the first 5 genes are

head(nL, n=5)
##   KDM5A   NINJ2   RAD52 DCUN1D2    LINS 
##       5       2       2       2       3

The numbers of isoforms for the last 5 genes are

tail(nL, n=5)
##   PRIM2  SCAMP1   VPS11    EMG1 ALDH3B1 
##       2       4       5       4       4

We can summarize all the numbers as follows:

sum.nL <- table(nL)
sum.nL
## nL
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21 
## 2350 1684 1233  837  581  411  300  214  120   99   62   40   32   19   10   15   10    7   10    5 
##   22   23   25   27   28   31   33   34   35   44 
##    4    3    2    4    1    1    1    1    1    1

The result indicates that there are 2350 genes that have 2 isoforms for each, 1684 genes that have 3 isoforms for each, and so on so forth.

Analysis based on Type 1 screening test

Performing step 1 test

As described in our Tutorial, the first step is to run the screening test, the likelihood ratio test (LRT) by calling function lrt.test.mclapply(). We proceed this step by submitting following command:

res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", nCore=NULL, outfile="output/AML_outcome_uneqvar_type1.RData", screen.type=1)

This was the most time-consuming part. It returned a matrix object res which was saved in the file AML_outcome_uneqvar_type1.RData under the subfolder output. Now we load the object from the file and take a look.

load(file="output/AML_outcome_uneqvar_type1.RData")

The dimension of the object is

dim(res)
## [1] 8058    6

The rows on the top are

head(res)
##               lrt df nIso     p.value        FDR      Time
## KDM5A    6.948314  5    5 0.224512694 0.43585457 1.8104832
## NINJ2    6.885362  2    2 0.031978841 0.14273506 0.6409948
## RAD52    0.391820  2    2 0.822086207 0.90745762 0.5917048
## DCUN1D2  3.563903  2    2 0.168309379 0.36916044 0.6450491
## LINS    15.884635  3    3 0.001197441 0.02049538 0.8307495
## TMEM86B  3.572787  3    3 0.311443836 0.52378277 0.8714731

and on the bottom are

tail(res)
##                lrt df nIso    p.value        FDR      Time
## SHANK3   0.4641871  3    3 0.92668589 0.96262112 0.3201759
## PRIM2    0.0439027  2    2 0.97828783 0.98826907 0.3466947
## SCAMP1  12.3450460  4    4 0.01496164 0.09332533 0.9187062
## VPS11    3.8718428  5    5 0.56801122 0.74272356 1.4226265
## EMG1     2.6121073  4    4 0.62467988 0.78342889 0.7937691
## ALDH3B1  0.7047276  4    4 0.95074464 0.97582004 0.4763246

Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt), degree of freedom (df), Number of isoforms (nIso), p-values of LRT (p.value), FDR (FDR) and time used for fitting the linear mixed effects model in second (Time). Note that there might be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.

p <- which(is.na(res[, 1]))
length(p)
## [1] 236

We see that there are 236 (2.93%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]
##   [1] "HDLBP"    "GNL1"     "PPP1R10"  "GATAD2B"  "PAXBP1"   "ZMYM6"    "SRP68"    "H3F3B"   
##   [9] "ZC3HC1"   "S100A6"   "WBP2"     "RGPD6"    "CXorf65"  "TPM3"     "HK3"      "CDK2AP1" 
##  [17] "EEF1D"    "EPRS"     "PLAGL1"   "IDH2"     "SDHB"     "UBE2D3"   "PI4KB"    "MCTP1"   
##  [25] "CLINT1"   "RPLP0"    "NDUFB1"   "GBA"      "CLK2"     "IER3"     "DDX39B"   "BAG6"    
##  [33] "DDAH2"    "CLIC1"    "EHMT2"    "MRPS25"   "ATF6B"    "TAP2"     "TAP1"     "HLA-DMA" 
##  [41] "HLA-DPA1" "AES"      "TAPBP"    "USP34"    "HNRNPH1"  "C19orf70" "TCF20"    "ADAM8"   
##  [49] "AKT2"     "HLA-B"    "BCAP31"   "DNMT1"    "CDKN2D"   "GNB2L1"   "ACIN1"    "KDM5C"   
##  [57] "ARHGAP4"  "NGLY1"    "AGPAT1"   "GPSM3"    "CEP152"   "ZNF468"   "CUTA"     "NDUFA6"  
##  [65] "FLNA"     "VPS52"    "AKT1"     "MS4A4E"   "LUZP1"    "CHFR"     "YWHAE"    "SETD4"   
##  [73] "C4orf36"  "DDX5"     "RASGRP2"  "B4GALT6"  "RBM39"    "TPT1"     "ETFA"     "NREP"    
##  [81] "TSPAN3"   "REEP5"    "CTSH"     "RPL17"    "CPVL"     "NAP1L1"   "LDHB"     "ATP5G2"  
##  [89] "TMEM66"   "FIBP"     "C12orf50" "ATP6V0D1" "RSL1D1"   "BET1"     "FEZ2"     "HSP90AA1"
##  [97] "MCFD2"    "JUP"      "CRBN"     "SMG1"     "NDUFB11"  "TLE3"     "TLN1"     "PKM"     
## [105] "TUBA1A"   "SOS1"     "AIFM1"    "IKBIP"    "CPSF3L"   "PRKRA"    "CCNL2"    "ATP5I"   
## [113] "P4HB"     "ACSL1"    "CCT3"     "KMT2C"    "MEF2A"    "CDC16"    "RARS"     "IGFBP2"  
## [121] "HLA-A"    "NUDT16"   "HLA-E"    "ABCF1"    "TUBB"     "EIF4A1"   "CD44"     "SAP30BP" 
## [129] "LDHA"     "UBR7"     "TSPAN17"  "ARL6IP4"  "MATR3"    "PTPRA"    "DNM1"     "RWDD1"   
## [137] "PSMD9"    "MRPS18C"  "IL1RN"    "MYO15B"   "RNF10"    "KPNB1"    "RPS12"    "TNF"     
## [145] "AIF1"     "PRRC2A"   "HMHA1"    "CSNK2B"   "CIRBP"    "MSH5"     "MYB"      "C6orf48" 
## [153] "PPT2"     "ZNF185"   "PSMB9"    "BRD2"     "HLA-DPB1" "RING1"    "RPS18"    "IFNGR2"  
## [161] "ADD1"     "LST1"     "SH3TC1"   "STIM2"    "SKIV2L"   "SERF2"    "RNF5"     "TCN2"    
## [169] "PHF1"     "CTNND1"   "ABCC10"   "MPL"      "MTHFD2"   "PLD4"     "ILF3"     "SMARCA4" 
## [177] "RAD23A"   "ZSWIM4"   "RRM1"     "TPM4"     "CDC42"    "URI1"     "ZNF507"   "ILK"     
## [185] "SRRM1"    "SPINT2"   "RAD54L2"  "TEX264"   "CLK3"     "NOL11"    "SNX33"    "HNRNPH3" 
## [193] "RCN2"     "ANP32B"   "ATXN2L"   "CALCOCO2" "TAX1BP1"  "PRR13"    "MIA3"     "ZDHHC17" 
## [201] "POLI"     "PSMD7"    "INPP5D"   "UBN1"     "EIF4A2"   "LGALS9"   "RBM48"    "MYCBPAP" 
## [209] "BANF1"    "MYL12A"   "CNTLN"    "LRRK2"    "CREM"     "GNAS"     "PPM1A"    "ZYX"     
## [217] "MRPL48"   "FUS"      "HIF1A"    "EIF5"     "NAPG"     "ARMC5"    "C1orf162" "OGFOD1"  
## [225] "NPIPB5"   "HERPUD1"  "TRIM56"   "NEO1"     "SETD5"    "APEX1"    "PIGG"     "SNRPC"   
## [233] "ATAD3A"   "CHPT1"    "SSFA2"    "G3BP1"

Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res1

sig.res1 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res1)
## [1] 782   6

We see that there are total 782 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,

head(sig.res1[order(sig.res1[, "p.value"]), ])
##             lrt df nIso      p.value          FDR      Time
## TARDBP 80.35772  8    8 4.141857e-14 3.239761e-10  5.152162
## SP100  86.48541 14   14 1.742530e-12 6.815035e-09 22.984989
## MYCBP2 63.25134  7    7 3.374746e-11 8.799088e-08  3.416999
## NCK1   57.59970  7    7 4.542234e-10 8.882339e-07  1.909047
## SPAG9  66.10922 11   11 6.661781e-10 1.042169e-06 12.294924
## UBXN1  69.51061 13   13 9.876279e-10 1.287538e-06 18.988426
tail(sig.res1[order(sig.res1[, "p.value"]), ])
##             lrt df nIso     p.value        FDR      Time
## MICB   10.62510  2    2 0.004929351 0.04952175 0.2858193
## SPN    14.89070  4    4 0.004933361 0.04952175 1.0132439
## CYB5R3 18.58010  6    6 0.004934811 0.04952175 3.0487463
## HMGXB4 14.88846  4    4 0.004938246 0.04952175 1.1805918
## EIF5A  21.98428  8    8 0.004945071 0.04952669 5.0797310
## PRRC1  14.86645  4    4 0.004986382 0.04987657 1.2450681

We saved the results in a worksheet named “Sig. G - Type 1” in a Excel file AML_uneqvar.xlsx using the following commands

require(xlsx)
## Loading required package: xlsx
write.xlsx(sig.res1[order(sig.res1[, "p.value"]), ], file="AML_uneqvar.xlsx", sheetName = "Sig. G - Type 1", row.names = T)

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 782 genes that are differentially expressed using function step2.Ftest() as follows

final.res1 <- step2.Ftest(data=data, file.step1="output/AML_outcome_uneqvar_type1.RData", alpha=0.05, method="holm")
## Number of significant genes: 782 
## Cutoff for significant isoform is 0.004998722

Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the options for controlling the family-wise error rate required by our two step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Holm option throughout this document. The function returns the results of the confirmatory test for all the isoforms of the above 782 genes in a data.frame object called final.res1. Let’s take a look at the results.

dim(final.res1)
## [1] 4792    7
head(final.res1)
##     gene         isoform        FC step2.p.value        adj.p   threshold decision
## 1   LINS ENST00000314742 -1.059541  1.846141e-01 1.846141e-01 0.004998722    FALSE
## 2   LINS ENST00000559169  1.211968  4.744510e-03 1.423353e-02 0.001666241    FALSE
## 3   LINS ENST00000561073  1.208533  3.586204e-02 7.172409e-02 0.002499361    FALSE
## 4  UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 0.002499361     TRUE
## 5  UBE2S ENST00000264552 -1.049046  6.913375e-01 6.913375e-01 0.004998722    FALSE
## 6 TOPBP1 ENST00000505804  1.535638  1.030649e-04 3.091948e-04 0.001666241     TRUE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and decison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res1.I.t <- final.res1[final.res1$decision, ]
dim(sig.res1.I.t)
## [1] 265   7
head(sig.res1.I.t)
##        gene         isoform        FC step2.p.value        adj.p   threshold decision
## 4     UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 0.002499361     TRUE
## 6    TOPBP1 ENST00000505804  1.535638  1.030649e-04 3.091948e-04 0.001666241     TRUE
## 17   PDCD10 ENST00000492396  1.444897  7.527957e-04 3.011183e-03 0.001249680     TRUE
## 55    INO80 ENST00000558357 -1.532673  4.038737e-04 1.211621e-03 0.001666241     TRUE
## 97  CAMSAP1 ENST00000409386 -1.350315  1.149588e-03 3.448764e-03 0.001666241     TRUE
## 101    PHF8 ENST00000338946  1.412894  9.481201e-06 3.792480e-05 0.001249680     TRUE
tail(sig.res1.I.t)
##         gene         isoform        FC step2.p.value        adj.p    threshold decision
## 4619    BZW2 ENST00000452975 -1.757608  0.0004382129 0.0043821293 0.0004998722     TRUE
## 4645    MST4 ENST00000394335  1.520857  0.0001331171 0.0006655855 0.0009997443     TRUE
## 4646   NAA30 ENST00000554703 -1.374659  0.0013486235 0.0040458705 0.0016662405     TRUE
## 4686 METTL17 ENST00000553389  1.479945  0.0002133583 0.0006400749 0.0016662405     TRUE
## 4767  PTPN12 ENST00000248594  1.375450  0.0006907204 0.0034536022 0.0009997443     TRUE
## 4773   PHTF2 ENST00000307305  1.261118  0.0001464911 0.0005859646 0.0012496804     TRUE

We see that with the standard t-test option we identified 265 isoforms that are called significant. These isoforms belong to 214 genes, which are

unique(unique(sig.res1.I.t[, 1]))
##   [1] "UBE2S"          "TOPBP1"         "PDCD10"         "INO80"          "CAMSAP1"       
##   [6] "PHF8"           "SYNJ1"          "PHF21A"         "SHPRH"          "RPGR"          
##  [11] "TOMM20"         "MAP3K7"         "SRSF2"          "NHSL1"          "CDKN1C"        
##  [16] "IFIH1"          "NPM3"           "EIF4G3"         "DICER1"         "PLSCR1"        
##  [21] "TRIT1"          "TMX3"           "MEF2C"          "FNBP4"          "RASSF2"        
##  [26] "MAN1A1"         "ORMDL1"         "EOGT"           "FAM188A"        "PRPF40A"       
##  [31] "VPS54"          "TMEM128"        "LPCAT1"         "SH3BP5"         "SIRT6"         
##  [36] "SATB1"          "CCND3"          "AZI2"           "MAP3K5"         "CYB5R1"        
##  [41] "TMEM87A"        "TOP2B"          "DEK"            "TNKS1BP1"       "SEPP1"         
##  [46] "BOD1L1"         "SEL1L3"         "MOB1A"          "IBTK"           "GFM2"          
##  [51] "AQR"            "TPD52"          "PPP1CC"         "NUP133"         "NDRG1"         
##  [56] "PEPD"           "EAPP"           "PPP2R3C"        "CHMP1A"         "FNBP1"         
##  [61] "GLRX"           "POLG2"          "PITPNB"         "TRAPPC8"        "SLC39A11"      
##  [66] "PARL"           "TCEA1"          "DSN1"           "CYCS"           "CSAD"          
##  [71] "RPUSD1"         "IER3IP1"        "BLCAP"          "TRIP12"         "ACRBP"         
##  [76] "SPPL2A"         "DMXL2"          "MRE11A"         "CCPG1"          "RGS10"         
##  [81] "ARHGAP12"       "CCDC82"         "BNIP2"          "ASUN"           "MYCBP2"        
##  [86] "ZDHHC6"         "CCDC59"         "TMEM123"        "DCTN3"          "RBM26"         
##  [91] "THOC1"          "PTPN22"         "ZRANB2"         "ARHGAP21"       "SPINK2"        
##  [96] "KARS"           "PCYT1A"         "UBXN1"          "MICU2"          "CENPC"         
## [101] "DLG1"           "CEBPZ"          "PIGK"           "LBR"            "SLCO5A1"       
## [106] "EDEM3"          "GTF2H2"         "DPM1"           "MTMR6"          "TPM2"          
## [111] "IQCB1"          "C9orf72"        "ASH1L"          "SLC2A5"         "NOL8"          
## [116] "LAMA5"          "ROCK1"          "ANKRA2"         "LEMD2"          "SNX13"         
## [121] "FAM120AOS"      "MAX"            "POLR2J2"        "TMED5"          "GSAP"          
## [126] "GRIPAP1"        "CWC15"          "NCK1"           "CHD3"           "S100PBP"       
## [131] "GMNN"           "MIS12"          "VPS16"          "TSPAN7"         "EIF3M"         
## [136] "RASA1"          "METTL23"        "SCOC"           "GCA"            "HMGN2"         
## [141] "RPS6KA1"        "TPP2"           "C7orf55-LUC7L2" "TNFSF13B"       "OXR1"          
## [146] "FPGS"           "RHOH"           "TRH"            "KNTC1"          "CDC14A"        
## [151] "DMTF1"          "CCNJ"           "EEF1B2"         "MARCH7"         "TBC1D23"       
## [156] "MTRR"           "RPL15"          "VMA21"          "PTGER4"         "MARCH6"        
## [161] "NPR3"           "SERPING1"       "BRIX1"          "RUNX2"          "MICB"          
## [166] "CPNE3"          "PNN"            "GPR114"         "BPTF"           "TAF4B"         
## [171] "RNF138"         "TTC14"          "EI24"           "RAD51AP1"       "TXNRD1"        
## [176] "KIAA1033"       "SNX10"          "TGFBR1"         "PPWD1"          "CLN8"          
## [181] "MED17"          "SP100"          "AEBP2"          "GORAB"          "CRAMP1L"       
## [186] "KLHDC2"         "SLC25A5"        "RPS8"           "UBAP1"          "COMMD3"        
## [191] "TLE4"           "ATF7IP2"        "SLC30A5"        "CRCP"           "ZMYM2"         
## [196] "TGIF1"          "PPFIA1"         "KDM6A"          "AKAP2"          "SOCS2"         
## [201] "ANKRD12"        "ITPR1"          "KLHL2"          "RARRES3"        "HNRNPA3"       
## [206] "WRNIP1"         "ATF1"           "TNRC6A"         "BZW2"           "MST4"          
## [211] "NAA30"          "METTL17"        "PTPN12"         "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 1 t-test” in the same Excel file AML_uneqvar.xlsx using the following commands

write.xlsx(sig.res1.I.t, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 1 t-test", row.names = F, append = T)

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 782 genes that are differentially expressed using function mc.step2.Waldtest and save the object final.res1.wald returned by the function in a file called AML_outcome_uneqvar_type1_step2_Wald.RData under subfolder output

final.res1.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/AML_outcome_uneqvar_type1.RData", alpha=0.05, method="homl") save(final.res1.wald, file="output/AML_outcome_uneqvar_type1_step2_Wald.RData")

This operation may take a little long time. Let’s load the object final.res1.wald from the file and then take a look.

load(file="output/AML_outcome_uneqvar_type1_step2_Wald.RData")
dim(final.res1.wald)
## [1] 4792    6
head(final.res1.wald)
##     gene         isoform step2.p.value        adj.p   threshold decision
## 1   LINS ENST00000314742  1.894253e-01 1.894253e-01 0.004998722    FALSE
## 2   LINS ENST00000559169  4.279246e-03 1.283774e-02 0.001666241    FALSE
## 3   LINS ENST00000561073  2.984596e-02 5.969193e-02 0.002499361    FALSE
## 4  UBE2S ENST00000589978  1.699557e-05 3.399114e-05 0.002499361     TRUE
## 5  UBE2S ENST00000264552  6.897118e-01 6.897118e-01 0.004998722    FALSE
## 6 TOPBP1 ENST00000505804  1.552521e-05 4.657563e-05 0.001666241     TRUE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res1.I.wald <- final.res1.wald[final.res1.wald$decision, ]
dim(sig.res1.I.wald)
## [1] 291   6
head(sig.res1.I.wald)
##        gene         isoform step2.p.value        adj.p   threshold decision
## 4     UBE2S ENST00000589978  1.699557e-05 3.399114e-05 0.002499361     TRUE
## 6    TOPBP1 ENST00000505804  1.552521e-05 4.657563e-05 0.001666241     TRUE
## 17   PDCD10 ENST00000492396  2.562722e-04 1.025089e-03 0.001249680     TRUE
## 55    INO80 ENST00000558357  4.221384e-04 1.266415e-03 0.001666241     TRUE
## 97  CAMSAP1 ENST00000409386  9.455119e-04 2.836536e-03 0.001666241     TRUE
## 101    PHF8 ENST00000338946  2.589063e-04 1.035625e-03 0.001249680     TRUE

We see that with the Wlad-test option we identified 291 isoforms that are called significant. These isoforms belong to 232 genes, which are

unique(unique(sig.res1.I.wald[, 1]))
##   [1] "UBE2S"     "TOPBP1"    "PDCD10"    "INO80"     "CAMSAP1"   "PHF8"      "SYNJ1"    
##   [8] "PHF21A"    "SHPRH"     "RPGR"      "TOMM20"    "MAP3K7"    "NHSL1"     "CDKN1C"   
##  [15] "IFIH1"     "NPM3"      "SBF2"      "DICER1"    "PLSCR1"    "TRIT1"     "CLK4"     
##  [22] "TMX3"      "MEF2C"     "FNBP4"     "RASSF2"    "MAN1A1"    "ORMDL1"    "ELOVL5"   
##  [29] "EOGT"      "FAM188A"   "EIF4E3"    "PRPF40A"   "VPS54"     "TMEM128"   "ACTN1"    
##  [36] "LPCAT1"    "SH3BP5"    "CLDND1"    "SIRT6"     "SATB1"     "CCND3"     "AZI2"     
##  [43] "MAP3K5"    "CYB5R1"    "TMEM87A"   "GRIK5"     "CCNL1"     "TOP2B"     "DEK"      
##  [50] "PCGF1"     "TNKS1BP1"  "SEPP1"     "BOD1L1"    "LILRB2"    "SEL1L3"    "MOB1A"    
##  [57] "IBTK"      "GFM2"      "AQR"       "TPD52"     "PPP1CC"    "NUP133"    "PEPD"     
##  [64] "EAPP"      "PPP2R3C"   "HMGN5"     "ABR"       "CHMP1A"    "FNBP1"     "GLRX"     
##  [71] "PITPNB"    "SMYD3"     "TRAPPC8"   "HERC4"     "SLC39A11"  "RB1CC1"    "PARL"     
##  [78] "FYN"       "TCEA1"     "DSN1"      "CYCS"      "ZFC3H1"    "ATP5A1"    "CSAD"     
##  [85] "IER3IP1"   "BLCAP"     "TAF1D"     "ACRBP"     "SPPL2A"    "DMXL2"     "MRE11A"   
##  [92] "CCPG1"     "OSBPL8"    "RGS10"     "CCDC82"    "BNIP2"     "ASUN"      "MYCBP2"   
##  [99] "ZDHHC6"    "CCDC59"    "TMEM123"   "DCTN3"     "RBM26"     "PTPN22"    "ZRANB2"   
## [106] "ARHGAP21"  "IKBKAP"    "HOPX"      "SPINK2"    "KARS"      "PCYT1A"    "PLIN2"    
## [113] "THOC2"     "UBXN1"     "CENPC"     "DLG1"      "CEBPZ"     "PIGK"      "NR2C1"    
## [120] "SLCO5A1"   "EDEM3"     "GTF2H2"    "DPM1"      "MTMR6"     "TPM2"      "IQCB1"    
## [127] "C9orf72"   "TOB1"      "SRSF7"     "IARS"      "VPS29"     "SLC2A5"    "NOL8"     
## [134] "LAMA5"     "HSPH1"     "ANKRA2"    "SNX13"     "FAM120AOS" "POLR2J2"   "TMED5"    
## [141] "GSAP"      "CWC15"     "U2AF2"     "NCK1"      "S100PBP"   "GMNN"      "MIS12"    
## [148] "SPNS3"     "VPS16"     "TSPAN7"    "EIF3M"     "RASA1"     "NUCB2"     "METTL23"  
## [155] "RASA2"     "SCOC"      "GCA"       "HMGN2"     "TPP2"      "TNFSF13B"  "OXR1"     
## [162] "CTNNB1"    "RHOH"      "MSL3"      "TRH"       "KNTC1"     "NEK6"      "CDC14A"   
## [169] "CCNJ"      "OXA1L"     "EEF1B2"    "MARCH7"    "TBC1D23"   "MTRR"      "CPSF3"    
## [176] "RPL15"     "EMD"       "VMA21"     "PTGER4"    "BAX"       "NPR3"      "SERPING1" 
## [183] "BRIX1"     "MICB"      "CPNE3"     "PNN"       "GPR114"    "TAF4B"     "PPIP5K2"  
## [190] "RNF138"    "TTC14"     "EI24"      "RAD51AP1"  "TXNRD1"    "KIAA1033"  "SNX10"    
## [197] "MGST1"     "CLN8"      "MED17"     "SP100"     "AEBP2"     "RAB2A"     "GORAB"    
## [204] "CRAMP1L"   "ZEB1"      "KLHDC2"    "SLC25A5"   "ARID4A"    "RPS8"      "COMMD3"   
## [211] "MRPS35"    "WRN"       "ATF7IP2"   "ZNF92"     "SLC30A5"   "ZMYM2"     "TGIF1"    
## [218] "AKAP2"     "SOCS2"     "CDCA7"     "ANKRD12"   "ITPR1"     "LUC7L3"    "RARRES3"  
## [225] "SEH1L"     "ATF1"      "BZW2"      "MST4"      "NAA30"     "METTL17"   "RBM3"     
## [232] "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 1 Wald” in the same Excel file AML_uneqvar.xlsx using the following commands

write.xlsx(sig.res1.I.wald, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 1 Wald", row.names = F, append = T)

Analysis based on Type 2 screening test

Performing step 1 test

The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type of function lrt.test.mclapply needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:

res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", nCore=NULL, outfile="output/AML_outcome_uneqvar_type2.RData", screen.type=2)

Let’s take a look at the results

load(file="output/AML_outcome_uneqvar_type2.RData")

The dimension of the object is

dim(res)
## [1] 8058    6

The rows on the top are

head(res)
##                lrt df nIso      p.value        FDR      Time
## KDM5A    6.8602998  4    5 0.1434574952 0.33378175 1.8200545
## NINJ2    1.2890201  1    2 0.2562286206 0.46720240 0.5789759
## RAD52    0.1514221  1    2 0.6971800786 0.82605803 0.6067107
## DCUN1D2  2.4249554  1    2 0.1194166404 0.29900001 0.6117477
## LINS    15.1641608  2    3 0.0005095001 0.01197555 0.8188503
## TMEM86B  3.2041603  2    3 0.2014769827 0.40538883 0.8531349

and on the bottom are

tail(res)
##                  lrt df nIso    p.value        FDR      Time
## SHANK3   0.193505567  2    3 0.90778040 0.95295026 0.4346292
## PRIM2    0.004453407  1    2 0.94679358 0.97345110 0.5046623
## SCAMP1  10.551362594  3    4 0.01441649 0.08660597 0.9508400
## VPS11    3.821044922  4    5 0.43076617 0.63258336 1.5673914
## EMG1     1.609379097  3    4 0.65726505 0.79773188 0.7255163
## ALDH3B1  0.665525167  3    4 0.88128123 0.93764524 0.5521927

Let’s check how many genes with missing results.

p <- which(is.na(res[, 1]))
length(p)
## [1] 231

We see that there are 231 (2.87%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]
##   [1] "HDLBP"    "GNL1"     "PPP1R10"  "GATAD2B"  "PAXBP1"   "ZMYM6"    "SRP68"    "H3F3B"   
##   [9] "ZC3HC1"   "S100A6"   "WBP2"     "RGPD6"    "TPM3"     "CDK2AP1"  "EEF1D"    "PLAGL1"  
##  [17] "SPI1"     "IDH2"     "SDHB"     "UBE2D3"   "RAB43"    "MCTP1"    "CLINT1"   "RPLP0"   
##  [25] "NDUFB1"   "PABPC1"   "VNN2"     "GBA"      "CLK2"     "EIF4E3"   "IER3"     "DDX39B"  
##  [33] "BAG6"     "DDAH2"    "CLIC1"    "EHMT2"    "MRPS25"   "ATF6B"    "TAP2"     "TAP1"    
##  [41] "HLA-DMA"  "HLA-DPA1" "AES"      "RGL2"     "TAPBP"    "USP34"    "HNRNPH1"  "C19orf70"
##  [49] "TCF20"    "AKT2"     "HLA-B"    "BCAP31"   "DNMT1"    "GNB2L1"   "BARD1"    "ACIN1"   
##  [57] "KDM5C"    "ARHGAP4"  "NGLY1"    "GPSM3"    "ZNF468"   "CUTA"     "FLNA"     "VPS52"   
##  [65] "MS4A4E"   "LUZP1"    "CHFR"     "YWHAE"    "MCAM"     "SETD4"    "DDX5"     "RASGRP2" 
##  [73] "RBM39"    "TPT1"     "ETFA"     "NREP"     "REEP5"    "CTSH"     "RPL17"    "CPVL"    
##  [81] "NAP1L1"   "LDHB"     "ATP5G2"   "TMEM66"   "FIBP"     "C12orf50" "ATP6V0D1" "RSL1D1"  
##  [89] "MPO"      "FEZ2"     "HSP90AA1" "MCFD2"    "SMG1"     "NDUFB11"  "TLN1"     "PKM"     
##  [97] "TUBA1A"   "SOS1"     "SYT15"    "AIFM1"    "IKBIP"    "CPSF3L"   "PRKRA"    "POLR2J"  
## [105] "CCNL2"    "ATP5I"    "P4HB"     "ACSL1"    "CCT3"     "KMT2C"    "CDC16"    "RARS"    
## [113] "IGFBP2"   "ATP6V0E1" "HLA-A"    "NUDT16"   "HLA-E"    "ABCF1"    "TUBB"     "EIF4A1"  
## [121] "CD44"     "LDHA"     "UBR7"     "TSPAN17"  "ARL6IP4"  "MATR3"    "PTPRA"    "DNM1"    
## [129] "TIGD5"    "RWDD1"    "SEC24B"   "SLC16A5"  "PSMD9"    "MRPS18C"  "IL1RN"    "MYO15B"  
## [137] "RNF10"    "KPNB1"    "TNF"      "AIF1"     "PRRC2A"   "HMHA1"    "CSNK2B"   "CIRBP"   
## [145] "MSH5"     "MYB"      "C6orf48"  "ZNF185"   "PSMB9"    "BRD2"     "HLA-DPB1" "RING1"   
## [153] "RPS18"    "NDUFS6"   "IFNGR2"   "ADD1"     "LST1"     "SH3TC1"   "STIM2"    "SKIV2L"  
## [161] "SERF2"    "TCN2"     "PHF1"     "CTNND1"   "ABCC10"   "MTHFD2"   "ILF3"     "SMARCA4" 
## [169] "RAD23A"   "RRM1"     "TPM4"     "CDC42"    "URI1"     "ZNF507"   "SRRM1"    "SPINT2"  
## [177] "MPZL1"    "TEX264"   "NOL11"    "SNX33"    "HNRNPH3"  "RCN2"     "ANP32B"   "ATXN2L"  
## [185] "C18orf25" "CALCOCO2" "TAX1BP1"  "PTK2B"    "PRR13"    "PCBP2"    "MIA3"     "ZDHHC17" 
## [193] "POLI"     "PSMD7"    "PSMD3"    "INPP5D"   "ACSL5"    "UBN1"     "EIF4A2"   "LGALS9"  
## [201] "HM13"     "BANF1"    "MYL12A"   "CNTLN"    "LRRK2"    "CREM"     "GNAS"     "PPM1A"   
## [209] "ZYX"      "CCNY"     "MRPL48"   "FUS"      "HIF1A"    "IFIT5"    "EIF5"     "NAPG"    
## [217] "ARMC5"    "TGFB1I1"  "ACSM2A"   "C1orf162" "OGFOD1"   "NPIPB5"   "TRIM56"   "NEO1"    
## [225] "APEX1"    "PIGG"     "SNRPC"    "CDC73"    "CHPT1"    "SSFA2"    "G3BP1"

Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res2

sig.res2 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res2)
## [1] 857   6
head(sig.res2)
##             lrt df nIso      p.value         FDR       Time
## LINS   15.16416  2    3 5.095001e-04 0.011975548  0.8188503
## UBE2S  17.55095  1    2 2.797109e-05 0.001931850  0.5785339
## GTF3C3 14.99945  4    5 4.702353e-03 0.046752963  1.2817483
## LCP2   18.35829  6    7 5.396489e-03 0.049692143  3.3005738
## MSL2   23.36706  4    5 1.069418e-04 0.004676164  1.8034894
## CLK1   45.14836 12   13 9.718996e-06 0.001056536 15.3574758

We see that there are total 857 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,

head(sig.res2[order(sig.res2[, "p.value"]), ])
##             lrt df nIso      p.value          FDR      Time
## TARDBP 79.33817  7    8 1.879383e-14 1.470993e-10  5.962395
## MYCBP2 62.47023  6    7 1.415165e-11 5.538250e-08  3.677209
## SP100  77.93700 13   14 2.691443e-11 7.021975e-08 26.179079
## NCK1   57.09093  6    7 1.751101e-10 3.426467e-07  2.296355
## UBXN1  69.49333 12   13 3.983799e-10 6.091691e-07 20.073373
## SPAG9  64.67847 10   11 4.669752e-10 6.091691e-07 13.680007
tail(sig.res2[order(sig.res2[, "p.value"]), ])
##             lrt df nIso     p.value        FDR      Time
## TCEAL1 10.42689  2    3 0.005442885 0.04990939 0.3515098
## ISCU   18.33694  6    7 0.005443097 0.04990939 1.6581669
## DDX11  16.54635  5    6 0.005445589 0.04990939 1.8544092
## NUMB   21.72313  8    9 0.005455429 0.04993720 6.9271486
## RABEP1 12.64634  3    4 0.005467310 0.04993720 1.1467869
## SRPK1  16.53666  5    6 0.005467763 0.04993720 2.4430540

We saved the results in a worksheet named “Sig. G - Type 2” in a Excel file AML_uneqvar.xlsx using the following commands

write.xlsx(sig.res2[order(sig.res2[, "p.value"]), ], file="AML_uneqvar.xlsx", sheetName = "Sig. G - Type 2", row.names = T, append = T)

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 857 genes that are differentially expressed using function step2.Ftest() as follows

final.res2 <- step2.Ftest(data=data, file.step1="output/AML_outcome_uneqvar_type2.RData", alpha=0.05, method="holm")
## Number of significant genes: 857 
## Cutoff for significant isoform is 0.005474639

The function returns the results of the confirmatory test for all the isoforms of the above 857 genes in a data.frame object called final.res2. Let’s take a look at the results.

dim(final.res2)
## [1] 5185    7
head(final.res2)
##     gene         isoform        FC step2.p.value        adj.p   threshold decision
## 1   LINS ENST00000314742 -1.059541  1.846141e-01 1.846141e-01 0.005474639    FALSE
## 2   LINS ENST00000559169  1.211968  4.744510e-03 1.423353e-02 0.001824880    FALSE
## 3   LINS ENST00000561073  1.208533  3.586204e-02 7.172409e-02 0.002737320    FALSE
## 4  UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 0.002737320     TRUE
## 5  UBE2S ENST00000264552 -1.049046  6.913375e-01 6.913375e-01 0.005474639    FALSE
## 6 GTF3C3 ENST00000481098  1.380074  9.821877e-04 4.910938e-03 0.001094928     TRUE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res2.I.t <- final.res2[final.res2$decision, ]
dim(sig.res2.I.t)
## [1] 201   7
head(sig.res2.I.t)
##       gene         isoform        FC step2.p.value        adj.p    threshold decision
## 4    UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 0.0027373195     TRUE
## 6   GTF3C3 ENST00000481098  1.380074  9.821877e-04 4.910938e-03 0.0010949278     TRUE
## 64   INO80 ENST00000558357 -1.532673  4.038737e-04 1.211621e-03 0.0018248797     TRUE
## 108   PHF8 ENST00000338946  1.412894  9.481201e-06 3.792480e-05 0.0013686598     TRUE
## 121 PHF21A ENST00000534766  1.404136  6.531195e-04 4.571837e-03 0.0007820913     TRUE
## 154  SHPRH ENST00000367505  1.234300  3.793518e-04 1.138055e-03 0.0018248797     TRUE
tail(sig.res2.I.t)
##         gene         isoform        FC step2.p.value        adj.p    threshold decision
## 4998    BZW2 ENST00000452975 -1.757608  0.0004382129 0.0043821293 0.0005474639     TRUE
## 5033    MST4 ENST00000394335  1.520857  0.0001331171 0.0006655855 0.0010949278     TRUE
## 5034   NAA30 ENST00000554703 -1.374659  0.0013486235 0.0040458705 0.0018248797     TRUE
## 5074 METTL17 ENST00000553389  1.479945  0.0002133583 0.0006400749 0.0018248797     TRUE
## 5155  PTPN12 ENST00000248594  1.375450  0.0006907204 0.0034536022 0.0010949278     TRUE
## 5161   PHTF2 ENST00000307305  1.261118  0.0001464911 0.0005859646 0.0013686598     TRUE

We see that with the standard t-test option we identified 201 isoforms that are called significant. These isoforms belong to 171 genes, which are

unique(unique(sig.res2.I.t[, 1]))
##   [1] "UBE2S"          "GTF3C3"         "INO80"          "PHF8"           "PHF21A"        
##   [6] "SHPRH"          "RPGR"           "TOMM20"         "MAP3K7"         "SRSF2"         
##  [11] "CDKN1C"         "DDX24"          "NPM3"           "EIF4G3"         "DICER1"        
##  [16] "TRIT1"          "FNBP4"          "EOGT"           "PRPF40A"        "VPS54"         
##  [21] "TMEM128"        "LPCAT1"         "SATB1"          "CCND3"          "TMEM87A"       
##  [26] "TOP2B"          "TNKS1BP1"       "BOD1L1"         "CEP152"         "MOB1A"         
##  [31] "IBTK"           "AQR"            "PPP1CC"         "NUP133"         "NDRG1"         
##  [36] "PEPD"           "EAPP"           "CHMP1A"         "FNBP1"          "GLRX"          
##  [41] "TP53INP1"       "POLG2"          "PITPNB"         "TRAPPC8"        "PARL"          
##  [46] "TCEA1"          "DSN1"           "CYCS"           "CSAD"           "RPUSD1"        
##  [51] "IER3IP1"        "BLCAP"          "TRIP12"         "ACRBP"          "DMXL2"         
##  [56] "CCPG1"          "BNIP2"          "NEMF"           "ASUN"           "MYCBP2"        
##  [61] "ZDHHC6"         "CCDC59"         "RBM26"          "THOC1"          "ZRANB2"        
##  [66] "SPINK2"         "KARS"           "STARD7"         "PCYT1A"         "THOC2"         
##  [71] "UBXN1"          "MICU2"          "CENPC"          "DLG1"           "SACS"          
##  [76] "LBR"            "GTF2H2"         "DPM1"           "MTMR6"          "TPM2"          
##  [81] "IQCB1"          "C9orf72"        "ASH1L"          "VPS29"          "SLC2A5"        
##  [86] "NOL8"           "ROCK1"          "ANKRA2"         "LEMD2"          "SNX13"         
##  [91] "FAM120AOS"      "MAX"            "POLR2J2"        "TMED5"          "GRIPAP1"       
##  [96] "NCK1"           "CHD3"           "S100PBP"        "GMNN"           "MIS12"         
## [101] "VPS16"          "TSPAN7"         "EIF3M"          "RASA1"          "RASA2"         
## [106] "SCOC"           "GCA"            "HMGN2"          "TPP2"           "C7orf55-LUC7L2"
## [111] "OXR1"           "FPGS"           "RHOH"           "TRH"            "KNTC1"         
## [116] "CDC14A"         "DMTF1"          "EEF1B2"         "MARCH7"         "SAFB"          
## [121] "TBC1D23"        "MTRR"           "RPL15"          "VMA21"          "PTGER4"        
## [126] "MARCH6"         "TTLL5"          "NPR3"           "BRIX1"          "RUNX2"         
## [131] "UBE4B"          "CPNE3"          "PNN"            "GPR114"         "BPTF"          
## [136] "RNF138"         "TTC14"          "TXNRD1"         "KDM3A"          "KIAA1033"      
## [141] "TGFBR1"         "PPWD1"          "MED17"          "SP100"          "GORAB"         
## [146] "SLC25A5"        "RPS8"           "COMMD3"         "TLE4"           "ATF7IP2"       
## [151] "ZNF92"          "SLC30A5"        "CRCP"           "ZMYM2"          "TGIF1"         
## [156] "PPFIA1"         "SAP18"          "KDM6A"          "IKZF4"          "SOCS2"         
## [161] "ANKRD12"        "KLHL2"          "HNRNPA3"        "WRNIP1"         "TNRC6A"        
## [166] "BZW2"           "MST4"           "NAA30"          "METTL17"        "PTPN12"        
## [171] "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 2 t-test” in the same Excel file AML_uneqvar.xlsx using the following commands

write.xlsx(sig.res2.I.t, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 2 t-test", row.names = F, append = T)

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 857 genes that are differentially expressed using function mc.step2.Waldtest and save the object final.res1.wald returned by the function in a file called AML_outcome_uneqvar_type2_step2_Wald.RData under subfolder output

final.res2.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/AML_outcome_uneqvar_type2.RData", alpha=0.05, method="homl") save(final.res1.wald, file="output/AML_outcome_uneqvar_type2_step2_Wald.RData")

This operation may take a little long time. Let’s load the object final.res2.wald from the file and then take a look.

load(file="output/AML_outcome_uneqvar_type2_step2_Wald.RData")
dim(final.res2.wald)
## [1] 5185    6
head(final.res2.wald)
##     gene         isoform step2.p.value        adj.p   threshold decision
## 1   LINS ENST00000314742  1.894253e-01 1.894253e-01 0.005474639    FALSE
## 2   LINS ENST00000559169  4.279246e-03 1.283774e-02 0.001824880    FALSE
## 3   LINS ENST00000561073  2.984596e-02 5.969193e-02 0.002737320    FALSE
## 4  UBE2S ENST00000589978  1.699557e-05 3.399114e-05 0.002737320     TRUE
## 5  UBE2S ENST00000264552  6.897118e-01 6.897118e-01 0.005474639    FALSE
## 6 GTF3C3 ENST00000481098  5.040540e-04 2.520270e-03 0.001094928     TRUE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res2.I.wald <- final.res2.wald[final.res2.wald$decision, ]
dim(sig.res2.I.wald)
## [1] 221   6
head(sig.res2.I.wald)
##       gene         isoform step2.p.value        adj.p    threshold decision
## 4    UBE2S ENST00000589978  1.699557e-05 3.399114e-05 0.0027373195     TRUE
## 6   GTF3C3 ENST00000481098  5.040540e-04 2.520270e-03 0.0010949278     TRUE
## 64   INO80 ENST00000558357  4.221384e-04 1.266415e-03 0.0018248797     TRUE
## 108   PHF8 ENST00000338946  2.589063e-04 1.035625e-03 0.0013686598     TRUE
## 121 PHF21A ENST00000534766  2.837656e-05 1.986359e-04 0.0007820913     TRUE
## 123 PHF21A ENST00000531959  2.668982e-04 1.601389e-03 0.0009124398     TRUE

We see that with the Wlad-test option we identified 221 isoforms that are called significant. These isoforms belong to 182 genes, which are

unique(unique(sig.res2.I.wald[, 1]))
##   [1] "UBE2S"     "GTF3C3"    "INO80"     "PHF8"      "PHF21A"    "SHPRH"     "RPGR"     
##   [8] "MED14"     "TOMM20"    "MAP3K7"    "CDKN1C"    "NPM3"      "ZNF195"    "SBF2"     
##  [15] "DICER1"    "TRIT1"     "CLK4"      "FNBP4"     "FAM208A"   "ZNF800"    "EOGT"     
##  [22] "PRPF40A"   "VPS54"     "TMEM128"   "ACTN1"     "LPCAT1"    "CLDND1"    "SATB1"    
##  [29] "CCND3"     "TMEM87A"   "CCNL1"     "TOP2B"     "TNKS1BP1"  "BOD1L1"    "MOB1A"    
##  [36] "IBTK"      "AQR"       "PPP1CC"    "NUP133"    "ZNF91"     "PEPD"      "EAPP"     
##  [43] "CHMP1A"    "FNBP1"     "GLRX"      "PITPNB"    "SMYD3"     "TRAPPC8"   "HERC4"    
##  [50] "RB1CC1"    "PARL"      "FYN"       "TCEA1"     "DSN1"      "CYCS"      "ZFC3H1"   
##  [57] "ATP5A1"    "CSAD"      "RPUSD1"    "IER3IP1"   "BLCAP"     "TAF1D"     "ACRBP"    
##  [64] "TRIM23"    "DMXL2"     "CCPG1"     "OSBPL8"    "BNIP2"     "ASUN"      "MYCBP2"   
##  [71] "ZDHHC6"    "CCDC59"    "RBM26"     "ZRANB2"    "HOPX"      "SPINK2"    "KARS"     
##  [78] "STARD7"    "PCYT1A"    "THOC2"     "UBXN1"     "MICU2"     "CENPC"     "DLG1"     
##  [85] "SIKE1"     "NR2C1"     "GTF2H2"    "DPM1"      "MTMR6"     "TPM2"      "IQCB1"    
##  [92] "C9orf72"   "SRSF7"     "IARS"      "VPS29"     "SLC2A5"    "NOL8"      "HSPH1"    
##  [99] "ANKRA2"    "BAD"       "SNX13"     "FAM120AOS" "MAX"       "POLR2J2"   "TMED5"    
## [106] "FRA10AC1"  "U2AF2"     "PRKCI"     "NCK1"      "S100PBP"   "GMNN"      "ZDHHC13"  
## [113] "MIS12"     "SPNS3"     "VPS16"     "TSPAN7"    "EIF3M"     "RASA1"     "NUCB2"    
## [120] "RASA2"     "SCOC"      "GCA"       "HMGN2"     "TPP2"      "OXR1"      "CTNNB1"   
## [127] "RHOH"      "MSL3"      "TRH"       "KNTC1"     "NEK6"      "CDC14A"    "OXA1L"    
## [134] "EEF1B2"    "MARCH7"    "TBC1D23"   "MTRR"      "CPSF3"     "RPL15"     "VMA21"    
## [141] "PTGER4"    "TTLL5"     "NPR3"      "BRIX1"     "CPNE3"     "PNN"       "GPR114"   
## [148] "PPIP5K2"   "RNF138"    "TTC14"     "TXNRD1"    "EIF4B"     "KIAA1033"  "MGST1"    
## [155] "MED17"     "SP100"     "RAB2A"     "GORAB"     "SLC25A5"   "ARID4A"    "RPS8"     
## [162] "COMMD3"    "MRPS35"    "WRN"       "ATF7IP2"   "ZNF92"     "SLC30A5"   "ZMYM2"    
## [169] "TGIF1"     "IKZF4"     "SOCS2"     "TTC7A"     "CDCA7"     "ANKRD12"   "LUC7L3"   
## [176] "SEH1L"     "BZW2"      "MST4"      "NAA30"     "METTL17"   "RBM3"      "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 2 Wald” in the same Excel file AML_uneqvar.xlsx using the following commands

write.xlsx(sig.res1.I.wald, file="AML_uneqvar.xlsx", sheetName = "Sig. I - Type 2 Wald", row.names = F, append = T)

Plots of isoform abundances

we makes the isoforms plots for two genes EEF1B2 and RUNX2 using function plot.gene().

par(mfrow=c(1, 2))
dat<-plot.gene(data=data, gene="EEF1B2", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)
dat<-plot.gene(data=data, gene="RUNX2", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)

Both genes EEF1B2 and RUNX2 passed Types 1 and 2 screening tests. We see an interaction pattern in each plot, where some of the isoforms are up-regulated in cases and down-regulated in Control while other genes are opposite. There might be isoform switching events for these two genes.

Performing Analyses Assuming Unstructured Covariance Structure

The only thing we need to do is to replace uneqvar with unstr for argument VarCov.type in functions lrt.test.mclapply and mc.step2.Waldtest and repeat above analysis

Analysis based on Type 1 screening test

Performing step 1 test

As described in our Tutorial, the first step is to run the screening test, the likelihood ratio test (LRT) by calling function lrt.test.mclapply(). We proceed this step by submitting following command:

res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/AML_outcome_unstr_type1.RData", screen.type=1)

This was the most time-consuming part. It returned a matrix object res which was saved in the file AML_outcome_unstr_type1.RData under the subfolder output. Now we load the object from the file and take a look.

load(file="data/AML_isoform_outcome.RData")
load(file="output/AML_outcome_unstr_type1.RData")

The dimension of the object is

dim(res)
## [1] 8058    6

The rows on the top are

head(res)
##               lrt df nIso     p.value        FDR       Time
## KDM5A    6.221037  5    5 0.285301376 0.47868725 12.0417387
## NINJ2    6.885362  2    2 0.031978841 0.17265499  0.5221369
## RAD52    0.391820  2    2 0.822086207 0.89383631  0.5172834
## DCUN1D2  3.563903  2    2 0.168309380 0.36231660  0.5619822
## LINS    12.289044  3    3 0.006455844 0.09648172  0.9797537
## TMEM86B  3.768324  3    3 0.287592145 0.48101316  1.5405085

and on the bottom are

tail(res)
##                 lrt df nIso    p.value       FDR      Time
## SHANK3   0.55028089  3    3 0.90771392 0.9437296 1.6155670
## PRIM2    0.04645812  2    2 0.97703866 0.9859620 0.1500692
## SCAMP1  11.92651508  4    4 0.01790617 0.1378276 1.5928967
## VPS11    4.65820475  5    5 0.45899911 0.6364122 2.9066205
## EMG1     2.98019908  4    4 0.56114449 0.7144731 2.2928150
## ALDH3B1  1.17007513  4    4 0.88300055 0.9295491 2.1293185

Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt), degree of freedom (df), Number of isoforms (nIso), p-values of LRT (p.value), FDR (FDR) and time used for fitting the linear mixed effects model in second (Time). Note that there might be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.

p <- which(is.na(res[, 1]))
length(p)
## [1] 197

We see that there are 197 (2.44%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]
##   [1] "CLK1"     "GNL1"     "PPP1R10"  "GATAD2B"  "DHX16"    "PPP1R18"  "MDC1"     "PER1"    
##   [9] "PAXBP1"   "CSF3R"    "RPL8"     "WBP2"     "RPS13"    "TPM3"     "PRPF8"    "EEF1D"   
##  [17] "EIF4G2"   "SEC31A"   "NCOR1"    "UBE2D3"   "UBR4"     "RSRC2"    "RPLP0"    "PABPC1"  
##  [25] "GBA"      "FAM189B"  "CLK2"     "IER3"     "HLA-C"    "HBS1L"    "DDX39B"   "BAG6"    
##  [33] "DDAH2"    "CLIC1"    "LSM2"     "EHMT2"    "ATF6B"    "HLA-DQB1" "TAP1"     "HLA-DMB" 
##  [41] "HLA-DMA"  "HLA-DPA1" "RGL2"     "TAPBP"    "CSNK1D"   "RBMX"     "HNRNPH1"  "TCF20"   
##  [49] "MGAT4B"   "HLA-B"    "DNMT1"    "GNB2L1"   "CCNL1"    "TAPT1"    "ACIN1"    "RPL18"   
##  [57] "AGPAT1"   "WDR1"     "PBX2"     "GPSM3"    "CUTA"     "SREBF1"   "FLNA"     "VPS52"   
##  [65] "FLII"     "MS4A6A"   "MS4A4E"   "C1orf63"  "DENND5A"  "DDX5"     "XPNPEP1"  "SF1"     
##  [73] "RBM39"    "TPT1"     "FAU"      "TRIP12"   "TAF1D"    "DDB1"     "LTBP3"    "NAP1L1"  
##  [81] "RPS2"     "RNPS1"    "RSL1D1"   "RPL4"     "HSP90AA1" "WDR26"    "CSDE1"    "DAGLB"   
##  [89] "NXF1"     "MKNK1"    "PKM"      "NPIPB4"   "CCNL2"    "EFTUD2"   "TCP1"     "P4HB"    
##  [97] "HNRNPC"   "CCT3"     "CDC16"    "DOCK2"    "RPL37A"   "ARPC2"    "SEPT2"    "HLA-A"   
## [105] "HLA-E"    "ABCF1"    "INTS3"    "TUBB"     "EIF4A1"   "CD44"     "ANKHD1"   "SAP30BP" 
## [113] "SSBP1"    "LDHA"     "ACADVL"   "RBM5"     "CTNNA1"   "MATR3"    "MACF1"    "APLP2"   
## [121] "KNTC1"    "MYO15B"   "RNF10"    "KPNB1"    "VIM"      "TNF"      "AIF1"     "PRRC2A"  
## [129] "CSNK2B"   "CIRBP"    "MSH5"     "C6orf48"  "HLA-DRA"  "PSMB9"    "BRD2"     "HLA-DPB1"
## [137] "SLC39A7"  "RING1"    "RPS18"    "PTPRC"    "IFNGR2"   "SAFB"     "CANX"     "SRSF5"   
## [145] "KDM3B"    "LST1"     "ATG4D"    "CCNDBP1"  "SKIV2L"   "FOS"      "RNF5"     "PHF1"    
## [153] "MTHFD2"   "MTA1"     "LSM14A"   "RPL27A"   "DENND3"   "TCF25"    "CLK3"     "NOL11"   
## [161] "EWSR1"    "CSK"      "DOCK8"    "ATXN2L"   "MORF4L1"  "PPWD1"    "UBE2I"    "PSMD2"   
## [169] "EIF4G1"   "PCBP2"    "PTMA"     "NDUFV1"   "INPP5D"   "EIF4A2"   "USP3"     "SRSF11"  
## [177] "AMPD2"    "FBRS"     "MYL12A"   "GNAS"     "PPM1A"    "SRRM2"    "ZYX"      "TTC7A"   
## [185] "FUS"      "HIF1A"    "ARMC5"    "CDK16"    "KLC1"     "INPPL1"   "PCM1"     "SENP6"   
## [193] "OTUB1"    "ERGIC3"   "PIGG"     "SSFA2"    "LMNA"

Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res1.unstr

sig.res1.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res1.unstr)
## [1] 93  6

We see that there are total 93 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,

head(sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ])
##              lrt df nIso      p.value        FDR        Time
## BNIP2   42.12067  9    9 3.123765e-06 0.02455591  80.0752339
## KDM5B   27.71828  4    4 1.422479e-05 0.02785048   3.4180684
## RPS8    36.06552  8    8 1.708174e-05 0.02785048  63.9374423
## ANKRD10 37.28777  9    9 2.338120e-05 0.02785048 147.6992440
## LPCAT1  21.01381  2    2 2.734694e-05 0.02785048   0.2510765
## RUNX2   30.23851  6    6 3.540942e-05 0.02785048  15.6507425
tail(sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ])
##              lrt df nIso      p.value        FDR      Time
## SLC25A5 19.90253  4    4 0.0005220184 0.04614520  1.701707
## RAD23B  17.63746  3    3 0.0005224428 0.04614520  1.001509
## GORAB   15.05812  2    2 0.0005372435 0.04643790  0.170177
## SPPL2A  17.57728  3    3 0.0005375715 0.04643790  1.283314
## GOLGA4  21.88731  5    5 0.0005500644 0.04700061  6.670376
## TOP2B   23.72432  6    6 0.0005868739 0.04960662 20.298992

We saved the results in a worksheet named “Sig. G - Type 1” in a Excel file AML_unstr.xlsx using the following commands

require(xlsx)
write.xlsx(sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ], file="AML_unstr.xlsx", sheetName = "Sig. G - Type 1", row.names = T)

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 93 genes that are differentially expressed using function step2.Ftest() as follows

final.res1 <- step2.Ftest(data=data, file.step1="output/AML_outcome_unstr_type1.RData", alpha=0.05, method="holm")
## Number of significant genes: 93 
## Cutoff for significant isoform is 0.0005915278

Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the options for controlling the family-wise error rate required by our two step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Holm option throughout this document. The function returns the results of the confirmatory test for all the isoforms of the above 93 genes in a data.frame object called final.res1. Let’s take a look at the results.

dim(final.res1)
## [1] 456   7
head(final.res1)
##    gene         isoform        FC step2.p.value        adj.p    threshold decision
## 1 UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 0.0002957639     TRUE
## 2 UBE2S ENST00000264552 -1.049046  6.913375e-01 6.913375e-01 0.0005915278    FALSE
## 3 FBXO3 ENST00000526785 -1.002767  9.715154e-01 9.715154e-01 0.0005915278    FALSE
## 4 FBXO3 ENST00000530401  1.071855  3.345607e-01 6.691213e-01 0.0002957639    FALSE
## 5 FBXO3 ENST00000534136  1.197882  7.787110e-03 2.336133e-02 0.0001971759    FALSE
## 6 FBXO3 ENST00000448981 -1.319633  2.182887e-03 8.731547e-03 0.0001478819    FALSE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res1.I.t <- final.res1[final.res1$decision, ]
dim(sig.res1.I.t)
## [1] 28  7
head(sig.res1.I.t)
##      gene         isoform        FC step2.p.value        adj.p    threshold decision
## 1   UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 0.0002957639     TRUE
## 13   PHF8 ENST00000338946  1.412894  9.481201e-06 3.792480e-05 0.0001478819     TRUE
## 29 CDKN1C ENST00000414822 -1.565790  5.414922e-05 2.165969e-04 0.0001478819     TRUE
## 32   TMX3 ENST00000566887  1.298018  7.545970e-05 1.509194e-04 0.0002957639     TRUE
## 33   TMX3 ENST00000299608  1.270793  1.538746e-04 1.538746e-04 0.0005915278     TRUE
## 62  VPS54 ENST00000409558  1.323597  6.284013e-05 2.513605e-04 0.0001478819     TRUE
tail(sig.res1.I.t)
##        gene         isoform        FC step2.p.value        adj.p    threshold decision
## 313    CCNJ ENST00000465148 -1.293933  1.868237e-04 3.736475e-04 2.957639e-04     TRUE
## 326  EEF1B2 ENST00000435123 -3.253894  1.033427e-05 8.267413e-05 7.394097e-05     TRUE
## 382  TXNRD1 ENST00000427956 -1.697822  9.090531e-07 8.181478e-06 6.572531e-05     TRUE
## 433 ATF7IP2 ENST00000570163  1.893709  1.698652e-05 8.493261e-05 1.183056e-04     TRUE
## 439   AKAP2 ENST00000480388 -1.429504  3.507530e-05 7.015059e-05 2.957639e-04     TRUE
## 455   PHTF2 ENST00000307305  1.261118  1.464911e-04 5.859646e-04 1.478819e-04     TRUE

We see that with the standard t-test option we identified 28 isoforms that are called significant. These isoforms belong to 25 genes, which are

unique(unique(sig.res1.I.t[, 1]))
##  [1] "UBE2S"   "PHF8"    "CDKN1C"  "TMX3"    "VPS54"   "LPCAT1"  "AZI2"    "PEPD"    "DSN1"   
## [10] "CSAD"    "BLCAP"   "SPPL2A"  "BNIP2"   "ZDHHC6"  "PTPN22"  "CENPC"   "POLR2J2" "GMNN"   
## [19] "GCA"     "CCNJ"    "EEF1B2"  "TXNRD1"  "ATF7IP2" "AKAP2"   "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 1 t-test” in the same Excel file AML_unstr.xlsx using the following commands

write.xlsx(sig.res1.I.t, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 1 t-test", row.names = F, append = T)

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 93 genes that are differentially expressed using function mc.step2.Waldtest and save the object final.res1.wald.unstr returned by the function in a file called AML_outcome_unstr_type1_step2_Wald.RData under subfolder output

final.res1.wald.unstr <- mc.step2.Waldtest(data=data, VarCov.type="unstr", nCore = NULL, file.step1="output/AML_outcome_unstr_type1.RData", alpha=0.05, method="homl") save(final.res1.wald.unstr, file="output/AML_outcome_unstr_type1_step2_Wald.RData")

This operation may take a little long time. Let’s load the object final.res1.wald.unstr from the file and then take a look.

load(file="output/AML_outcome_unstr_type1_step2_Wald.RData")
dim(final.res1.wald.unstr)
## [1] 456   6
head(final.res1.wald.unstr)
##    gene         isoform step2.p.value        adj.p    threshold decision
## 1 UBE2S ENST00000589978  1.699557e-05 3.399114e-05 0.0002957639     TRUE
## 2 UBE2S ENST00000264552  6.897118e-01 6.897118e-01 0.0005915278    FALSE
## 3 FBXO3 ENST00000526785  9.691171e-01 9.691171e-01 0.0005915278    FALSE
## 4 FBXO3 ENST00000530401  3.697829e-01 7.395658e-01 0.0002957639    FALSE
## 5 FBXO3 ENST00000534136  6.467948e-03 1.940384e-02 0.0001971759    FALSE
## 6 FBXO3 ENST00000448981  2.644229e-03 1.057692e-02 0.0001478819    FALSE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res1.I.wald.unstr <- final.res1.wald.unstr[final.res1.wald.unstr$decision, ]
dim(sig.res1.I.wald.unstr)
## [1] 34  6
head(sig.res1.I.wald.unstr)
##      gene         isoform step2.p.value        adj.p    threshold decision
## 1   UBE2S ENST00000589978  1.699557e-05 3.399114e-05 0.0002957639     TRUE
## 32   TMX3 ENST00000566887  5.166484e-05 1.033297e-04 0.0002957639     TRUE
## 33   TMX3 ENST00000299608  1.112203e-04 1.112203e-04 0.0005915278     TRUE
## 60  VPS54 ENST00000416400  4.921435e-05 1.968574e-04 0.0001478819     TRUE
## 67 LPCAT1 ENST00000503252  3.445858e-06 6.891715e-06 0.0002957639     TRUE
## 72   AZI2 ENST00000476174  1.000483e-04 4.001932e-04 0.0001478819     TRUE

We see that with the Wlad-test option we identified 34 isoforms that are called significant. These isoforms belong to 29 genes, which are

unique(unique(sig.res1.I.wald.unstr[, 1]))
##  [1] "UBE2S"   "TMX3"    "VPS54"   "LPCAT1"  "AZI2"    "TOP2B"   "PEPD"    "FNBP1"   "DSN1"   
## [10] "CSAD"    "BLCAP"   "SPPL2A"  "BNIP2"   "ZDHHC6"  "PTPN22"  "PCYT1A"  "CENPC"   "EDEM3"  
## [19] "POLR2J2" "GMNN"    "GCA"     "CCNJ"    "EEF1B2"  "NPR3"    "TXNRD1"  "ATF7IP2" "AKAP2"  
## [28] "SOCS2"   "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 1 Wald” in the same Excel file AML_unstr.xlsx using the following commands

write.xlsx(sig.res1.I.wald.unstr, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 1 Wald", row.names = F, append = T)

Analysis based on Type 2 screening test

Performing step 1 test

The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type of function lrt.test.mclapply needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:

res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/AML_outcome_unstr_type2.RData", screen.type=2)

Let’s take a look at the results

load(file="output/AML_outcome_unstr_type2.RData")

The dimension of the object is

dim(res)
## [1] 8058    6

The rows on the top are

head(res)
##                lrt df nIso     p.value        FDR       Time
## KDM5A    6.1870684  4    5 0.185606678 0.38552850 14.2327073
## NINJ2    1.2890201  1    2 0.256228621 0.45805453  0.5930238
## RAD52    0.1514221  1    2 0.697180079 0.81332505  0.6523602
## DCUN1D2  2.4249554  1    2 0.119416641 0.30670736  0.7199266
## LINS    12.0569108  2    3 0.002409212 0.06680779  1.4953113
## TMEM86B  3.5274635  2    3 0.171404036 0.36748128  1.6962082

and on the bottom are

tail(res)
##                  lrt df nIso    p.value       FDR      Time
## SHANK3   0.224051413  2    3 0.89402127 0.9393367 1.6598437
## PRIM2    0.004453358  1    2 0.94679388 0.9711213 0.3595948
## SCAMP1  10.832604452  3    4 0.01266636 0.1154942 1.7084463
## VPS11    4.406684725  4    5 0.35375609 0.5516434 6.4329040
## EMG1     2.569740698  3    4 0.46281911 0.6431311 3.5995891
## ALDH3B1  0.632919850  3    4 0.88885648 0.9362399 2.0013216

Let’s check how many genes with missing results.

p <- which(is.na(res[, 1]))
length(p)
## [1] 194

We see that there are 194 (2.41%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]
##   [1] "CLK1"     "RNF44"    "GNL1"     "PPP1R10"  "GATAD2B"  "DHX16"    "PPP1R18"  "MDC1"    
##   [9] "PER1"     "PAXBP1"   "CSF3R"    "RPL8"     "WBP2"     "RPS13"    "TPM3"     "PRPF8"   
##  [17] "EEF1D"    "EIF4G2"   "SEC31A"   "NCOR1"    "UBE2D3"   "RSRC2"    "RPLP0"    "PABPC1"  
##  [25] "GBA"      "FAM189B"  "CLK2"     "IER3"     "HLA-C"    "HBS1L"    "DDX39B"   "BAG6"    
##  [33] "DDAH2"    "CLIC1"    "LSM2"     "EHMT2"    "ATF6B"    "HLA-DQB1" "TAP1"     "HLA-DMB" 
##  [41] "HLA-DMA"  "HLA-DPA1" "RGL2"     "TAPBP"    "CSNK1D"   "RBMX"     "HNRNPH1"  "TCF20"   
##  [49] "MGAT4B"   "HLA-B"    "DNMT1"    "GNB2L1"   "CCNL1"    "TAPT1"    "ACIN1"    "RPL18"   
##  [57] "AGPAT1"   "WDR1"     "PBX2"     "GPSM3"    "ZNF468"   "CUTA"     "SREBF1"   "FLNA"    
##  [65] "VPS52"    "FLII"     "MS4A6A"   "MS4A4E"   "C1orf63"  "DENND5A"  "DDX5"     "XPNPEP1" 
##  [73] "SF1"      "RBM39"    "TPT1"     "FAU"      "TRIP12"   "TAF1D"    "DDB1"     "LTBP3"   
##  [81] "NAP1L1"   "RPS2"     "RNPS1"    "RSL1D1"   "RPL4"     "HSP90AA1" "CSDE1"    "DAGLB"   
##  [89] "NXF1"     "MKNK1"    "LMBR1L"   "PKM"      "NPIPB4"   "IKBIP"    "CCNL2"    "EFTUD2"  
##  [97] "TCP1"     "P4HB"     "HNRNPC"   "CCT3"     "CDC16"    "DOCK2"    "ARPC2"    "SEPT2"   
## [105] "HLA-A"    "HLA-E"    "ABCF1"    "INTS3"    "TUBB"     "EIF4A1"   "CD44"     "ANKHD1"  
## [113] "SAP30BP"  "SSBP1"    "LDHA"     "ACADVL"   "RBM5"     "CTNNA1"   "MATR3"    "MACF1"   
## [121] "APLP2"    "KNTC1"    "MYO15B"   "RNF10"    "KPNB1"    "VIM"      "TNF"      "AIF1"    
## [129] "PRRC2A"   "CSNK2B"   "CIRBP"    "MSH5"     "C6orf48"  "HLA-DRA"  "PSMB9"    "BRD2"    
## [137] "HLA-DPB1" "SLC39A7"  "RING1"    "RPS18"    "PTPRC"    "IFNGR2"   "SAFB"     "CANX"    
## [145] "SRSF5"    "KDM3B"    "LST1"     "ATG4D"    "CCNDBP1"  "SKIV2L"   "FOS"      "RNF5"    
## [153] "PHF1"     "MTHFD2"   "MTA1"     "LSM14A"   "RPL27A"   "DENND3"   "TCF25"    "CLK3"    
## [161] "NOL11"    "EWSR1"    "CSK"      "DOCK8"    "ATXN2L"   "GAPDH"    "ADRBK1"   "UBE2I"   
## [169] "PSMD2"    "EIF4G1"   "PCBP2"    "PTMA"     "NDUFV1"   "INPP5D"   "EIF4A2"   "USP3"    
## [177] "SRSF11"   "AMPD2"    "FBRS"     "MYL12A"   "GNAS"     "PPM1A"    "SRRM2"    "ZYX"     
## [185] "TTC7A"    "HIF1A"    "ARMC5"    "KLC1"     "INPPL1"   "PCM1"     "ERGIC3"   "PIGG"    
## [193] "SSFA2"    "LMNA"

Next let’s retain only Those called significant at level FDR = 0.05 and assign the results in a new object called sig.res2.unstr

sig.res2.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
dim(sig.res2.unstr)
## [1] 102   6
head(sig.res2.unstr)
##             lrt df nIso      p.value        FDR       Time
## UBE2S  17.55095  1    2 2.797109e-05 0.02086329  0.6880627
## INO80  14.83799  2    3 5.997514e-04 0.04879397  1.3273273
## FBXO3  22.53701  3    4 5.042727e-05 0.02643313  3.3897028
## RPGR   15.15625  2    3 5.115181e-04 0.04749958  1.6220014
## MED14  16.12015  2    3 3.159028e-04 0.04516836  1.5281327
## ANTXR2 22.23701  4    5 1.797810e-04 0.03850275 17.0965638

We see that there are total 102 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values and check those on the top and bottom,

head(sig.res2.unstr[order(sig.res2.unstr[, "p.value"]), ])
##                     lrt df nIso      p.value        FDR       Time
## BNIP2          38.21052  8    9 6.883128e-06 0.01944874  56.656606
## KDM5B          26.14703  3    4 8.884721e-06 0.01944874   5.739940
## ANKRD10        37.06555  8    9 1.119377e-05 0.01944874 220.359912
## RPS8           34.58682  7    8 1.337205e-05 0.01944874 136.565551
## RPL17-C18orf32 22.24619  2    3 1.476729e-05 0.01944874   1.486135
## RUNX2          29.98654  5    6 1.483882e-05 0.01944874  19.768445
tail(sig.res2.unstr[order(sig.res2.unstr[, "p.value"]), ])
##               lrt df nIso      p.value        FDR        Time
## IKZF5    14.83098  2    3 0.0006018584 0.04879397   1.6080716
## GTF2H2   11.72566  1    2 0.0006164424 0.04946636   0.2989671
## ASUN     14.73104  2    3 0.0006326953 0.04997282   0.7070322
## KIAA1109 23.52474  6    7 0.0006384974 0.04997282  22.3730366
## PRKCSH   27.22480  8    9 0.0006464838 0.04997282 216.3108540
## DICER1   23.48910  6    7 0.0006481724 0.04997282  39.3696904

We saved the results in a worksheet named “Sig. G - Type 2” in a Excel file AML_unstr.xlsx using the following commands

write.xlsx(sig.res2.unstr[order(sig.res2.unstr[, "p.value"]), ], file="AML_unstr.xlsx", sheetName = "Sig. G - Type 2", row.names = T, append = T)

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 102 genes that are differentially expressed using function step2.Ftest() as follows

final.res2 <- step2.Ftest(data=data, file.step1="output/AML_outcome_unstr_type2.RData", alpha=0.05, method="holm")
## Number of significant genes: 102 
## Cutoff for significant isoform is 0.0006485249

The function returns the results of the confirmatory test for all the isoforms of the above 102 genes in a data.frame object called final.res2. Let’s take a look at the results.

dim(final.res2)
## [1] 475   7
head(final.res2)
##    gene         isoform        FC step2.p.value        adj.p    threshold decision
## 1 UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 0.0003242625     TRUE
## 2 UBE2S ENST00000264552 -1.049046  6.913375e-01 6.913375e-01 0.0006485249    FALSE
## 3 INO80 ENST00000361937  1.005198  9.484275e-01 9.484275e-01 0.0006485249    FALSE
## 4 INO80 ENST00000558357 -1.532673  4.038737e-04 1.211621e-03 0.0002161750    FALSE
## 5 INO80 ENST00000401393  1.095610  2.441340e-01 4.882680e-01 0.0003242625    FALSE
## 6 FBXO3 ENST00000526785 -1.002767  9.715154e-01 9.715154e-01 0.0006485249    FALSE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res2.I.t <- final.res2[final.res2$decision, ]
dim(sig.res2.I.t)
## [1] 14  7
head(sig.res2.I.t)
##       gene         isoform        FC step2.p.value        adj.p    threshold decision
## 1    UBE2S ENST00000589978 -2.427634  2.705008e-05 5.410016e-05 3.242625e-04     TRUE
## 25  CDKN1C ENST00000414822 -1.565790  5.414922e-05 2.165969e-04 1.621312e-04     TRUE
## 52  LPCAT1 ENST00000503252  1.641507  6.297745e-06 1.259549e-05 3.242625e-04     TRUE
## 142   DSN1 ENST00000373740  1.487508  4.393664e-05 1.318099e-04 2.161750e-04     TRUE
## 165  BLCAP ENST00000397137  1.798735  1.814551e-05 3.629101e-05 3.242625e-04     TRUE
## 172  BNIP2 ENST00000439052  2.381825  1.504972e-05 1.354475e-04 7.205832e-05     TRUE
tail(sig.res2.I.t)
##        gene         isoform        FC step2.p.value        adj.p    threshold decision
## 254 POLR2J2 ENST00000591000 -2.085772  2.400970e-05 7.202909e-05 2.161750e-04     TRUE
## 276     GCA ENST00000487445  1.545836  1.254005e-04 3.762016e-04 2.161750e-04     TRUE
## 321  EEF1B2 ENST00000435123 -3.253894  1.033427e-05 8.267413e-05 8.106562e-05     TRUE
## 406  TXNRD1 ENST00000427956 -1.697822  9.090531e-07 8.181478e-06 7.205832e-05     TRUE
## 438 ATF7IP2 ENST00000570163  1.893709  1.698652e-05 8.493261e-05 1.297050e-04     TRUE
## 474   PHTF2 ENST00000307305  1.261118  1.464911e-04 5.859646e-04 1.621312e-04     TRUE

We see that with the standard t-test option we identified 14 isoforms that are called significant. These isoforms belong to 14 genes, which are

unique(unique(sig.res2.I.t[, 1]))
##  [1] "UBE2S"   "CDKN1C"  "LPCAT1"  "DSN1"    "BLCAP"   "BNIP2"   "ZDHHC6"  "CENPC"   "POLR2J2"
## [10] "GCA"     "EEF1B2"  "TXNRD1"  "ATF7IP2" "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 2 t-test” in the same Excel file AML_unstr.xlsx using the following commands

write.xlsx(sig.res2.I.t, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 2 t-test", row.names = F, append = T)

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 102 genes that are differentially expressed using function mc.step2.Waldtest and save the object final.res2.wald.unstr returned by the function in a file called AML_outcome_unstr_type2_step2_Wald.RData under subfolder output

final.res2.wald.unstr <- mc.step2.Waldtest(data=data, VarCov.type="unstr", nCore = NULL, file.step1="output/AML_outcome_unstr_type2.RData", alpha=0.05, method="homl") save(final.res2.wald.unstr, file="output/AML_outcome_unstr_type2_step2_Wald.RData")

This operation may take a little long time. Let’s load the object final.res2.wald.unstr from the file and then take a look.

load(file="output/AML_outcome_unstr_type2_step2_Wald.RData")
dim(final.res2.wald.unstr)
## [1] 475   6
head(final.res2.wald.unstr)
##    gene         isoform step2.p.value        adj.p    threshold decision
## 1 UBE2S ENST00000589978  1.699557e-05 3.399114e-05 0.0003242625     TRUE
## 2 UBE2S ENST00000264552  6.897118e-01 6.897118e-01 0.0006485249    FALSE
## 3 INO80 ENST00000361937  9.481698e-01 9.481698e-01 0.0006485249    FALSE
## 4 INO80 ENST00000558357  4.221384e-04 1.266415e-03 0.0002161750    FALSE
## 5 INO80 ENST00000401393  2.324067e-01 4.648134e-01 0.0003242625    FALSE
## 6 FBXO3 ENST00000526785  9.691171e-01 9.691171e-01 0.0006485249    FALSE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE.

The isoforms that are called significant are as follows

sig.res2.I.wald.unstr <- final.res2.wald.unstr[final.res2.wald.unstr$decision, ]
dim(sig.res2.I.wald.unstr)
## [1] 23  6
head(sig.res2.I.wald.unstr)
##       gene         isoform step2.p.value        adj.p    threshold decision
## 1    UBE2S ENST00000589978  1.699557e-05 3.399114e-05 3.242625e-04     TRUE
## 34  DICER1 ENST00000393063  5.474378e-05 3.832065e-04 9.264642e-05     TRUE
## 52  LPCAT1 ENST00000503252  3.445858e-06 6.891715e-06 3.242625e-04     TRUE
## 75   TOP2B ENST00000542520  1.703146e-05 1.021888e-04 1.080875e-04     TRUE
## 119  FNBP1 ENST00000478129  3.703202e-05 2.221921e-04 1.080875e-04     TRUE
## 142   DSN1 ENST00000373740  5.528330e-05 1.658499e-04 2.161750e-04     TRUE

We see that with the Wlad-test option we identified 23 isoforms that are called significant. These isoforms belong to 21 genes, which are

unique(unique(sig.res2.I.wald.unstr[, 1]))
##  [1] "UBE2S"   "DICER1"  "LPCAT1"  "TOP2B"   "FNBP1"   "DSN1"    "BLCAP"   "BNIP2"   "ZDHHC6" 
## [10] "ZRANB2"  "PCYT1A"  "CENPC"   "POLR2J2" "GCA"     "EEF1B2"  "NPR3"    "TXNRD1"  "ATF7IP2"
## [19] "ZMYM2"   "SOCS2"   "PHTF2"

We saved the results in a worksheet named “Sig. I - Type 2 Wald” in the same Excel file AML_unstr.xlsx using the following commands

write.xlsx(sig.res1.I.wald.unstr, file="AML_unstr.xlsx", sheetName = "Sig. I - Type 2 Wald", row.names = F, append = T)