--- title: "Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models" author: "Denis A. Shah, Erick D. De Wolf, Pierce A. Paul, Laurence V. Madden" date: "`r format(Sys.time(), '%d %B, %Y')`" output: html_document: fig_caption: yes highlight: tango number_sections: true theme: journal toc: yes toc_depth: 4 toc_float: true editor_options: chunk_output_type: console chunk_output_type: console --- # Objective(s) * ensembling + soft voting + model averaging + model stacking --------------------------------------------------------------------------------------- # The models {.tabset .tabset-fade .tabset-pills} ## Weather-based predictors |ID | Weather variable | Description | |:--|:-----------------|:----------------------------------------------------------------------------| |1 |D.A.1 | Mean D from 20 days pre-anthesis to anthesis | |2 |D.A.2 | Mean D from 20 days pre-anthesis to 10 days pre-anthesis | |3 |D.MINMAXDIFF.3 | Mean of daily max - min D from 20 days pre-anthesis to 10 days post-anthesis | |4 |D.SD.3 | Mean daily std.dev(D) from 40 days pre-anthesis to 10 days post-anthesis | |5 |INT3 | T15307 × TRH9010 interaction (product term) | |6 |P.A.1 | Mean P from 40 days pre-anthesis to 20 days pre-anthesis | |7 |P.A.2 | Mean P from 45 days pre-anthesis to 35 days pre-anthesis | |8 |P.A.3 | Mean P from 60 days pre-anthesis to 50 days pre-anthesis | |9 |P.A.5 | Mean P from 100 days pre-anthesis to 60 days pre-anthesis | |10 |P.MINMAXDIFF.1 | Mean of daily max - min P from 60 days pre-anthesis to 50 days pre-anthesis | |11 |RH.A.1 | Mean RH from 20 days pre-anthesis to anthesis | |12 |RH.A.3 | Mean RH from 10 days pre-anthesis to 10 days post-anthesis | |13 |RH.A.POST10.12H | Mean overnight RH from anthesis to 10 days post-anthesis | |14 |RH.A.POST5.12H | Mean overnight RH from anthesis to 5 days post-anthesis | |15 |RH.A.POST7.12H | Mean overnight RH from anthesis to 7 days post-anthesis | |16 |RH.A.PRE10.24H | Mean RH from 
|17 |RH.A.PRE14.24H | Mean RH from 14 days pre-anthesis to anthesis |
|18 |RH.G70.CD.2 | No. days daily mean RH > 70% from anthesis to 10 days post-anthesis |
|19 |RH.G70.CD.3 | No. days daily mean RH > 70% from 15 days pre-anthesis to 10 days post-anthesis |
|20 |RH.G80.CHD.2 | No. hr RH >= 80% from anthesis to 10 days post-anthesis |
|21 |RH.G80.CHD.3 | No. hr RH >= 80% from 20 days pre-anthesis to 10 days post-anthesis |
|22 |RH.G80.PRE14.12H | No. overnight hr RH >= 80% from 14 days pre-anthesis to anthesis |
|23 |RH.G90.CHD.1 | No. hr RH >= 90% from 20 days pre-anthesis to anthesis |
|24 |RH.G90.CHD.2 | No. hr RH >= 90% from anthesis to 10 days post-anthesis |
|25 |RH.G90.CHD.3 | No. hr RH >= 90% from 50 days pre-anthesis to 10 days post-anthesis |
|26 |RH.G90.PRE10.12H | No. hr overnight RH >= 90% from 10 days pre-anthesis to anthesis |
|27 |RH.MINMAXDIFF.3 | Mean of daily max - min RH from 30 days pre-anthesis to 10 days post-anthesis |
|28 |RH7 | Mean RH from 7 days pre-anthesis to anthesis |
|29 |sq.T.A.PRE7.24H | The square of T.A.PRE7.24H |
|30 |T.A.1 | Mean T from 25 days pre-anthesis to 15 days pre-anthesis |
|31 |T.A.2 | Mean T from 20 days pre-anthesis to 10 days pre-anthesis |
|32 |T.A.3 | Mean T from 40 days pre-anthesis to 30 days pre-anthesis |
|33 |T.A.4 | Mean T from anthesis to 10 days post-anthesis |
|34 |T.A.5 | Mean T from anthesis to 20 days post-anthesis |
|35 |T.A.POST5.24H | Mean T from anthesis to 5 days post-anthesis |
|36 |T.A.POST7.24H | Mean T from anthesis to 7 days post-anthesis |
|37 |T.A.PRE15.24H | Mean T from 15 days pre-anthesis to anthesis |
|38 |T.A.PRE7.24H | Mean T from 7 days pre-anthesis to anthesis |
|39 |T.G30.CHD.3 | No. hr T > 30C from 10 days post-anthesis to 20 days post-anthesis |
|40 |T.L9.PRE15.24H | No. hr T < 9C from 15 days pre-anthesis to anthesis |
|41 |T.L9.PRE7.24H | No. hr T < 9C from 7 days pre-anthesis to anthesis |
|42 |T.MINMAXDIFF.2 | Mean of daily max - min T from 20 days pre-anthesis to 10 days post-anthesis |
|43 |T.SD.1 | Mean daily std.dev(T) from 20 days pre-anthesis to anthesis |
|44 |T.SD.2 | Mean daily std.dev(T) from anthesis to 10 days post-anthesis |
|45 |T.SD.3 | Mean daily std.dev(T) from 20 days pre-anthesis to 10 days post-anthesis |
|46 |T15307 | No. hr 15C <= T <= 30C from 7 days pre-anthesis to anthesis |
|47 |TDD.A.2 | Mean TDD from 30 days pre-anthesis to anthesis |
|48 |TDD.A.4 | Mean TDD from anthesis to 20 days post-anthesis |
|49 |TDD.A.5 | Mean TDD from 10 days pre-anthesis to 5 days post-anthesis |
|50 |TDD.A.6 | Mean TDD from 10 days pre-anthesis to 10 days post-anthesis |
|51 |TDD.SD.3 | Mean daily std.dev(TDD) from anthesis to 10 days post-anthesis |
|52 |TDD.SD.4 | Mean daily std.dev(TDD) from 20 days pre-anthesis to 10 days post-anthesis |
|53 |TRH.15T30nRHG80.CHD.1 | No. hr 15C <= T <= 30C & RH >= 80%, 30 days pre-anthesis to anthesis |
|54 |TRH.15T30nRHG80.CHD.2 | No. hr 15C <= T <= 30C & RH >= 80%, 60 to 40 days pre-anthesis |
|55 |TRH.15T30nRHG80.CHD.3 | No. hr 15C <= T <= 30C & RH >= 80%, anthesis to 10 days post-anthesis |
|56 |TRH.15T30nRHG80.POST10.12H | No. overnight hr 15C <= T <= 30C & RH >= 80%, anthesis to 10 days post-anthesis |
|57 |TRH.15T30nRHG80.POST5.12H | No. overnight hr 15C <= T <= 30C & RH >= 80%, anthesis to 5 days post-anthesis |
|58 |TRH.15T30nRHG80.POST7.12H | No. overnight hr 15C <= T <= 30C & RH >= 80%, anthesis to 7 days post-anthesis |
|59 |TRH.15T30nRHG80.PRE15.24H | No. hr 15C <= T <= 30C & RH >= 80%, 15 days pre-anthesis to anthesis |
|60 |TRH.15T30nRHG90.CHD.1 | No. hr 15C <= T <= 30C & RH >= 90%, 30 days pre-anthesis to anthesis |
|61 |TRH.15T30nRHG90.CHD.3 | No. hr 15C <= T <= 30C & RH >= 90%, anthesis to 10 days post-anthesis |
|62 |TRH.5T30nRHG75.CHD.3 | No. hr 5C <= T <= 30C & RH >= 75%, anthesis to 10 days post-anthesis |
|63 |TRH.9T30nRHG90.CHD.2 | No. hr 9C <= T <= 30C & RH >= 90%, 50 days pre-anthesis to anthesis |
|64 |TRH.9T30nRHG90.PRE15.24H | No. hr 9C <= T <= 30C & RH >= 90%, 15 days pre-anthesis to anthesis |
|65 |TRH9010 | No. hr 15C <= T <= 30C & RH >= 90%, anthesis to 10 days post-anthesis |
|66 |VPD.A.1 | Mean VPD from 10 days pre-anthesis to anthesis |
|67 |VPD.A.3 | Mean VPD from anthesis to 15 days post-anthesis |
|68 |VPD.A.4 | Mean VPD from anthesis to 20 days post-anthesis |
|69 |VPD.A.5 | Mean VPD from 10 days pre-anthesis to 10 days post-anthesis |
|70 |VPD.L11.CHD.2 | No. hr VPD <= 1.1 kPa, 40 days pre-anthesis to anthesis |
|71 |VPD.L45.PRE7.12H | No. overnight hr VPD <= 0.45 kPa, 7 days pre-anthesis to anthesis |
|72 |VPD.L6.CD.1 | No. days daily mean VPD < 0.6 kPa, 20 days pre-anthesis to anthesis |
|73 |VPD.L6.CD.3 | No. days daily mean VPD < 0.6 kPa, anthesis to 20 days post-anthesis |
|74 |VPD.L635.CHD.1 | No. hr VPD <= 0.635 kPa, 15 days pre-anthesis to 5 days pre-anthesis |
|75 |VPD.L635.CHD.2 | No. hr VPD <= 0.635 kPa, 40 days pre-anthesis to 5 days pre-anthesis |
|76 |VPD.SD.3 | Mean daily std.dev(VPD) from anthesis to 10 days post-anthesis |
|77 |VPD.SD.4 | Mean daily std.dev(VPD) from 10 days pre-anthesis to 10 days post-anthesis |

----------------------------------------------------------------------------------------------------

## Model definitions

| Model ID | Genesis | Scalars | Weather-based predictors | Period |
|:------------|:----------|:-------|:------------------------------------|:-----------|
| M1 | De Wolf (2003) | resist + wc | TRH9010 | post |
| M2 | De Wolf (2003) | resist + wc | INT3 | pre-to-post |
| M3 | De Wolf (2003) | resist + wc | T15307 | pre |
| M4 | Shah (2013) | resist + wc | RH7 | pre |
| M5 | Shah (2013) | resist + wc | RH.A.PRE10.24H | pre |
| M6 | Shah (2013) | resist + wc | RH.A.PRE14.24H | pre |
| M7 | Shah (2013) | resist + wc | RH.A.PRE10.24H + RH.G90.PRE10.12H | pre |
| M8 | Shah (2013) | resist + wc | TRH.15T30nRHG80.POST5.12H + RH.A.POST5.12H + T.A.POST5.24H | post |
| M9 | Shah (2013) | resist + wc | TRH.15T30nRHG80.POST7.12H + RH.A.POST7.12H + T.A.POST7.24H | post |
| M10 | Shah (2013) | resist + wc | TRH.15T30nRHG80.POST10.12H + RH.A.POST10.12H + T.A.4 | post |
| M11 | Shah (2013) | resist + wc | RH7 + T.A.PRE7.24H + T.L9.PRE7.24H | pre |
| M12 | Shah (2013) | resist + wc | TRH.15T30nRHG80.PRE15.24H + T.A.PRE15.24H + T.L9.PRE15.24H | pre |
| M13 | Shah (2014) | resist + wc | RH.A.PRE14.24H + TRH.9T30nRHG90.PRE15.24H + VPD.L45.PRE7.12H | pre |
| M14 | Shah (2014) | resist + wc | RH.G80.PRE14.12H + TRH.9T30nRHG90.PRE15.24H + VPD.L45.PRE7.12H | pre |
| M15 | Shah (2014) | resist + wc | RH7 + T.A.PRE7.24H + sq.T.A.PRE7.24H | pre |
| M16 | Shah (2019) | resist + wc | T.A.5 + RH.G90.CHD.2 + TRH.15T30nRHG80.CHD.3 | post |
| M17 | Shah (2019) | resist + wc | T.A.1 + RH.A.1 + TRH.15T30nRHG80.CHD.1 | pre |
| M18 | Shah (2019) | resist + wc | T.A.3 + RH.G90.CHD.1 + TRH.15T30nRHG80.CHD.2 | pre |
| M19 | Shah (2019) | resist + wc | T.A.2 + RH.A.3 + RH.G90.CHD.3 | pre-to-post |
| M20 | New | resist + wc | T.SD.1 + VPD.A.1 + P.A.3 | pre |
| M21 | New | resist + wc | T.A.2 + TDD.A.2 + P.A.5 | pre |
| M22 | New | resist + wc | T.A.1 + P.A.2 + VPD.L11.CHD.2 | pre |
| M23 | New | resist + wc | P.A.1 + D.A.2 + VPD.L6.CD.1 | pre |
| M24 | New | resist + wc | D.A.1 + TRH.9T30nRHG90.CHD.2 + P.MINMAXDIFF.1 | pre |
| M25 | New | resist + wc | T.A.3 + VPD.L635.CHD.1 + TRH.15T30nRHG90.CHD.1 | pre |
| M26 | New | resist + wc | VPD.A.3 + TRH.15T30nRHG80.CHD.3 | post |
| M27 | New | resist + wc | TRH.15T30nRHG80.CHD.3 + VPD.A.3 + T.A.4 | post |
| M28 | New | resist + wc | VPD.A.4 + T.A.5 + RH.G70.CD.2 | post |
| M29 | New | resist + wc | TDD.SD.3 + RH.G80.CHD.2 + T.G30.CHD.3 | post |
| M30 | New | resist + wc | VPD.SD.3 + TRH.15T30nRHG90.CHD.3 | post |
| M31 | New | resist + wc | TDD.A.4 + T.SD.2 + RH.G90.CHD.2 | post |
| M32 | New | resist + wc | VPD.L6.CD.3 + TRH.5T30nRHG75.CHD.3 | post |
| M33 | New | resist + wc | VPD.A.5 + RH.G70.CD.3 + T.MINMAXDIFF.2 | pre-to-post |
| M34 | New | resist + wc | VPD.A.5 + T.MINMAXDIFF.2 + D.MINMAXDIFF.3 | pre-to-post |
| M35 | New | resist + wc | VPD.L635.CHD.2 + RH.G90.CHD.3 + TDD.A.5 | pre-to-post |
| M36 | New | resist + wc | D.SD.3 + RH.MINMAXDIFF.3 | pre-to-post |
| M37 | New | resist + wc | RH.G70.CD.3 + T.SD.3 | pre-to-post |
| M38 | New | resist + wc | TDD.SD.4 + RH.G80.CHD.3 | pre-to-post |
| M39 | New | resist + wc | VPD.SD.4 + TDD.A.6 | pre-to-post |

----------------------------------------------------------------------------------------------------

```{r knitr_setup, include=FALSE, eval=TRUE}
options(digits = 3)
require(knitr)
## options
knitr::opts_chunk$set(cache = TRUE, fig.path = 'Figures/', fig.height = 6)
# # to work with kableExtra: (no longer need to set this option as per kableExtra documentation)
# options(knitr.table.format = "html")
```

```{r Libraries, eval=TRUE, echo=FALSE, message=FALSE}
library(tidyverse)
library(rsample)
library(caret)
library(ClassDiscovery)
library(dendextend)
library(GGally)
library(glmnet)
library(grid)
library(gridExtra)
library(kableExtra)
library(OptimalCutpoints)
library(viridis)
library(psych)
```

```{r LoadData, echo=FALSE, eval=TRUE, message=FALSE}
X <- readr::read_csv("EnsemblesMainData.csv")

X <-
  X %>%
  dplyr::mutate(Y = ifelse(S < 10, 0, 1)) %>%
  # Set up Class for input to caret:
  dplyr::mutate(Class = ifelse(S < 10, "Nonepidemic", "Epidemic")) %>%
  dplyr::mutate(Class = factor(Class, levels = c("Nonepidemic", "Epidemic"))) %>%
  dplyr::mutate(sq.T.A.PRE7.24H = T.A.PRE7.24H^2) %>%
  # Set up for the wc variable as in Shah et al. (2013)
  dplyr::mutate(wc = "NA") %>%
  dplyr::mutate(wc = replace(wc, type == "spring", "sw")) %>%
  dplyr::mutate(wc = replace(wc, type == "winter" & corn == 0, "wwnoc")) %>%
  dplyr::mutate(wc = replace(wc, type == "winter" & corn == 1, "wwc")) %>%
  dplyr::mutate(wc = factor(wc, levels = c("sw", "wwnoc", "wwc"))) %>%
  dplyr::mutate(state = factor(state, levels = c("AR", "DE", "IL", "IN", "KS", "KY", "MD", "MI", "MN", "MO", "ND", "NE", "NY", "OH", "PA", "SD", "WI"))) %>%
  dplyr::mutate(type = factor(type, levels = c("spring", "winter"))) %>%
  dplyr::mutate(resist = factor(resist, levels = c("VS", "S", "MS", "MR")))

Y <- as.integer(X$Class) - 1

# Initialize ID counter:
i <- 1
```

```{r Functions, eval=TRUE, echo=FALSE}
# https://stackoverflow.com/questions/35775696/trying-to-use-dplyr-to-group-by-and-apply-scale
# scale expects a matrix, so does not play well with dplyr::mutate. Best to write your own scale function:
scale2 <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# This function makes use of `yardstick` functions
IndivCVStats <- function(id, X, wvars) {
  # Obtains cross-validated model statistics
  # Args:
  #   id = a numeric id for the model
  #   X = data.frame of the input variables
  #   wvars = character vector of the weather variables in the model
  # Returns:
  #   A data frame with several cross-validated model fit statistics
  #
  vars <- syms(c("Y", "resist", "wc", wvars))
  X1 <-
    X %>%
    dplyr::select(!!!vars) %>%
    dplyr::mutate_at(vars(!!!wvars), scale2)
  fmla <- reformulate(termlabels = c("resist", "wc", wvars), response = "Y")
  # Build a tibble with 10 train and test lists:
  set.seed(100)
  folding <- modelr::crossv_kfold(data = X1, k = 10)
  # Add logistic regression models (column `ols`) to each row, estimated on the train data:
  folding <-
    folding %>%
    dplyr::mutate(ols = purrr::map(train, ~ glm(fmla, data = ., family = binomial)))
  # Generate datasets that include not just predictions (.fitted) but also the response value for each of the observations by using augment:
  folding <-
    folding %>%
    dplyr::mutate(test_with_preds = purrr::map2(ols, test, ~ broom::augment(.x, newdata = .y, type.predict = "response")))
  # Now combine all out-of-sample predictions into a single data frame:
  holdout_datasets <- rbind(folding$test_with_preds$`1`, folding$test_with_preds$`2`,
                            folding$test_with_preds$`3`, folding$test_with_preds$`4`,
                            folding$test_with_preds$`5`, folding$test_with_preds$`6`,
                            folding$test_with_preds$`7`, folding$test_with_preds$`8`,
                            folding$test_with_preds$`9`, folding$test_with_preds$`10`)
  # Data.frame of the actual response and cross-validated probabilities:
  y <- data.frame(id = 1:nrow(X), actual = holdout_datasets$Y, prob = holdout_datasets$.fitted)
  # Cut-point corresponding to YI. From the OptimalCutpoints package:
  oc <- OptimalCutpoints::optimal.cutpoints(X = prob ~ actual, tag.healthy = 0, methods = "Youden", data = y)
  cp <- oc$Youden$Global$optimal.cutoff$cutoff
  z1 <-
    y %>%
    # The predicted class:
    dplyr::mutate(pred = ifelse(prob <= cp, 0, 1)) %>%
    # Set actual and pred as factors (required for `yardstick`):
    dplyr::mutate(actual.f = factor(actual, levels = c(1, 0))) %>%
    dplyr::mutate(pred.f = factor(pred, levels = c(1, 0))) %>%
    dplyr::summarise(
      ## Just a few metrics that are given through yardstick functions
      # ROC AUC:
      auc = yardstick::roc_auc(., truth = actual.f, prob) %>% dplyr::pull(.estimate),
      # Area under the precision-recall curve:
      prauc = yardstick::pr_auc(., truth = actual.f, prob) %>% dplyr::pull(.estimate),
      # Accuracy:
      acc = yardstick::accuracy(., truth = actual.f, estimate = pred.f) %>% dplyr::pull(.estimate),
      # Total number of obs:
      n = n(),
      # Brier score:
      Brier = sum((prob - actual)^2)/n,
      # Calculations based right off the 2-way confusion matrix:
      TP = sum(actual == 1 & pred == 1),  # true positives count
      FP = sum(actual == 0 & pred == 1),  # false positives count
      FN = sum(actual == 1 & pred == 0),  # false negatives count
      TN = sum(actual == 0 & pred == 0),  # true negatives count
      # Misclassification rate:
      mr = (FP + FN)/n,
      # Sensitivity (= recall):
      Se = yardstick::sens(., truth = actual.f, estimate = pred.f) %>% dplyr::pull(.estimate),
      # Specificity:
      Sp = yardstick::spec(., truth = actual.f, estimate = pred.f) %>% dplyr::pull(.estimate),
      # Youden Index = Informedness = K:
      YI = Se + Sp - 1,
      # Informedness aka Powers (2011). Same as YI:
      ifd = YI,
      # Positive predictive value (ppv) = Pr(D1|T1) = precision:
      ppv = TP/(TP + FP),
      # Negative predictive value (npv) = Pr(D0|T0):
      npv = TN/(FN + TN),
      # Markedness aka Powers (2011):
      mkd = ppv + npv - 1,
      # The formulae used below came from Hughes et al. (2015)
      # The normalized prediction-realization table (next 4 lines):
      TPn = sum(actual == 1 & pred == 1)/n,
      FPn = sum(actual == 0 & pred == 1)/n,
      FNn = sum(actual == 1 & pred == 0)/n,
      TNn = sum(actual == 0 & pred == 0)/n,
      TPP = TPn/(TPn + FNn),  # true positive proportion = sensitivity = Pr(T1|D1)
      TNP = TNn/(TNn + FPn),  # true negative proportion = specificity = Pr(T0|D0)
      FPP = 1 - TNP,  # false positive proportion = Pr(T1|D0)
      FNP = 1 - TPP,  # false negative proportion = Pr(T0|D1)
      # Prior probability of epidemic:
      PrD1 = TPn + FNn,
      # Prior probability of non-epidemic:
      PrD0 = 1 - PrD1,
      # Probability of an epidemic prediction:
      PrT1 = TPn + FPn,
      # Probability of a non-epidemic prediction:
      PrT0 = FNn + TNn,
      # Information quantity (entropy) H(D):
      HD = -(PrD1*log(PrD1) + PrD0*log(PrD0)),
      # Information quantity (entropy) H(T):
      HT = -(PrT1*log(PrT1) + PrT0*log(PrT0)),
      # Information quantity (joint entropy) H(D,T):
      HDT = -(TPn*log(TPn) + FPn*log(FPn) + FNn*log(FNn) + TNn*log(TNn)),
      # Expected mutual information (I_M(D,T)):
      IM = HD + HT - HDT,
      # Conditional entropy H(D|T):
      HDcT = HD - IM,
      # Normalized expected mutual information,
      # (equiv. to McFadden's R2) Hughes et al. (2019):
      IMN = IM/HD,
      # MCEN:
      S = TP + TN + FP + FN,
      MCEN = (2*(FN+FP)*log2((S-TN)*(S-TP)))/(3*S + FN+FP) - (4*(FN*log2(FN) + FP*log2(FP)))/(3*S + FN+FP)
    ) %>%
    dplyr::select(auc:acc, Brier, mr:mkd, PrD1:IMN, MCEN)
  z <-
    tibble::tibble(id = id, cp = cp) %>%
    dplyr::bind_cols(dplyr::bind_rows(z1))
  return(z)
}
# Example of use:
# IndivCVStats(id = 1, X = X, wvars = "TRH9010")

stats.out <- function(df) {
  # Outputs model-fit statistics
  # Args:
  #   df: a data.frame of model statistics, not all of which you want displayed
  # Returns:
  #   a styled table of model fit statistics
  df %>%
    dplyr::select(id, cp, auc, prauc, Se, Sp, mr, mkd, ifd, IMN, MCEN) %>%
    dplyr::rename(Misclass = mr) %>%
    knitr::kable(., row.names = FALSE, align = "lcccccccccc") %>%
    kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive"), font_size = 9)
}  # end of function stats.out

GetCVProbs <- function(id, X, wvars) {
  # Obtains cross-validated model probabilities via 10-fold cross-validation
  # Args:
  #   id = a numeric id for the model
  #   X = data.frame of the input variables
  #   wvars = character vector of the weather variables in the model
  # Returns:
  #   A data frame with the obs id and the fitted probability
  #
  vars <- syms(c("Y", "resist", "wc", wvars))
  X1 <-
    X %>%
    dplyr::select(!!!vars) %>%
    dplyr::mutate_at(vars(!!!wvars), scale2)
  fmla <- reformulate(termlabels = c("resist", "wc", wvars), response = "Y")
  # Build a tibble with 10 train and test lists:
  set.seed(100)
  folding <- modelr::crossv_kfold(data = X1, k = 10)
  # Add logistic regression models (column `ols`) to each row, estimated on the train data:
  folding <-
    folding %>%
    dplyr::mutate(ols = purrr::map(train, ~ glm(fmla, data = ., family = binomial)))
  # Use map2() to generate predictions for each of the models estimated in the training subsets:
  folding <-
    folding %>%
    dplyr::mutate(test_with_preds = purrr::map2(ols, test, predict, type = "response"))
  idx <- c(folding$test$`1`$idx, folding$test$`2`$idx,
           folding$test$`3`$idx, folding$test$`4`$idx,
           folding$test$`5`$idx, folding$test$`6`$idx,
           folding$test$`7`$idx, folding$test$`8`$idx,
           folding$test$`9`$idx, folding$test$`10`$idx)
  prbs <- c(folding$test_with_preds$`1`, folding$test_with_preds$`2`,
            folding$test_with_preds$`3`, folding$test_with_preds$`4`,
            folding$test_with_preds$`5`, folding$test_with_preds$`6`,
            folding$test_with_preds$`7`, folding$test_with_preds$`8`,
            folding$test_with_preds$`9`, folding$test_with_preds$`10`)
  z <-
    data.frame(idx, prbs) %>%
    dplyr::arrange(idx) %>%
    # Name the column after the model id argument (M1, M2, ...):
    dplyr::rename(!!stringr::str_c("M", id) := prbs)
  return(z)
}  # end function GetCVProbs

GetProbs <- function(id, X, wvars) {
  # Obtains fitted model probabilities
  # Args:
  #   id = a numeric id for the model
  #   X = data.frame of the input variables
  #   wvars = character vector of the weather variables in the model
  # Returns:
  #   A data frame of the fitted probability
  #
  vars <- syms(c("Y", "resist", "wc", wvars))
  X1 <-
    X %>%
    dplyr::select(!!!vars) %>%
    dplyr::mutate_at(vars(!!!wvars), scale2)
  fmla <- reformulate(termlabels = c("resist", "wc", wvars), response = "Y")
  m <- glm(fmla, data = X1, family = binomial)
  # Data.frame of the fitted probabilities:
  z <-
    data.frame(fitted = m$fitted.values) %>%
    # Name the column after the model id argument (M1, M2, ...):
    dplyr::rename(!!stringr::str_c("M", id) := fitted)
  return(z)
}  # end function GetProbs
```

----------------------------------------------------------------------------------------------------

## De Wolf (2003) models {.tabset .tabset-fade .tabset-pills}

### Model A

Model = M1

```{r DeWolf_Arwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- "TRH9010"
# Fitted probabilities:
prbs <- GetProbs(id = i, X = X, wvars = wvars)
# Cross-validated probabilities:
cv.prbs <- GetCVProbs(id = i, X = X, wvars = wvars)
# Cross-validated stats:
indiv.stats <- IndivCVStats(id = i, X = X, wvars = wvars)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model B

Model = M2

```{r DeWolf_Brwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- "INT3"
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model J

Model = M3

```{r DeWolf_Jrwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- "T15307"
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

## Models based on [Shah et al. (2013)](https://apsjournals.apsnet.org/doi/abs/10.1094/PHYTO-11-12-0304-R) {.tabset .tabset-fade .tabset-pills}

See Table 2 of the paper.
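Each model chunk follows the same recipe: fit the logistic model, collect 10-fold cross-validated probabilities, and convert them to class predictions at the probability cut-point that maximizes Youden's index (J = Se + Sp - 1), as computed by `OptimalCutpoints` inside `IndivCVStats()`. For intuition, here is a minimal base-R sketch of that cut-point search; the function name `youden_cutpoint` and the toy data are illustrative only, not part of the analysis:

```r
# Youden's index J = Se + Sp - 1; the chosen cut-point maximizes J.
# `youden_cutpoint` is a hypothetical helper for illustration.
youden_cutpoint <- function(actual, prob) {
  grid <- sort(unique(prob))                  # candidate cut-points
  j <- vapply(grid, function(cp) {
    pred <- as.integer(prob > cp)             # same classification rule as IndivCVStats
    se <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # sensitivity
    sp <- sum(pred == 0 & actual == 0) / sum(actual == 0)  # specificity
    se + sp - 1
  }, numeric(1))
  grid[which.max(j)]
}

# Toy example: the classes separate perfectly at 0.2:
youden_cutpoint(actual = c(0, 0, 1, 1), prob = c(0.1, 0.2, 0.8, 0.9))  # 0.2
```

`OptimalCutpoints::optimal.cutpoints()` additionally reports confidence intervals and alternative criteria; this sketch only mirrors the Youden criterion used here.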
### Model 1

Model = M4

```{r Shah_1rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- "RH7"
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 2

Model = M5

```{r Shah_2rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- "RH.A.PRE10.24H"
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 3

Model = M6

```{r Shah_3rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- "RH.A.PRE14.24H"
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 4

Model = M7

```{r Shah_4rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("RH.A.PRE10.24H", "RH.G90.PRE10.12H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 5

Model = M8

```{r Shah_5rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("TRH.15T30nRHG80.POST5.12H", "RH.A.POST5.12H", "T.A.POST5.24H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 6

Model = M9

```{r Shah_6rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("TRH.15T30nRHG80.POST7.12H", "RH.A.POST7.12H", "T.A.POST7.24H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 7

Model = M10

```{r Shah_7rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("TRH.15T30nRHG80.POST10.12H", "RH.A.POST10.12H", "T.A.4")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 8

Model = M11

```{r Shah_8rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("RH7", "T.A.PRE7.24H", "T.L9.PRE7.24H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### Model 9

Model = M12

```{r Shah_9rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("TRH.15T30nRHG80.PRE15.24H", "T.A.PRE15.24H", "T.L9.PRE15.24H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

---------------------------------------------------------------------------------------------------

## Suggested from the BRT analyses in [Shah et al. (2014)](https://apsjournals.apsnet.org/doi/abs/10.1094/PHYTO-10-13-0273-R) {.tabset .tabset-fade .tabset-pills}
### BRT 1

Model = M13

```{r BRT_1rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("RH.A.PRE14.24H", "TRH.9T30nRHG90.PRE15.24H", "VPD.L45.PRE7.12H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

### BRT 2

Model = M14

```{r BRT_2rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("RH.G80.PRE14.12H", "TRH.9T30nRHG90.PRE15.24H", "VPD.L45.PRE7.12H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

----------------------------------------------------------------------------------------------------

The next model is a reformulation of a model presented in the supplement to Shah et al. (2014). The supplement used `resist` as a categorical predictor, with a binary version of `RH7` and up to a squared term for `T.A.PRE7.24H` as weather-based predictors. In the reformulation, we retain the continuous form of `RH7` and add `wc` as a categorical predictor.
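The squared-term reformulation is simply a quadratic logistic response in the temperature variable, with the square pre-computed the same way `sq.T.A.PRE7.24H` is built in the LoadData chunk. A standalone sketch on simulated data (`temp` and `sq.temp` are made-up names for illustration):

```r
# Simulated quadratic logistic response; not study data.
set.seed(42)
n <- 300
temp <- runif(n, 5, 25)
# True risk peaks at an intermediate temperature (15 C), i.e. it is
# concave on the logit scale:
p <- plogis(2 - 0.1 * (temp - 15)^2)
y <- rbinom(n, 1, p)
sq.temp <- temp^2  # pre-computed square, as for sq.T.A.PRE7.24H
m <- glm(y ~ temp + sq.temp, family = binomial)
coef(m)[["sq.temp"]]  # negative, giving the peaked (concave) risk curve
```

An equivalent fit without the pre-computed column is `glm(y ~ temp + I(temp^2), family = binomial)`; pre-computing keeps the squared term available as an ordinary column that can be standardized with `scale2()` like any other predictor.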
### BRT 3

Model = M15

```{r BRT_3rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("RH7", "T.A.PRE7.24H", "sq.T.A.PRE7.24H")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars)
# Increment i
i <- i + 1
```

---------------------------------------------------------------------------------------------------

## Models proposed in the [Philos. Trans. B (2019) paper](https://royalsocietypublishing.org/doi/full/10.1098/rstb.2018.0273) {.tabset .tabset-fade .tabset-pills}

### Post-anthesis model

Model = M16

```{r PhilTransB_1rwc, echo=FALSE, eval=TRUE, results='markup'}
wvars <- c("T.A.5", "RH.G90.CHD.2", "TRH.15T30nRHG80.CHD.3")
# Fitted probabilities:
prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars))
# Cross-validated probabilities:
cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx")
# Cross-validated stats:
g <- IndivCVStats(id = i, X = X, wvars = wvars)
indiv.stats <- dplyr::bind_rows(indiv.stats, g)
# Output the cross-validated stats:
indiv.stats %>% dplyr::slice(i) %>% stats.out(.)
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-anthesis model 1 Model = M17 ```{r PhilTransB_2rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("T.A.1", "RH.A.1", "TRH.15T30nRHG80.CHD.1") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-anthesis model 2 Model = M18 ```{r PhilTransB_3rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("T.A.3", "RH.G90.CHD.1", "TRH.15T30nRHG80.CHD.2") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-to-post model Model = M19 ```{r PhilTransB_4rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("T.A.2", "RH.A.3", "RH.G90.CHD.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ## New {.tabset .tabset-fade .tabset-pills} ### Pre-anthesis 1 Model = M20 ```{r New_pre1rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("T.SD.1", "VPD.A.1", "P.A.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-anthesis 2 Model = M21 ```{r New_pre2rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("T.A.2", "TDD.A.2", "P.A.5") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-anthesis 3 Model = M22 ```{r New_pre3rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("T.A.1", "P.A.2", "VPD.L11.CHD.2") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-anthesis 4 Model = M23 ```{r New_pre4rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("P.A.1", "D.A.2", "VPD.L6.CD.1") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-anthesis 5 Model = M24 ```{r New_pre5rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("D.A.1", "TRH.9T30nRHG90.CHD.2", "P.MINMAXDIFF.1") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-anthesis 6 Model = M25 ```{r New_pre6rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("T.A.3", "VPD.L635.CHD.1", "TRH.15T30nRHG90.CHD.1") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Post-anthesis 1 Model = M26 ```{r New_post1rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("VPD.A.3", "TRH.15T30nRHG80.CHD.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Post-anthesis 2 Model = M27 ```{r New_post2rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("TRH.15T30nRHG80.CHD.3", "VPD.A.3", "T.A.4") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Post-anthesis 3 Model = M28 ```{r New_post3rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("VPD.A.4", "T.A.5", "RH.G70.CD.2") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Post-anthesis 4 Model = M29 ```{r New_post4rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("TDD.SD.3", "RH.G80.CHD.2", "T.G30.CHD.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Post-anthesis 5 Model = M30 ```{r New_post5rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("VPD.SD.3", "TRH.15T30nRHG90.CHD.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Post-anthesis 6 Model = M31 ```{r New_post6rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("TDD.A.4", "T.SD.2", "RH.G90.CHD.2") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Post-anthesis 7 Model = M32 ```{r New_post7rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("VPD.L6.CD.3", "TRH.5T30nRHG75.CHD.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-to-post anthesis 1 Model = M33 ```{r New_prepost1rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("VPD.A.5", "RH.G70.CD.3", "T.MINMAXDIFF.2") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-to-post anthesis 2 Model = M34 ```{r New_prepost2rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("VPD.A.5", "T.MINMAXDIFF.2", "D.MINMAXDIFF.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-to-post anthesis 3 Model = M35 ```{r New_prepost3rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("VPD.L635.CHD.2", "RH.G90.CHD.3", "TDD.A.5") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-to-post anthesis 4 Model = M36 ```{r New_prepost4rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("D.SD.3", "RH.MINMAXDIFF.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-to-post anthesis 5 Model = M37 ```{r New_prepost5rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("RH.G70.CD.3", "T.SD.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) rm(g, wvars) # Increment i i <- i + 1 ``` ---------------------------------------------------------------------------------------------------- ### Pre-to-post anthesis 6 Model = M38 ```{r New_prepost6rwc, echo=FALSE, eval=TRUE, results='markup'} wvars <- c("TDD.SD.4", "RH.G80.CHD.3") # Fitted probabilities: prbs <- dplyr::bind_cols(prbs, GetProbs(id = i, X = X, wvars = wvars)) # Cross-validated probabilities: cv.prbs <- dplyr::left_join(cv.prbs, GetCVProbs(id = i, X = X, wvars = wvars), by = "idx") # Cross-validated stats: g <- IndivCVStats(id = i, X = X, wvars = wvars) indiv.stats <- dplyr::bind_rows(indiv.stats, g) # Output the cross-validated stats: indiv.stats %>% dplyr::slice(i) %>% stats.out(.) 
rm(g, wvars)

# Increment i
i <- i + 1
```

---------------------------------------------------------------------------------------------------

```{r IndivModelsStats, eval=TRUE, echo=FALSE}
period <-
   read.table(file = "EnsemblesAddlDescr.txt", header = TRUE, sep = ";", stringsAsFactors = FALSE, strip.white = TRUE) %>%
   dplyr::pull(period)

indiv.stats <-
   indiv.stats %>%
   dplyr::mutate(source = "Ind", period = period) %>%
   dplyr::select(id, source, period, everything())
```

# The cross-validated (cv) probabilities {.tabset .tabset-fade .tabset-pills}

## Correlation among the cv probabilities

**S1 Figure**

```{r Prob_cor, eval=TRUE, echo=FALSE}
# Colors used for the three model generations:
my.cols <- RColorBrewer::brewer.pal(8, "Dark2")[1:3]

# NOTE: a color vector of length 39 (one per model) triggers
# "Error: Aesthetics must be either length 1 or the same as the data (40): colour".
# ggcorr() evidently expects one more color than there are models here, so pad
# the vector with an extra color at the start (hence the first generation gets
# 4 entries below rather than 3).
M.cols <- c(rep(my.cols[1], 4), rep(my.cols[2], 12), rep(my.cols[3], 24)) GGally::ggcorr(cv.prbs %>% dplyr::select(-idx), nbreaks = 40, palette = "YlOrRd", drop = TRUE, hjust = 0.8, size = 2.5, color = M.cols, layout.exp = 1) ``` ## Distribution of cv probabilities * 25 observations at a time ```{r FittedProbsFunction, eval=TRUE, echo=FALSE} # Summary: individual histograms colored by resist, facet labels indicate epi or non-epi plotfitteddistrIV <- function(x, y) { # Plots model-fitted probabilities for a given observation, in groups of 25 obs at a time # Histograms are color-coded by resistance level # Facet labels indicate whether the observation is an epidemic or non-epidemic # # Args: # x: the first observation # y: the last observation in the range (range must have 25 obs) # # Returns: # histograms (5 x 5 matrix) showing the distributions of fitted probabilities for a single # observation generated by the set of models # A vector of the response: Y <- as.integer(X$Class) - 1 a <- c(x:y) # A vector of the epi status for each observation: epi <- factor(Y[a], levels = c(0, 1), labels = c("Nonepi", "Epi")) col.names <- stringr::str_c(as.character(a), ":", epi) first <- col.names[1] last <- col.names[length(col.names)] # A vector of the resist level for each obs: Z <- X$resist df <- cv.prbs %>% dplyr::select(-idx) df %>% dplyr::slice(a) %>% t(.) 
%>% tibble::as_tibble(., .name_repair = ~ col.names) %>% tidyr::gather(., key = obs, value = fitted.prob, first:last) %>% dplyr::mutate(obs = factor(obs, levels = col.names)) %>% ggplot(., aes(x = fitted.prob, fill = rep(Z[a], each = ncol(df)))) + geom_histogram(binwidth = 0.05, colour = "black") + # Use alpha option to lighten the fill a bit, and don't drop unused factor levels: scale_fill_viridis_d(labels = c("VS", "S", "MS", "MR"), alpha = 0.5, drop = FALSE) + expand_limits(x = c(0, 1)) + facet_wrap(~obs, ncol = 5) + xlab("Fitted probability") + ylab("Count") + theme_bw() + theme(axis.title.y = element_text(face = "bold", size = 12)) + theme(axis.text.y = element_text(size = rel(0.8))) + theme(axis.title.x = element_text(face = "bold", size = 12)) + theme(axis.text.x = element_text(size = rel(0.8))) + theme(legend.position = "bottom", legend.direction = "horizontal", legend.box = "vertical") + labs(fill = "Resistance") + # Smaller symbols in the legend key: theme(legend.key.size = unit(0.5, "line")) + # Add a vertical line at the proportion of the data that are FHB epidemics: geom_vline(xintercept = 0.27, colour = "grey50", linetype = "dashed") } ``` * Observations 251 to 275 * **S2 Figure** ```{r FittedProbsDistr251to275, eval=TRUE, echo=FALSE, results='hide', fig.height=7} plotfitteddistrIV(x = 251, y = 275) ``` -------------------------------------------------------------------------------------------------- # Classification errors {.tabset .tabset-fade .tabset-pills} ```{r ClassErrorsSetup, eval=TRUE, echo=FALSE} # Use the model-fitted probabilities to estimate the cut-point associated with the Youden Index for each model, then obtain the model's predicted classification for each observation cv.prbsII <- cv.prbs %>% dplyr::select(-idx) # Initialize an empty data.frame: class.error <- data.frame(matrix(NA, nrow = nrow(cv.prbsII), ncol = ncol(cv.prbsII))) for (i in seq_along(cv.prbsII)) { # Data frame of actual class and fitted probs: x <- data.frame(id = 
1:nrow(cv.prbsII), Y, prob = cv.prbsII[, i]) # Cut-point corresponding to YI, from the OptimalCutpoints package: oc <- OptimalCutpoints::optimal.cutpoints(X = prob ~ Y, tag.healthy = 0, methods = "Youden", data = x) cp <- oc$Youden$Global$optimal.cutoff$cutoff # Predicted classification: pred.class <- ifelse(x$prob <= cp, 0, 1) # Classification error: class.error[, i] <- pred.class - Y } names(class.error) <- names(cv.prbsII) rm(i, x, cp, oc, pred.class) ``` ```{r ClassErrorsFunctions, eval=TRUE, echo=FALSE} PlotClassError <- function(x, y) { # Plots model classification errors for a given observation, in groups of 25 obs at a time # Color-coded by resistance level # Facet labels indicate whether the observation is an epidemic or non-epidemic # # Args: # x: the first observation # y: the last observation in the range (range must have 25 obs) # # Returns: # dotplots of the classification errors given by each model on an individual observation # a <- c(x:y) # A vector of the epi status for each observation: epi <- factor(Y[a], levels = c(0, 1), labels = c("Nonepi", "Epi")) col.names <- stringr::str_c(as.character(a), ":", epi) first <- col.names[1] last <- col.names[length(col.names)] # A vector of the resist level for each obs: Z <- X$resist class.error %>% dplyr::slice(a) %>% t(.) 
%>% tibble::as_tibble(., .name_repair = ~ col.names) %>%
   # The matrix rownames (the model names) are dropped by as_tibble(), so
   # recover the model names from the column names of class.error:
   dplyr::mutate(model = names(class.error)) %>%
   tidyr::gather(., key = obs, value = classif, first:last) %>%
   dplyr::mutate(classif = factor(classif, levels = c(-1, 0, 1), labels = c("FN", "Correct", "FP"))) %>%
   dplyr::mutate(obs = factor(obs, levels = col.names)) %>%
   ggplot(., aes(x = model, y = classif, fill = rep(Z[a], each = ncol(cv.prbsII)))) +
   geom_point(shape = 21, colour = "black", alpha = 0.5) +
   # Use alpha option to lighten the fill a bit, and don't drop unused factor levels:
   scale_fill_viridis_d(labels = c("VS", "S", "MS", "MR"), alpha = 0.5, drop = FALSE) +
   facet_wrap(~obs, ncol = 5) +
   xlab("Model") +
   ylab("Classification") +
   theme_bw() +
   theme(axis.title.y = element_text(face = "bold", size = 12)) +
   theme(axis.text.y = element_text(size = rel(0.8))) +
   theme(axis.title.x = element_text(face = "bold", size = 12)) +
   theme(axis.text.x = element_blank()) +
   theme(axis.ticks.x = element_blank()) +
   theme(legend.position = "bottom", legend.direction = "horizontal", legend.box = "vertical") +
   labs(fill = "Resistance") +
   # Smaller symbols in the legend key:
   theme(legend.key.size = unit(0.5, "line"))
}
```

* Observations 251 to 275
* **S3 Figure**

```{r ClassError251to275, eval=TRUE, echo=FALSE, results='hide', fig.height=7}
PlotClassError(x = 251, y = 275)
```

-----------------------------------------------------------------------------------------------------

# Brier scores {.tabset .tabset-fade .tabset-pills}

Brier scores were calculated on the cross-validated probabilities.

```{r BrierScores, eval=TRUE, echo=FALSE}
# Calculate the Brier scores on the cv probabilities:
BrScore <- cv.prbsII %>% dplyr::mutate_all(list(~(. - Y)^2))
```

## Correlation matrix of the Brier scores

* **S4 Figure**

```{r Brier_cor, eval=TRUE, echo=FALSE, fig.keep='last'}
# Have the label colors reflect the model generations:
# Colors used for the three model generations:
my.cols <- RColorBrewer::brewer.pal(8, "Dark2")[1:3]
M.cols <- c(rep(my.cols[1], 4), rep(my.cols[2], 12), rep(my.cols[3], 24))

GGally::ggcorr(BrScore, nbreaks = 20, palette = "YlOrRd", drop = TRUE, hjust = 0.8, size = 2.5, color = M.cols, layout.exp = 1)
```

## Mean Brier score $\bar B_m$ for each model $m$

* The data shown are the mean Brier score $\bar B_m$ for each model $m$.
* **S5 Figure**

```{r Brier_Sorted, eval=TRUE, echo=FALSE, fig.height=7.0, fig.keep='last'}
# The average score over all 39 models:
BRScore.mean <- BrScore %>% purrr::map_dbl(~mean(.)) %>% mean(.)

# This gets the model generations:
labs <-
   read.table(file = "EnsemblesAddlDescr.txt", header = TRUE, sep = ";", stringsAsFactors = FALSE, strip.white = TRUE) %>%
   dplyr::mutate(genesis = factor(genesis, levels = c("DeWolf_2003", "Shah_2013", "Shah_2014", "Shah_2019", "New"))) %>%
   dplyr::mutate(gen = forcats::fct_collapse(genesis, gen1 = "DeWolf_2003", gen2 = c("Shah_2013", "Shah_2014"), gen3 = c("Shah_2019", "New"))) %>%
   dplyr::mutate(source = factor(gen, levels = c("gen1", "gen2", "gen3"), labels = c("1st", "2nd", "3rd"))) %>%
   droplevels() %>%
   dplyr::select(model, source) %>%
   # `source` now holds the generation, so rename it:
   dplyr::rename(gen = source)

# The colors used for the three model generations:
my.cols <- RColorBrewer::brewer.pal(8, "Dark2")[1:3]
# The vector of colors for models M1:M39
M.cols <- c(rep(my.cols[1], 3), rep(my.cols[2], 12), rep(my.cols[3], 24))

# A tibble of the model (M1:M39), mean Brier score, generation and label colors.
# NB: map_dbl() returns a *named* vector; keep the names (the model ids),
# since they become the `model` column used in the join against labs:
BrScore.dat <-
   BrScore %>%
   purrr::map_dbl(~mean(.)) %>%
   data.frame(brier = .) %>%
   tibble::rownames_to_column(.)
%>% dplyr::rename(model = rowname) %>% dplyr::left_join(., labs, by = "model") %>% dplyr::mutate(lab.cols = M.cols) %>% dplyr::mutate(model = factor(model, levels = model[order(brier, decreasing = TRUE)])) # However, we want the label colors sorted on the model (sorted by Brier score): lab.cols <- BrScore.dat %>% dplyr::arrange(model) %>% dplyr::pull(lab.cols) # Now can plot where points and labels are colored by model generation: BrScore.dat %>% ggplot(., aes(x = brier, y = model, colour = gen)) + theme_bw() + geom_point(size = 3) + scale_color_manual(values = my.cols) + geom_vline(xintercept = BRScore.mean, linetype = 2) + xlab("Mean Brier score") + ylab("Model") + theme(axis.title.y = element_text(face = "bold", size = 12)) + theme(axis.text.y = element_text(size = rel(1.1), colour = lab.cols)) + theme(axis.title.x = element_text(face = "bold", size = 12)) + theme(axis.text.x = element_text(size = rel(1.1))) + theme(legend.position = "bottom", legend.direction = "horizontal", legend.box = "vertical") + labs(colour = "Generation") ``` ----------------------------------------------------------------------------------------------------------------- # Soft voting ```{r SoftVote_functions, eval=TRUE, echo=FALSE} CollateStats.sv <- function(id = NULL, y) { # Collates model fit statistics and other model information. 
# # Args: # id: a unique identifier for each model # y: data.frame of the actual Class (as 0, 1) and fitted probabilities: # # Returns: # A data.frame # # cut-point which maximizes sensitivity + specificity - 1, the max of which is the Youden Index (YI) # Cut-point corresponding to YI: # From the OptimalCutpoints package: oc <- OptimalCutpoints::optimal.cutpoints(X = sv.prob ~ Y, tag.healthy = 0, methods = "Youden", data = y) cp <- oc$Youden$Global$optimal.cutoff$cutoff z1 <- y %>% # renaming so that don't have to make a bunch of replacements in the code: dplyr::rename(actual = Y) %>% dplyr::rename(prob = sv.prob) %>% # The predicted class: dplyr::mutate(pred = ifelse(prob <= cp, 0, 1)) %>% # Set actual and pred as factors (required for `yardstick`): dplyr::mutate(actual.f = factor(actual, levels = c(1, 0 ))) %>% dplyr::mutate(pred.f = factor(pred, levels = c(1, 0 ))) %>% dplyr::summarise( ## Just a few metrics that are given through yardstick functions # ROC AUC: auc = yardstick::roc_auc(., truth = actual.f, prob) %>% dplyr::pull(.estimate), # Area under the precision-recall curve: prauc = yardstick::pr_auc(., truth = actual.f, prob) %>% dplyr::pull(.estimate), # Accuracy: acc = yardstick::accuracy(., truth = actual.f, estimate = pred.f) %>% dplyr::pull(.estimate), # total number of obs n = n(), # Brier score: Brier = sum((prob - actual)^2)/n, # Calculations based right off the 2-way confusion matrix: TP = sum(actual == 1 & pred == 1), # true positives count FP = sum(actual == 0 & pred == 1), # false positives count FN = sum(actual == 1 & pred == 0), # false negatives count TN = sum(actual == 0 & pred == 0), # true negatives count # Misclassification rate: mr = (FP + FN)/n, # sensitivity (= recall): Se = yardstick::sens(., truth = actual.f, estimate = pred.f) %>% dplyr::pull(.estimate), # specificity: Sp = yardstick::spec(., truth = actual.f, estimate = pred.f) %>% dplyr::pull(.estimate), # Youden Index = Informedness = K: YI = Se + Sp - 1, # Informedness aka 
Powers (2011). Same as YI: ifd = YI, # positive predictive value (ppv) = Pr(D1|T1) = precision: ppv = TP/(TP + FP), # negative predictive value (npv) = Pr(D0|T0) npv = TN/(FN + TN), # markedness aka Powers (2011): mkd = ppv + npv - 1, # The normalized prediction-realization table (next 4 lines): TPn = sum(actual == 1 & pred == 1)/n, FPn = sum(actual == 0 & pred == 1)/n, FNn = sum(actual == 1 & pred == 0)/n, TNn = sum(actual == 0 & pred == 0)/n, TPP = TPn/(TPn + FNn), # true positive proportion = sensitivity = Pr(T1|D1) TNP = TNn/(TNn + FPn), # true negative proportion = specificity = Pr(T0|D0) FPP = 1 - TNP, # false positive proportion = Pr(T1|D0) FNP = 1 - TPP, # false negative proportion = Pr(T0|D1) # Prior probability of epidemic: PrD1 = TPn + FNn, # Prior probability of non-epidemic: PrD0 = 1 - PrD1, # Probability of an epidemic prediction: PrT1 = TPn + FPn, # Probability of a non-epidemic prediction: PrT0 = FNn + TNn, # Information quantity (entropy) H(D): HD = -(PrD1*log(PrD1) + PrD0*log(PrD0)), # Information quantity (entropy) H(T): HT = -(PrT1*log(PrT1) + PrT0*log(PrT0)), # Information quantity (joint entropy) H(D,T): HDT = -(TPn*log(TPn) + FPn*log(FPn) + FNn*log(FNn) + TNn*log(TNn)), # Expected mutual information (I_M(D,T)): IM = HD + HT - HDT, # Conditional entropy H(D|T): HDcT = HD - IM, # Normalized expected mutual information, # (equiv. to McFadden's R2) Hughes et al. 
(2019):
         IMN = IM/HD,
         # MCEN:
         S = TP + TN + FP + FN,
         MCEN = (2*(FN+FP)*log2((S-TN)*(S-TP)))/(3*S + FN+FP) - (4*(FN*log2(FN) + FP*log2(FP)))/(3*S + FN+FP)
      ) %>%
      dplyr::select(auc:acc, Brier, mr:mkd, PrD1:IMN, MCEN)

   z <- tibble::tibble(id = id, cp = cp) %>% dplyr::bind_cols(dplyr::bind_rows(z1))
   return(z)
}  # end function CollateStats.sv
```

```{r SoftVote_stats, eval=TRUE, echo=FALSE}
df.sv <-
   cv.prbs %>%
   # Drop DeWolf (2003) model M3:
   dplyr::select(-M3) %>%
   # Sort so that the data line up with the order of the response Y:
   dplyr::arrange(idx) %>%
   dplyr::rowwise() %>%
   # The soft-vote probability is the row-wise mean of the base-model cv
   # probabilities. NB: within rowwise(), c(M1:M39) would be interpreted as a
   # numeric sequence running from the value of M1 to the value of M39;
   # c_across() selects the columns M1 through M39 instead:
   dplyr::mutate(sv.prob = mean(dplyr::c_across(M1:M39))) %>%
   dplyr::ungroup() %>%
   dplyr::mutate(id = 1:nrow(.), Y = X$Y) %>%
   dplyr::select(id, Y, sv.prob) %>%
   # Output as data.frame:
   as.data.frame(.)

sv.stats <-
   CollateStats.sv(id = 1, y = df.sv) %>%
   dplyr::mutate(source = "SV", period = "Unrestricted") %>%
   dplyr::select(id, source, period, everything())

sv.stats %>%
   dplyr::select(id, cp, auc, prauc, Se, Sp, mr, mkd, ifd, IMN, MCEN) %>%
   dplyr::rename(Misclass = mr) %>%
   knitr::kable(., row.names = FALSE, align = "lcccccccccc") %>%
   kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive"), font_size = 9)
```

---------------------------------------------------------------------------------------------------

# Model averaging {.tabset .tabset-fade .tabset-pills}

* Dendrogram shows the partitioning of the models into four groups based on hierarchical clustering on the Brier scores.
    1. green = 1st generation
    2. orange = 2nd generation
    3.
     purple = 3rd generation

* **Figure 1**

```{r MA_setup_dendrogram, eval=TRUE, echo=FALSE, message=FALSE}
# For labelling the nodes by model generation (gen):
labs <- read.table(file = "EnsemblesAddlDescr.txt", header = TRUE, sep = ";",
                   stringsAsFactors = TRUE, strip.white = TRUE) %>%
  dplyr::filter(!model %in% c("M3")) %>%
  dplyr::mutate(genesis = factor(genesis, levels = c("DeWolf_2003", "Shah_2013",
                                                     "Shah_2014", "Shah_2019", "New"))) %>%
  dplyr::mutate(gen = forcats::fct_collapse(genesis,
                                            gen1 = "DeWolf_2003",
                                            gen2 = c("Shah_2013", "Shah_2014"),
                                            gen3 = c("Shah_2019", "New"))) %>%
  dplyr::mutate(source = factor(gen, levels = c("gen1", "gen2", "gen3"),
                                labels = c("1st generation", "2nd generation", "3rd generation"))) %>%
  dplyr::mutate(period = factor(period, levels = c("pre", "post", "pre-post"))) %>%
  droplevels()

# Fit the dendrogram using the Manhattan distance:
m <- BrScore %>%
  dplyr::select(-M3) %>%
  ClassDiscovery::distanceMatrix(., metric = 'manhattan')

# Following the dendextend vignettes:
dend <- m %>% hclust(method = "complete") %>% as.dendrogram

# Color branches (4 groups):
dend <- dendextend::color_branches(dend, k = 4)

# Reduce the size of the labels:
dend <- dendextend::assign_values_to_leaves_nodePar(dend, 0.6, "lab.cex")

# Have the label colors reflect labs$gen (model generation):
# my.cols <- RColorBrewer::brewer.pal(8, "Dark2")[4:6]
# Revised to be consistent with plots of metrics later on...
my.cols <- RColorBrewer::brewer.pal(8, "Dark2")[1:3]
dendextend::labels_colors(dend) <- my.cols[as.numeric(labs$gen)[order.dendrogram(dend)]]

dend %>% sort %>% plot(., horiz = TRUE, xlab = "Dissimilarity score")
```

```{r MA_setup_id_sets, eval=TRUE, echo=FALSE, message=FALSE}
## ----MA_unrestricted_sets
# Cut the dendrogram into k groups, and get the models within each group:
k <- 4
w <- dend %>% cutree(., k = k) %>% .[. == "1"] %>% names(.)
x <- dend %>% cutree(., k = k) %>% .[. == "2"] %>% names(.)
y <- dend %>% cutree(., k = k) %>% .[. == "3"] %>% names(.)
z <- dend %>% cutree(., k = k) %>% .[. == "4"] %>% names(.)

# How many possible unique level 1 models: 2295
m1 <- tidyr::crossing(w, x, y, z) %>% dplyr::distinct()

# Generate 4 larger sets, each consisting of 20 groups of base models
# (4 base models per group):
set.seed(100)
# maset = model-averaged set
maset <- sample(1:nrow(m1), size = 80, replace = FALSE)

# Model averaging sets:
ma.un.set1 <- m1[maset, ] %>% dplyr::slice(1:20)
ma.un.set2 <- m1[maset, ] %>% dplyr::slice(21:40)
ma.un.set3 <- m1[maset, ] %>% dplyr::slice(41:60)
ma.un.set4 <- m1[maset, ] %>% dplyr::slice(61:80)

# Clean up: the only objects that need to be kept are the model averaging sets,
# which hold the ids of the base models:
rm(m, dend, my.cols, k, w, x, y, z, m1, maset)

# Mean Brier score for each model:
m.brier <- BrScore %>%
  purrr::map_dbl(~mean(.)) %>%
  unlist(., use.names = FALSE) %>%
  data.frame(brier = .) %>%
  tibble::rownames_to_column(.) %>%
  dplyr::rename(model = rowname)
```

```{r MAModels, child='MAModels.Rmd', eval=TRUE, echo=FALSE}
```

## Example: Model-averaged model fit statistics

* Example: summary of model-averaged models (set 1)

```{r MA_unrestricted_display_stats, eval=TRUE, echo=FALSE}
# Example display of the results for set 1:
purrr::map(1:10, MACVStats, .set = ma.un.set1) %>%
  dplyr::bind_rows(.) %>%
  dplyr::select(id, cp, auc, prauc, Se, Sp, mr, mkd, ifd, IMN, MCEN) %>%
  dplyr::rename(Misclass = mr) %>%
  knitr::kable(., row.names = FALSE, align = "lcccccccccc") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                            font_size = 9)
```

------------------------------------------------------------------------------------------------------

# Stacking

In this section, we use functions from the `caret` package to create a stacked model, in which the base learners are the individual logistic regression models (NOTE: model M3 is excluded).
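To make the role of the three penalties concrete, here is a minimal, self-contained sketch of the stacking idea using `glmnet` directly rather than `caret` (the actual fits are in the child documents). The matrix `p.base` and response `y` below are simulated stand-ins for the cross-validated base-model probabilities and the epidemic indicator; in `glmnet`, `alpha = 0` gives the ridge penalty, `alpha = 1` the lasso, and intermediate values the elastic net.

```r
# Illustrative sketch only: a penalized logistic meta-learner stacked on
# (simulated) base-model probabilities. All object names here are hypothetical.
library(glmnet)

set.seed(1)
n <- 200
# Toy stand-ins for cross-validated probabilities from 5 base models:
p.base <- matrix(runif(n * 5), ncol = 5,
                 dimnames = list(NULL, paste0("M", 1:5)))
# Toy binary epidemic response, loosely tied to the first base model:
y <- rbinom(n, 1, plogis(4 * p.base[, 1] - 2))

# Ridge-penalized meta-learner (alpha = 0); lambda tuned by cross-validation.
# Use alpha = 1 for the lasso, or 0 < alpha < 1 for the elastic net:
stk.ridge <- cv.glmnet(p.base, y, family = "binomial", alpha = 0)

# Stacked probabilities (here on the training data, purely for illustration):
p.stacked <- predict(stk.ridge, newx = p.base, s = "lambda.min", type = "response")
```

The penalty shrinks the meta-learner's coefficients, which is the point of stacking highly correlated base models: without shrinkage, near-collinear base-model probabilities would yield unstable stacking weights.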
## Ridge penalty

```{r StackingModelRidge, child='StackingRidge.Rmd', eval=TRUE, echo=FALSE}
```

## Lasso penalty

```{r StackingModelLasso, child='StackingLasso.Rmd', eval=TRUE, echo=FALSE}
```

## Elastic-net penalty

```{r StackingModelElasticNet, child='StackingElasticNet.Rmd', eval=TRUE, echo=FALSE}
```

-----------------------------------------------------------------------------------------------------

# Model performance {.tabset .tabset-fade .tabset-pills}

```{r Perf_Setup, eval=TRUE, echo=FALSE}
labs <- read.table(file = "EnsemblesAddlDescr.txt", header = TRUE, sep = ";",
                   stringsAsFactors = TRUE, strip.white = TRUE) %>%
  dplyr::rename(source = genesis)

# The individual models:
indiv.stats <- indiv.stats %>%
  dplyr::mutate(id = stringr::str_c("LR:", id)) %>%
  dplyr::mutate(source = as.character(labs$source)) %>%
  dplyr::mutate(type = "I")

# Soft voting model:
sv.stats <- sv.stats %>%
  dplyr::mutate(source = "Soft vote") %>%
  dplyr::mutate(id = stringr::str_c("SV:", id)) %>%
  dplyr::mutate(type = "SV")

# Model averaged (unrestricted) Set 1:
ma.un.set1.stats <- ma.un.set1.stats %>%
  dplyr::mutate(source = "Weighted average") %>%
  dplyr::mutate(id = stringr::str_c("MA:", id)) %>%
  dplyr::mutate(type = "MA")

# Stacked (ridge) model:
stack.ridge.stats <- stack.ridge.stats %>%
  dplyr::mutate(source = "Stacked") %>%
  dplyr::mutate(id = stringr::str_c("SR:", id)) %>%
  dplyr::mutate(type = "ST")

# Stacked (lasso) model:
stack.lasso.stats <- stack.lasso.stats %>%
  dplyr::mutate(source = "Stacked") %>%
  dplyr::mutate(id = stringr::str_c("SL:", id)) %>%
  dplyr::mutate(type = "ST")

# Stacked (elastic-net) model:
stack.en.stats <- stack.en.stats %>%
  dplyr::mutate(source = "Stacked") %>%
  dplyr::mutate(id = stringr::str_c("SEN:", id)) %>%
  dplyr::mutate(type = "ST")

## Refer to the models by generation
m <- dplyr::bind_rows(indiv.stats, sv.stats, ma.un.set1.stats,
                      stack.ridge.stats, stack.lasso.stats, stack.en.stats) %>%
  dplyr::mutate(source = factor(source, levels = c("DeWolf_2003",
                                                   "Shah_2013", "Shah_2014", "Shah_2019", "New",
                                                   "Soft vote", "Weighted average", "Stacked"))) %>%
  dplyr::mutate(gen = forcats::fct_collapse(source,
                                            gen1 = "DeWolf_2003",
                                            gen2 = c("Shah_2013", "Shah_2014"),
                                            gen3 = c("Shah_2019", "New"),
                                            sv = "Soft vote",
                                            ma = "Weighted average",
                                            stk = "Stacked")) %>%
  dplyr::mutate(source = factor(gen, levels = c("gen1", "gen2", "gen3", "sv", "ma", "stk"),
                                labels = c("1st generation", "2nd generation", "3rd generation",
                                           "Soft vote", "Weighted average", "Stacked"))) %>%
  dplyr::mutate(period = factor(period,
                                levels = c("pre", "post", "pre-post", "Unrestricted"),
                                labels = c("Pre-anthesis", "Post-anthesis",
                                           "Pre- to post-anthesis", "Unrestricted"))) %>%
  dplyr::select(id, source, period, type, everything())


ModelvsStat <- function(.Var, .sorted = TRUE, .decreasing = FALSE, ref.line) {
  # Plots the model statistic for each model
  # Args:
  #  .Var = an unquoted name for the metric
  #  .sorted = Boolean for whether plots should be sorted by the metric or not
  #  .decreasing = Boolean for the sort order
  #  ref.line = numeric for placement of a vertical reference line in the panels
  # Returns:
  #  a facetted plot (by period and scalars included)
  #
  var <- enquo(.Var)

  m %>%
    dplyr::select(!!var, id, source) %>%
    dplyr::mutate(id = if (.sorted == TRUE) {
      factor(id, levels = id[order(!!var, decreasing = .decreasing)])
    } else {id}) %>%
    ggplot(., aes(x = !!var, y = id, fill = source)) +
    theme_bw() +
    # scale_fill_viridis_d() +
    scale_fill_viridis_d(guide = FALSE) +
    geom_point(shape = 21, colour = "black", size = 3) +
    # Originally had ncol = 2 prior to using generations
    facet_wrap(~ source, ncol = 3, scales = "free_y") +
    ylab("Model") +
    theme(axis.title.y = element_text(face = "bold", size = 12)) +
    theme(axis.text.y = element_text(size = rel(0.7))) +
    xlab(rlang::as_label(var)) +
    theme(axis.title.x = element_text(face = "bold", size = 12)) +
    theme(axis.text.x = element_text(size = rel(1.1))) +
    theme(strip.text.x = element_text(face = "bold", size = 10)) +
    # legend is
    # repetitious given the paneling and colored points:
    # theme(legend.position = "bottom", legend.direction = "horizontal", legend.box = "vertical") +
    # labs(fill = "Source") +
    # Add a vertical line for visual reference:
    geom_vline(xintercept = ref.line, colour = "grey50", linetype = "dashed")
}  # end function ModelvsStat
```

## Ranking metrics {.tabset .tabset-fade .tabset-pills}

### AUC {.tabset .tabset-fade .tabset-pills}

* area under the ROC curve

#### Ranked

* How do they rank?

```{r Perf_AUC_Ranked, eval=TRUE, echo=FALSE}
m %>%
  dplyr::select(id, auc) %>%
  dplyr::arrange(auc) %>%
  knitr::kable(., row.names = TRUE, align = "lc") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "500px", height = "400px")
```

#### Model-averaged comparisons {.tabset .tabset-fade .tabset-pills}

##### vs individual models {.tabset .tabset-fade .tabset-pills}

###### Sorted

```{r Perf_AUC_MAvsI_Ranked, eval=TRUE, echo=FALSE}
# Model-averaged vs individual models, ranked:
m %>%
  dplyr::filter(type %in% c("I", "MA")) %>%
  dplyr::select(type, auc) %>%
  dplyr::arrange(auc) %>%
  knitr::kable(., row.names = TRUE, align = "lc") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "500px", height = "400px")
```

###### Means

```{r Perf_AUC_MAvsI_Means, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("I", "MA")) %>%
  dplyr::select(type, auc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(auc)) %>%
  knitr::kable(., row.names = TRUE, align = "lcc",
               col.names = c("Model type", "n", "Mean ROC-AUC")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "500px")
```

##### vs soft vote

```{r Perf_AUC_MAvsSV_Means, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("SV", "MA")) %>%
  dplyr::select(type, auc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(auc)) %>%
  knitr::kable(., row.names =
               TRUE, align = "lcc", col.names = c("Model type", "n", "Mean ROC-AUC")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "500px")
```

#### Stacked comparisons {.tabset .tabset-fade .tabset-pills}

##### vs individual models

```{r Perf_AUC_StackedvsInd, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("I", "ST")) %>%
  dplyr::select(type, auc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(auc)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(mx = max(avg)) %>%
  dplyr::mutate(incr = 100*(mx - avg)/avg) %>%
  knitr::kable(., row.names = TRUE, align = "lcccc",
               col.names = c("Model type", "n", "Mean ROC-AUC",
                             "Max(mean ROC-AUC)", "Increase Mean ROC-AUC (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "600px")
```

##### vs soft vote model

```{r Perf_AUC_StackedvsSV, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("SV", "ST")) %>%
  dplyr::select(type, auc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(auc)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(mx = max(avg)) %>%
  dplyr::mutate(incr = 100*(mx - avg)/avg) %>%
  knitr::kable(., row.names = TRUE, align = "lcccc",
               col.names = c("Model type", "n", "Mean ROC-AUC",
                             "Max(mean ROC-AUC)", "Increase Mean ROC-AUC (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "600px")
```

##### vs model-averaged

```{r Perf_AUC_StackedvsMA, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("MA", "ST")) %>%
  dplyr::select(type, auc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(auc)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(mx = max(avg)) %>%
  dplyr::mutate(incr = 100*(mx - avg)/avg) %>%
  knitr::kable(., row.names = TRUE, align = "lcccc",
               col.names = c("Model type", "n", "Mean ROC-AUC",
                             "Max(mean ROC-AUC)", "Increase Mean ROC-AUC (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped",
                                                  "hover", "responsive")) %>%
  scroll_box(width = "600px")
```

------------------------------------------------------------------------------------------------------

### PR-AUC {.tabset .tabset-fade .tabset-pills}

* area under the precision-recall curve

#### Ranked

* How do they rank?

```{r Perf_PRAUC_Ranked, eval=TRUE, echo=FALSE}
m %>%
  dplyr::select(id, prauc) %>%
  dplyr::arrange(prauc) %>%
  knitr::kable(., row.names = TRUE, align = "lc") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "500px", height = "400px")
```

#### Model-averaged comparisons {.tabset .tabset-fade .tabset-pills}

##### vs individual models {.tabset .tabset-fade .tabset-pills}

###### Sorted

```{r Perf_PRAUC_MAvsI_Ranked, eval=TRUE, echo=FALSE}
# Model-averaged vs individual models, ranked:
m %>%
  dplyr::filter(type %in% c("I", "MA")) %>%
  dplyr::select(type, prauc) %>%
  dplyr::arrange(prauc) %>%
  knitr::kable(., row.names = TRUE, align = "lc") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "500px", height = "400px")
```

###### Means

```{r Perf_PRAUC_MAvsI_Means, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("I", "MA")) %>%
  dplyr::select(type, prauc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(prauc)) %>%
  knitr::kable(., row.names = TRUE, align = "lcc",
               col.names = c("Model type", "n", "Mean PR-AUC")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "500px")
```

##### vs soft vote

```{r Perf_PRAUC_MAvsSV_Means, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("SV", "MA")) %>%
  dplyr::select(type, prauc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(prauc)) %>%
  knitr::kable(., row.names = TRUE, align = "lcc",
               col.names = c("Model type", "n", "Mean PR-AUC")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width =
             "500px")
```

#### Stacked comparisons {.tabset .tabset-fade .tabset-pills}

##### vs individual models

```{r Perf_PRAUC_StackedvsInd, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("I", "ST")) %>%
  dplyr::select(type, prauc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(prauc)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(mx = max(avg)) %>%
  dplyr::mutate(incr = 100*(mx - avg)/avg) %>%
  knitr::kable(., row.names = TRUE, align = "lcccc",
               col.names = c("Model type", "n", "Mean PR-AUC",
                             "Max(mean PR-AUC)", "Increase Mean PR-AUC (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "600px")
```

##### vs soft vote model

```{r Perf_PRAUC_StackedvsSV, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("SV", "ST")) %>%
  dplyr::select(type, prauc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(prauc)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(mx = max(avg)) %>%
  dplyr::mutate(incr = 100*(mx - avg)/avg) %>%
  knitr::kable(., row.names = TRUE, align = "lcccc",
               col.names = c("Model type", "n", "Mean PR-AUC",
                             "Max(mean PR-AUC)", "Increase Mean PR-AUC (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "600px")
```

##### vs model-averaged

```{r Perf_PRAUC_StackedvsMA, eval=TRUE, echo=FALSE}
m %>%
  dplyr::filter(type %in% c("MA", "ST")) %>%
  dplyr::select(type, prauc) %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(n = n(), avg = mean(prauc)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(mx = max(avg)) %>%
  dplyr::mutate(incr = 100*(mx - avg)/avg) %>%
  knitr::kable(., row.names = TRUE, align = "lcccc",
               col.names = c("Model type", "n", "Mean PR-AUC",
                             "Max(mean PR-AUC)", "Increase Mean PR-AUC (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "600px")
```

## Composite plots

```{r Perf_CompositeSetup, eval=TRUE, echo=FALSE}
# The individual models:
indiv <- indiv.stats
  %>% dplyr::mutate(source = as.character(labs$source))

# Soft voting model:
sv <- sv.stats %>% dplyr::mutate(source = "Soft vote")

# Model averaged (unrestricted) Set 1:
ma <- ma.un.set1.stats %>%
  dplyr::slice(1:10) %>%
  dplyr::mutate(source = "Weighted average")

# Stacked (ridge) model:
stack.ridge <- stack.ridge.stats %>% dplyr::mutate(source = "Stacked_R")

# Stacked (lasso) model:
stack.lasso <- stack.lasso.stats %>% dplyr::mutate(source = "Stacked_L")

# Stacked (elastic-net) model:
stack.en <- stack.en.stats %>% dplyr::mutate(source = "Stacked_EN")

m2 <- dplyr::bind_rows(indiv, sv, ma, stack.ridge, stack.lasso, stack.en) %>%
  dplyr::mutate(source = factor(source, levels = c("DeWolf_2003", "Shah_2013", "Shah_2014",
                                                   "Shah_2019", "New", "Soft vote",
                                                   "Weighted average", "Stacked_L",
                                                   "Stacked_R", "Stacked_EN"))) %>%
  dplyr::mutate(gen = forcats::fct_collapse(source,
                                            gen1 = "DeWolf_2003",
                                            gen2 = c("Shah_2013", "Shah_2014"),
                                            gen3 = c("Shah_2019", "New"),
                                            sv = "Soft vote",
                                            ma = "Weighted average",
                                            stkl = "Stacked_L",
                                            stkr = "Stacked_R",
                                            stken = "Stacked_EN")) %>%
  dplyr::mutate(source = factor(gen,
                                levels = c("gen1", "gen2", "gen3", "sv", "ma",
                                           "stkl", "stkr", "stken"),
                                labels = c("1st generation", "2nd generation", "3rd generation",
                                           "Soft vote", "Weighted average", "Stacked (lasso)",
                                           "Stacked (ridge)", "Stacked (elastic-net)"))) %>%
  dplyr::select(id, source, everything())
```

```{r Perf_Composite_Functions, eval=TRUE, echo=FALSE}
# This function is taken from https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html
# It is for combining plots, and placing a common legend.
grid_arrange_shared_legend <- function(..., ncol = length(list(...)), nrow = 1,
                                       position = c("bottom", "right")) {
  plots <- list(...)
  position <- match.arg(position)
  g <- ggplot2::ggplotGrob(plots[[1]] + theme(legend.position = position))$grobs
  legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
  lheight <- sum(legend$height)
  lwidth <- sum(legend$width)
  gl <- lapply(plots, function(x) x + theme(legend.position = "none"))
  gl <- c(gl, ncol = ncol, nrow = nrow)

  combined <- switch(
    position,
    "bottom" = gridExtra::arrangeGrob(
      do.call(gridExtra::arrangeGrob, gl),
      legend,
      ncol = 1,
      heights = grid::unit.c(unit(1, "npc") - lheight, lheight)
    ),
    "right" = gridExtra::arrangeGrob(
      do.call(gridExtra::arrangeGrob, gl),
      legend,
      ncol = 2,
      widths = grid::unit.c(unit(1, "npc") - lwidth, lwidth)
    )
  )

  grid::grid.newpage()
  grid::grid.draw(combined)

  # return gtable invisibly
  invisible(combined)
}  # end function grid_arrange_shared_legend


# Instead of repeating code for each of the plots, use a function:
plt.fcn <- function(.x, .y, .xlab, .ylab, .title) {
  # Plot variable 1 vs variable 2 for the LR, SV, MA and SR models
  # Args:
  #  .x = unquoted name for the x-axis variable
  #  .y = unquoted name for the y-axis variable
  #  .xlab = character string for the x-axis label
  #  .ylab = character string for the y-axis label
  #  .title = character string for the plot title
  # Returns:
  #  a ggplot object
  #
  var1 <- enquo(.x)
  var2 <- enquo(.y)

  # The color scheme for the points:
  col.vec <- c(RColorBrewer::brewer.pal(8, "Dark2")[1:3],
               "deepskyblue",
               RColorBrewer::brewer.pal(8, "Dark2")[5],
               "grey0", "grey35", "grey65")

  m2 %>%
    dplyr::select(!!var1, !!var2, source) %>%
    ggplot(., aes(x = !!var1, y = !!var2, colour = source, shape = source, size = source)) +
    theme_bw() +
    scale_shape_manual(values = c(16, 16, 16, 15, 17, 18, 18, 18)) +
    scale_size_manual(values = c(2, 2, 2, 4, 4, 6, 6, 6)) +
    scale_colour_manual(values = col.vec) +
    geom_point(alpha = 0.8) +
    # change the legend position and shape:
    theme(legend.position = "bottom", legend.direction = "horizontal", legend.box = "vertical") +
    guides(colour = guide_legend(NULL, ncol = 4),
           shape = guide_legend(NULL, ncol = 4),
           size = guide_legend(NULL, ncol = 4)) +
    ylab(.ylab) +
    theme(axis.title.y = element_text(face = "bold", size = 12)) +
    theme(axis.text.y = element_text(size = rel(1.0))) +
    xlab(.xlab) +
    theme(axis.title.x = element_text(face = "bold", size = 12)) +
    theme(axis.text.x = element_text(size = rel(1.0))) +
    geom_smooth(aes(x = !!var1, y = !!var2), method = "lm", inherit.aes = FALSE,
                se = FALSE, size = 0.5, colour = "black", linetype = "dashed") +
    ggtitle(.title) +
    theme(plot.title = element_text(face = "bold"))
}  # end function plt.fcn

# Simple way of combining the plots:
# gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)
```

**Figure 2**

```{r Perf_Composite_Set3, eval=TRUE, echo=FALSE}
# Sensitivity and specificity:
p1 <- plt.fcn(.x = Se, .y = Sp, .xlab = "Se", .ylab = "Sp", .title = "A")

# Informedness and markedness:
p2 <- plt.fcn(.x = ifd, .y = mkd, .xlab = "IFD", .ylab = "MKD", .title = "B")

# PR-AUC and ROC-AUC:
p3 <- plt.fcn(.x = auc, .y = prauc, .xlab = "ROC-AUC", .ylab = "PR-AUC", .title = "C")

# IMN and MCEN:
p4 <- plt.fcn(.x = IMN, .y = MCEN, .xlab = "IMN", .ylab = "MCEN", .title = "D")

grid_arrange_shared_legend(p1, p2, p3, p4, ncol = 2, nrow = 2)
```

# Correlations between measures

**S6 Figure**

```{r MeasureCorrelationsIII_SF, eval=TRUE, echo=FALSE}
indiv.stats %>%
  dplyr::select(Se, Sp, ifd, mkd, IMN, MCEN, auc, prauc) %>%
  psych::pairs.panels(., scale = FALSE,
                      labels = c("Se", "Sp", "IFD", "MKD", "IMN", "MCEN", "ROC-AUC", "PR-AUC"))
```

--------------------------------------------------------------------------------------------------

# Computational environment

```{r SessionInfo, eval=TRUE, echo=FALSE, results='markup'}
R.Version()$version.string
R.Version()$system
sessionInfo()
```