Chapter 5 Prediction from binary outcome

In this chapter, we will talk about Regression that deals with prediction of binary outcomes. We will use logistic regression to build the first prediction model.

  • Watch the video describing this chapter video

5.1 Read previously saved data

ObsData <- readRDS(file = "data/rhcAnalytic.RDS")

5.2 Outcome levels (factor)

  • Label
    • Possible values of outcome
levels(ObsData$Death)=c("No","Yes") # this is useful for caret
# ref: https://tinyurl.com/caretbin
class(ObsData$Death)
## [1] "factor"
table(ObsData$Death)
## 
##   No  Yes 
## 2013 3722

5.3 Measuring prediction error

  • Brier score
    • Brier score 0 means perfect prediction, and
    • close to zero means better prediction,
    • 1 being the worst prediction.
    • Less accurate forecasts get higher score in Brier score.
  • AUC
    • The area under a ROC curve is called as a c statistics.
    • c being 0.5 means random prediction and
    • 1 indicates perfect prediction

5.3.1 Prediction for death

In this section, we show the regression fitting when outcome is binary (death).

5.4 Variables

baselinevars <- names(dplyr::select(ObsData, 
                         !c(Length.of.Stay,Death)))
baselinevars
##  [1] "Disease.category"      "Cancer"                "Cardiovascular"       
##  [4] "Congestive.HF"         "Dementia"              "Psychiatric"          
##  [7] "Pulmonary"             "Renal"                 "Hepatic"              
## [10] "GI.Bleed"              "Tumor"                 "Immunosupperssion"    
## [13] "Transfer.hx"           "MI"                    "age"                  
## [16] "sex"                   "edu"                   "DASIndex"             
## [19] "APACHE.score"          "Glasgow.Coma.Score"    "blood.pressure"       
## [22] "WBC"                   "Heart.rate"            "Respiratory.rate"     
## [25] "Temperature"           "PaO2vs.FIO2"           "Albumin"              
## [28] "Hematocrit"            "Bilirubin"             "Creatinine"           
## [31] "Sodium"                "Potassium"             "PaCo2"                
## [34] "PH"                    "Weight"                "DNR.status"           
## [37] "Medical.insurance"     "Respiratory.Diag"      "Cardiovascular.Diag"  
## [40] "Neurological.Diag"     "Gastrointestinal.Diag" "Renal.Diag"           
## [43] "Metabolic.Diag"        "Hematologic.Diag"      "Sepsis.Diag"          
## [46] "Trauma.Diag"           "Orthopedic.Diag"       "race"                 
## [49] "income"                "RHC.use"

5.5 Model

# adjust covariates
out.formula2 <- as.formula(paste("Death~ ", 
                               paste(baselinevars, 
                                     collapse = "+")))
saveRDS(out.formula2, file = "data/form2.RDS")
fit2 <- glm(out.formula2, data = ObsData, 
            family = binomial(link = "logit"))
require(Publish)
adj.fit2 <- publish(fit2, digits=1)$regressionTable
out.formula2
## Death ~ Disease.category + Cancer + Cardiovascular + Congestive.HF + 
##     Dementia + Psychiatric + Pulmonary + Renal + Hepatic + GI.Bleed + 
##     Tumor + Immunosupperssion + Transfer.hx + MI + age + sex + 
##     edu + DASIndex + APACHE.score + Glasgow.Coma.Score + blood.pressure + 
##     WBC + Heart.rate + Respiratory.rate + Temperature + PaO2vs.FIO2 + 
##     Albumin + Hematocrit + Bilirubin + Creatinine + Sodium + 
##     Potassium + PaCo2 + PH + Weight + DNR.status + Medical.insurance + 
##     Respiratory.Diag + Cardiovascular.Diag + Neurological.Diag + 
##     Gastrointestinal.Diag + Renal.Diag + Metabolic.Diag + Hematologic.Diag + 
##     Sepsis.Diag + Trauma.Diag + Orthopedic.Diag + race + income + 
##     RHC.use
adj.fit2
##                 Variable               Units OddsRatio      CI.95 p-value
## 1       Disease.category                 ARF       Ref                   
## 2                                        CHF       1.0  [0.8;1.4]     0.8
## 3                                      Other       1.6  [1.3;2.0]    <0.1
## 4                                       MOSF       1.0  [0.9;1.2]     0.7
## 5                 Cancer                None       Ref                   
## 6                            Localized (Yes)       6.6 [2.2;19.4]    <0.1
## 7                                 Metastatic      26.4 [8.3;83.6]    <0.1
## 8         Cardiovascular                   0       Ref                   
## 9                                          1       1.3  [1.0;1.5]    <0.1
## 10         Congestive.HF                   0       Ref                   
## 11                                         1       1.6  [1.3;1.9]    <0.1
## 12              Dementia                   0       Ref                   
## 13                                         1       1.3  [1.0;1.6]    <0.1
## 14           Psychiatric                   0       Ref                   
## 15                                         1       0.9  [0.7;1.2]     0.6
## 16             Pulmonary                   0       Ref                   
## 17                                         1       1.0  [0.9;1.2]     0.7
## 18                 Renal                   0       Ref                   
## 19                                         1       1.3  [0.9;1.9]     0.2
## 20               Hepatic                   0       Ref                   
## 21                                         1       1.3  [0.9;1.8]     0.1
## 22              GI.Bleed                   0       Ref                   
## 23                                         1       1.2  [0.8;1.9]     0.3
## 24                 Tumor                   0       Ref                   
## 25                                         1       0.3  [0.1;0.9]    <0.1
## 26     Immunosupperssion                   0       Ref                   
## 27                                         1       1.2  [1.1;1.4]    <0.1
## 28           Transfer.hx                   0       Ref                   
## 29                                         1       0.8  [0.7;1.0]    <0.1
## 30                    MI                   0       Ref                   
## 31                                         1       0.8  [0.6;1.2]     0.3
## 32                   age           [-Inf,50)       Ref                   
## 33                                   [50,60)       1.4  [1.2;1.8]    <0.1
## 34                                   [60,70)       2.1  [1.7;2.5]    <0.1
## 35                                   [70,80)       2.1  [1.6;2.6]    <0.1
## 36                                 [80, Inf)       2.8  [2.1;3.8]    <0.1
## 37                   sex                Male       Ref                   
## 38                                    Female       0.8  [0.7;0.9]    <0.1
## 39                   edu                           1.0  [1.0;1.0]     0.4
## 40              DASIndex                           1.0  [0.9;1.0]    <0.1
## 41          APACHE.score                           1.0  [1.0;1.0]    <0.1
## 42    Glasgow.Coma.Score                           1.0  [1.0;1.0]    <0.1
## 43        blood.pressure                           1.0  [1.0;1.0]     0.4
## 44                   WBC                           1.0  [1.0;1.0]     0.2
## 45            Heart.rate                           1.0  [1.0;1.0]     0.7
## 46      Respiratory.rate                           1.0  [1.0;1.0]     0.8
## 47           Temperature                           0.9  [0.9;1.0]    <0.1
## 48           PaO2vs.FIO2                           1.0  [1.0;1.0]     0.3
## 49               Albumin                           1.0  [0.9;1.1]     0.6
## 50            Hematocrit                           1.0  [1.0;1.0]    <0.1
## 51             Bilirubin                           1.0  [1.0;1.1]    <0.1
## 52            Creatinine                           1.0  [1.0;1.0]     0.9
## 53                Sodium                           1.0  [1.0;1.0]     0.7
## 54             Potassium                           1.0  [0.9;1.1]     0.9
## 55                 PaCo2                           1.0  [1.0;1.0]     0.3
## 56                    PH                           1.1  [0.5;2.3]     0.9
## 57                Weight                           1.0  [1.0;1.0]    <0.1
## 58            DNR.status                  No       Ref                   
## 59                                       Yes       2.6  [2.0;3.3]    <0.1
## 60     Medical.insurance            Medicaid       Ref                   
## 61                                  Medicare       1.6  [1.2;2.0]    <0.1
## 62                       Medicare & Medicaid       1.4  [1.0;1.9]    <0.1
## 63                              No insurance       1.5  [1.1;2.0]    <0.1
## 64                                   Private       1.3  [1.1;1.7]    <0.1
## 65                        Private & Medicare       1.3  [1.0;1.7]    <0.1
## 66      Respiratory.Diag                  No       Ref                   
## 67                                       Yes       1.2  [1.0;1.4]    <0.1
## 68   Cardiovascular.Diag                  No       Ref                   
## 69                                       Yes       1.2  [1.0;1.4]    <0.1
## 70     Neurological.Diag                  No       Ref                   
## 71                                       Yes       1.5  [1.2;1.9]    <0.1
## 72 Gastrointestinal.Diag                  No       Ref                   
## 73                                       Yes       1.3  [1.1;1.6]    <0.1
## 74            Renal.Diag                  No       Ref                   
## 75                                       Yes       0.8  [0.6;1.1]     0.2
## 76        Metabolic.Diag                  No       Ref                   
## 77                                       Yes       1.0  [0.8;1.4]     0.8
## 78      Hematologic.Diag                  No       Ref                   
## 79                                       Yes       2.7  [2.0;3.8]    <0.1
## 80           Sepsis.Diag                  No       Ref                   
## 81                                       Yes       1.1  [0.9;1.4]     0.2
## 82           Trauma.Diag                  No       Ref                   
## 83                                       Yes       0.8  [0.4;1.4]     0.4
## 84       Orthopedic.Diag                  No       Ref                   
## 85                                       Yes       1.4  [0.2;8.1]     0.7
## 86                  race               white       Ref                   
## 87                                     black       1.0  [0.8;1.2]     0.9
## 88                                     other       1.1  [0.8;1.4]     0.7
## 89                income            $11-$25k       Ref                   
## 90                                  $25-$50k       0.8  [0.7;1.0]    <0.1
## 91                                    > $50k       0.8  [0.6;1.1]     0.2
## 92                                Under $11k       1.2  [1.0;1.4]    <0.1
## 93               RHC.use                           1.4  [1.2;1.6]    <0.1

5.6 Measuring prediction error

5.6.1 AUC

require(pROC)
obs.y2<-ObsData$Death
pred.y2 <- predict(fit2, type = "response")
rocobj <- roc(obs.y2, pred.y2)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
rocobj
## 
## Call:
## roc.default(response = obs.y2, predictor = pred.y2)
## 
## Data: pred.y2 in 2013 controls (obs.y2 No) < 3722 cases (obs.y2 Yes).
## Area under the curve: 0.7682
plot(rocobj)

auc(rocobj)
## Area under the curve: 0.7682

5.6.2 Brier Score

require(DescTools)
BrierScore(fit2)
## [1] 0.1812502

5.7 Cross-validation using caret

5.7.1 Basic setup

# Using Caret package
set.seed(504)
# make a 5-fold CV
require(caret)
ctrl<-trainControl(method = "cv", number = 5, 
                   classProbs = TRUE,
                   summaryFunction = twoClassSummary)
# fit the model with formula = out.formula2
# use training method glm (have to specify family)
fit.cv.bin<-train(out.formula2, trControl = ctrl,
               data = ObsData, method = "glm",
              family = binomial(),
              metric="ROC")
fit.cv.bin
## Generalized Linear Model 
## 
## 5735 samples
##   50 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4587, 4589, 4587, 4589, 4588 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7545115  0.4659618  0.8535653

5.7.2 Extract results from each test data

summary.res <- fit.cv.bin$resample
summary.res
##         ROC      Sens      Spec Resample
## 1 0.7444835 0.4739454 0.8630872    Fold1
## 2 0.7544836 0.4502488 0.8561828    Fold2
## 3 0.7786734 0.4739454 0.8738255    Fold3
## 4 0.7350679 0.4626866 0.8373656    Fold4
## 5 0.7598488 0.4689826 0.8373656    Fold5
mean(fit.cv.bin$resample$ROC)
## [1] 0.7545115
sd(fit.cv.bin$resample$ROC)
## [1] 0.01651437

5.7.3 More options

ctrl<-trainControl(method = "cv", number = 5, 
                   classProbs = TRUE,
                   summaryFunction = twoClassSummary)
fit.cv.bin<-train(out.formula2, trControl = ctrl,
               data = ObsData, method = "glm",
              family = binomial(),
              metric="ROC",
              preProc = c("center", "scale"))
fit.cv.bin
## Generalized Linear Model 
## 
## 5735 samples
##   50 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (63), scaled (63) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4588, 4589, 4587, 4588, 4588 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7548047  0.4629717  0.8530367

5.8 Variable selection

We can also use stepwise regression that uses AIC as a criterion.

set.seed(504)
ctrl<-trainControl(method = "cv", number = 5, 
                   classProbs = TRUE,
                   summaryFunction = twoClassSummary)
fit.cv.bin.aic<-train(out.formula2, trControl = ctrl,
               data = ObsData, method = "glmStepAIC",
               direction ="backward",
              family = binomial(),
              metric="ROC")
fit.cv.bin.aic
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 5735 samples
##   50 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4587, 4589, 4587, 4589, 4588 
## Resampling results:
## 
##   ROC        Sens      Spec     
##   0.7540424  0.464468  0.8562535
summary(fit.cv.bin.aic)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8626  -0.9960   0.5052   0.8638   1.9578  
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             1.0783624  0.7822168   1.379 0.168019
## Disease.categoryOther                   0.4495099  0.0919860   4.887 1.03e-06
## `CancerLocalized (Yes)`                 1.8942512  0.5501880   3.443 0.000575
## CancerMetastatic                        3.2703316  0.5858715   5.582 2.38e-08
## Cardiovascular1                         0.2386749  0.0939617   2.540 0.011081
## Congestive.HF1                          0.4539010  0.0971624   4.672 2.99e-06
## Dementia1                               0.2380213  0.1162903   2.047 0.040679
## Hepatic1                                0.3593093  0.1541762   2.331 0.019779
## Tumor1                                 -1.2455123  0.5542624  -2.247 0.024630
## Immunosupperssion1                      0.2174294  0.0730803   2.975 0.002928
## Transfer.hx1                           -0.1849029  0.0945679  -1.955 0.050555
## `age[50,60)`                            0.3621248  0.0984288   3.679 0.000234
## `age[60,70)`                            0.6941924  0.0968434   7.168 7.60e-13
## `age[70,80)`                            0.6804939  0.1126637   6.040 1.54e-09
## `age[80, Inf)`                          0.9833851  0.1410563   6.972 3.13e-12
## sexFemale                              -0.2805950  0.0653527  -4.294 1.76e-05
## DASIndex                               -0.0429272  0.0062191  -6.902 5.11e-12
## APACHE.score                            0.0174907  0.0020017   8.738  < 2e-16
## Glasgow.Coma.Score                      0.0093657  0.0012563   7.455 9.00e-14
## WBC                                     0.0044518  0.0030090   1.479 0.139009
## Temperature                            -0.0524703  0.0192757  -2.722 0.006487
## PaO2vs.FIO2                             0.0004741  0.0003054   1.552 0.120548
## Hematocrit                             -0.0154796  0.0041593  -3.722 0.000198
## Bilirubin                               0.0313087  0.0094004   3.331 0.000867
## Weight                                 -0.0031548  0.0011213  -2.813 0.004902
## DNR.statusYes                           0.9347360  0.1326924   7.044 1.86e-12
## Medical.insuranceMedicare               0.4764895  0.1257582   3.789 0.000151
## `Medical.insuranceMedicare & Medicaid`  0.3364916  0.1584757   2.123 0.033729
## `Medical.insuranceNo insurance`         0.3711345  0.1568820   2.366 0.017996
## Medical.insurancePrivate                0.2632637  0.1139805   2.310 0.020903
## `Medical.insurancePrivate & Medicare`   0.2819715  0.1313101   2.147 0.031764
## Respiratory.DiagYes                     0.1393974  0.0769026   1.813 0.069886
## Cardiovascular.DiagYes                  0.1804967  0.0836679   2.157 0.030982
## Neurological.DiagYes                    0.4320266  0.1189357   3.632 0.000281
## Gastrointestinal.DiagYes                0.2819563  0.1092206   2.582 0.009836
## Hematologic.DiagYes                     0.9734424  0.1651363   5.895 3.75e-09
## Sepsis.DiagYes                          0.1539651  0.0943235   1.632 0.102614
## `incomeUnder $11k`                      0.2151437  0.0689392   3.121 0.001804
## RHC.use                                 0.3552053  0.0713632   4.977 6.44e-07
##                                           
## (Intercept)                               
## Disease.categoryOther                  ***
## `CancerLocalized (Yes)`                ***
## CancerMetastatic                       ***
## Cardiovascular1                        *  
## Congestive.HF1                         ***
## Dementia1                              *  
## Hepatic1                               *  
## Tumor1                                 *  
## Immunosupperssion1                     ** 
## Transfer.hx1                           .  
## `age[50,60)`                           ***
## `age[60,70)`                           ***
## `age[70,80)`                           ***
## `age[80, Inf)`                         ***
## sexFemale                              ***
## DASIndex                               ***
## APACHE.score                           ***
## Glasgow.Coma.Score                     ***
## WBC                                       
## Temperature                            ** 
## PaO2vs.FIO2                               
## Hematocrit                             ***
## Bilirubin                              ***
## Weight                                 ** 
## DNR.statusYes                          ***
## Medical.insuranceMedicare              ***
## `Medical.insuranceMedicare & Medicaid` *  
## `Medical.insuranceNo insurance`        *  
## Medical.insurancePrivate               *  
## `Medical.insurancePrivate & Medicare`  *  
## Respiratory.DiagYes                    .  
## Cardiovascular.DiagYes                 *  
## Neurological.DiagYes                   ***
## Gastrointestinal.DiagYes               ** 
## Hematologic.DiagYes                    ***
## Sepsis.DiagYes                            
## `incomeUnder $11k`                     ** 
## RHC.use                                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7433.3  on 5734  degrees of freedom
## Residual deviance: 6198.0  on 5696  degrees of freedom
## AIC: 6276
## 
## Number of Fisher Scoring iterations: 5