Binary outcome

We focus on statistical analysis and modeling of a binary outcome (cholesterol level) that is categorized as either “healthy” or “unhealthy.”

Explore relationships for binary outcome variable

Load data

load(file = "Data/predictivefactors/cholesterolNHANES15part1.RData")

Creating binary variable

Binary categorization: The cholesterol variable is converted into a binary outcome (“healthy” or “unhealthy”) using the ifelse function based on a threshold value of 200.

Re-leveling: The reference category for the binary variable is changed to “unhealthy.”

# Binary variable
analytic3$cholesterol.bin <- ifelse(analytic3$cholesterol < 200, "healthy", "unhealthy")
table(analytic3$cholesterol.bin)
#> 
#>   healthy unhealthy 
#>      1586      1046

# Changing the reference category
analytic3$cholesterol.bin <- as.factor(analytic3$cholesterol.bin)
analytic3$cholesterol.bin <- relevel(analytic3$cholesterol.bin, ref = "unhealthy")
table(analytic3$cholesterol.bin)
#> 
#> unhealthy   healthy 
#>      1046      1586

Modelling data

A logistic regression is fitted to predict the binary cholesterol outcome from multiple predictor variables.

Tip

We use the glm function to run generalized linear models. The default family is gaussian with identity link. Setting binomial family with logit link (logit link is default for binomial family) means fitting logistic regression.

# Regression model
formula5x <- as.formula("cholesterol.bin~gender + age + born + 
             race + education + married + 
             income + diastolicBP + systolicBP + 
             bmi + bodyweight + bodyheight + waist +  
             triglycerides + uric.acid + 
             protein + bilirubin + phosphorus + sodium + potassium + 
             globulin + calcium + physical.work + physical.recreational + 
             diabetes")

# Summary
fit5x <- glm(formula5x, family = binomial(), data = analytic3)
publish(fit5x)
#>               Variable                            Units OddsRatio       CI.95    p-value 
#>                 gender                           Female       Ref                        
#>                                                    Male      1.68 [1.27;2.23]   0.000313 
#>                    age                                       0.97 [0.97;0.98]    < 1e-04 
#>                   born Born in 50 US states or Washingt       Ref                        
#>                                                  Others      0.69 [0.54;0.88]   0.002636 
#>                   race                            Black       Ref                        
#>                                                Hispanic      1.29 [0.95;1.75]   0.107104 
#>                                                   Other      1.24 [0.87;1.79]   0.234062 
#>                                                   White      1.10 [0.83;1.44]   0.505147 
#>              education                          College       Ref                        
#>                                             High.School      0.84 [0.68;1.02]   0.082168 
#>                                                  School      1.23 [0.84;1.82]   0.292070 
#>                married                          Married       Ref                        
#>                                           Never.married      1.29 [1.00;1.67]   0.052969 
#>                                      Previously.married      0.89 [0.70;1.13]   0.345688 
#>                 income                             <25k       Ref                        
#>                                        Between.25kto54k      1.01 [0.78;1.29]   0.957537 
#>                                        Between.55kto99k      0.90 [0.69;1.19]   0.462854 
#>                                                Over100k      0.90 [0.66;1.21]   0.472137 
#>            diastolicBP                                       0.98 [0.97;0.98]    < 1e-04 
#>             systolicBP                                       1.01 [1.00;1.01]   0.029513 
#>                    bmi                                       1.11 [0.99;1.24]   0.065627 
#>             bodyweight                                       0.96 [0.92;1.00]   0.045338 
#>             bodyheight                                       1.05 [1.00;1.09]   0.030995 
#>                  waist                                       1.01 [0.99;1.02]   0.464825 
#>          triglycerides                                       0.99 [0.99;0.99]    < 1e-04 
#>              uric.acid                                       0.96 [0.89;1.03]   0.273792 
#>                protein                                       0.61 [0.42;0.89]   0.009192 
#>              bilirubin                                       1.19 [0.86;1.66]   0.292632 
#>             phosphorus                                       0.96 [0.81;1.13]   0.610931 
#>                 sodium                                       1.06 [1.02;1.11]   0.007980 
#>              potassium                                       0.95 [0.71;1.26]   0.729218 
#>               globulin                                       1.38 [0.94;2.01]   0.101667 
#>                calcium                                       0.64 [0.47;0.89]   0.007026 
#>          physical.work                               No       Ref                        
#>                                                     Yes      0.91 [0.74;1.12]   0.392539 
#>  physical.recreational                               No       Ref                        
#>                                                     Yes      1.05 [0.85;1.29]   0.681388 
#>               diabetes                               No       Ref                        
#>                                                     Yes      2.68 [2.02;3.56]    < 1e-04

# VIF
car::vif(fit5x)
#>                             GVIF Df GVIF^(1/(2*Df))
#> gender                  2.735258  1        1.653862
#> age                     2.121098  1        1.456399
#> born                    1.664094  1        1.289998
#> race                    2.585539  3        1.171544
#> education               1.458430  2        1.098933
#> married                 1.432595  2        1.094034
#> income                  1.426911  3        1.061043
#> diastolicBP             1.297308  1        1.138994
#> systolicBP              1.614374  1        1.270580
#> bmi                    81.928815  1        9.051454
#> bodyweight            103.125772  1       10.155086
#> bodyheight             22.647853  1        4.758976
#> waist                  11.493710  1        3.390237
#> triglycerides           1.258340  1        1.121758
#> uric.acid               1.636512  1        1.279262
#> protein                 3.684816  1        1.919587
#> bilirubin               1.186181  1        1.089119
#> phosphorus              1.117915  1        1.057315
#> sodium                  1.123193  1        1.059808
#> potassium               1.181358  1        1.086903
#> globulin                3.427401  1        1.851324
#> calcium                 1.543019  1        1.242183
#> physical.work           1.090958  1        1.044490
#> physical.recreational   1.218558  1        1.103883
#> diabetes                1.212365  1        1.101074

AUC

The Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) is calculated to assess the model’s predictive performance in terms of discrimination. The AUC would tell how much the model is capable of distinguishing between healthy and unhealthy levels.

Let us measure the accuracy for classification models fit5x.

Tip

We can use the roc function to build a ROC curve and auc function to calculate the AUC (are under the ROC curve) value.

require(pROC)
pred.y <- predict(fit5x, type = "response")
rocobj <- roc(analytic3$cholesterol.bin, pred.y)
#> Setting levels: control = unhealthy, case = healthy
#> Setting direction: controls < cases
rocobj
#> 
#> Call:
#> roc.default(response = analytic3$cholesterol.bin, predictor = pred.y)
#> 
#> Data: pred.y in 1046 controls (analytic3$cholesterol.bin unhealthy) < 1586 cases (analytic3$cholesterol.bin healthy).
#> Area under the curve: 0.7411

auc(rocobj)
#> Area under the curve: 0.7411

Re-modelling

Let us re-fit the model and measure the AUC. VIF is calculated again for this new model.

formula5 <- as.formula("cholesterol.bin~gender + age + born + 
             race + education + married + 
             income + diastolicBP + systolicBP + 
             bmi +
             triglycerides + uric.acid + 
             protein + bilirubin + phosphorus + sodium + potassium + 
             globulin + calcium + physical.work + physical.recreational + 
             diabetes")
fit5 <- glm(formula5, family = binomial(), data = analytic3)
publish(fit5)
#>               Variable                            Units OddsRatio       CI.95    p-value 
#>                 gender                           Female       Ref                        
#>                                                    Male      1.86 [1.48;2.33]    < 1e-04 
#>                    age                                       0.97 [0.97;0.98]    < 1e-04 
#>                   born Born in 50 US states or Washingt       Ref                        
#>                                                  Others      0.67 [0.52;0.85]   0.001233 
#>                   race                            Black       Ref                        
#>                                                Hispanic      1.26 [0.94;1.69]   0.128165 
#>                                                   Other      1.21 [0.85;1.73]   0.284083 
#>                                                   White      1.11 [0.85;1.45]   0.460652 
#>              education                          College       Ref                        
#>                                             High.School      0.84 [0.68;1.02]   0.083806 
#>                                                  School      1.22 [0.83;1.80]   0.305514 
#>                married                          Married       Ref                        
#>                                           Never.married      1.29 [1.00;1.68]   0.049983 
#>                                      Previously.married      0.89 [0.70;1.13]   0.339401 
#>                 income                             <25k       Ref                        
#>                                        Between.25kto54k      1.00 [0.78;1.28]   0.999445 
#>                                        Between.55kto99k      0.90 [0.68;1.17]   0.425427 
#>                                                Over100k      0.89 [0.66;1.20]   0.447012 
#>            diastolicBP                                       0.98 [0.97;0.99]    < 1e-04 
#>             systolicBP                                       1.01 [1.00;1.01]   0.042769 
#>                    bmi                                       1.01 [0.99;1.02]   0.496430 
#>          triglycerides                                       0.99 [0.99;0.99]    < 1e-04 
#>              uric.acid                                       0.96 [0.89;1.03]   0.242942 
#>                protein                                       0.62 [0.43;0.89]   0.010343 
#>              bilirubin                                       1.24 [0.89;1.72]   0.203993 
#>             phosphorus                                       0.95 [0.80;1.12]   0.539847 
#>                 sodium                                       1.06 [1.02;1.11]   0.006777 
#>              potassium                                       0.96 [0.72;1.28]   0.790080 
#>               globulin                                       1.37 [0.94;2.00]   0.102430 
#>                calcium                                       0.64 [0.46;0.88]   0.005772 
#>          physical.work                               No       Ref                        
#>                                                     Yes      0.91 [0.74;1.12]   0.382281 
#>  physical.recreational                               No       Ref                        
#>                                                     Yes      1.04 [0.85;1.29]   0.682962 
#>               diabetes                               No       Ref                        
#>                                                     Yes      2.69 [2.03;3.57]    < 1e-04

# VIF
car::vif(fit5)
#>                           GVIF Df GVIF^(1/(2*Df))
#> gender                1.749947  1        1.322856
#> age                   1.850160  1        1.360206
#> born                  1.640947  1        1.280994
#> race                  2.345460  3        1.152669
#> education             1.430721  2        1.093676
#> married               1.432015  2        1.093923
#> income                1.409064  3        1.058819
#> diastolicBP           1.289411  1        1.135523
#> systolicBP            1.605248  1        1.266984
#> bmi                   1.477795  1        1.215646
#> triglycerides         1.246395  1        1.116421
#> uric.acid             1.624039  1        1.274378
#> protein               3.648367  1        1.910070
#> bilirubin             1.177643  1        1.085193
#> phosphorus            1.114298  1        1.055603
#> sodium                1.117463  1        1.057101
#> potassium             1.176914  1        1.084857
#> globulin              3.395946  1        1.842809
#> calcium               1.542486  1        1.241969
#> physical.work         1.089742  1        1.043907
#> physical.recreational 1.197719  1        1.094404
#> diabetes              1.200402  1        1.095629

The AUC for this new model is also calculated.

#### AUC
pred.y <- predict(fit5, type = "response")
rocobj <- roc(analytic3$cholesterol.bin, pred.y)
#> Setting levels: control = unhealthy, case = healthy
#> Setting direction: controls < cases
rocobj
#> 
#> Call:
#> roc.default(response = analytic3$cholesterol.bin, predictor = pred.y)
#> 
#> Data: pred.y in 1046 controls (analytic3$cholesterol.bin unhealthy) < 1586 cases (analytic3$cholesterol.bin healthy).
#> Area under the curve: 0.7406
auc(rocobj)
#> Area under the curve: 0.7406

Save data

save.image(file = "Data/predictivefactors/cholesterolNHANES15part2.RData")

References