Collinear predictors

In this tutorial, we’ll continue with the data analysis. We’ll focus on an analysis of an NHANES data. This data contains over 1200 observations and 33 variables. These variables come in various types: numeric, categorical, and binary. Our primary goal is to fit a linear regression model to predict cholesterol levels.

# Load required packages
library(rms)
library(Hmisc)

Load data

Let us load the dataset and see structure of the variables:

load(file = "Data/predictivefactors/cholesterolNHANES15.RData")
#head(analytic)
str(analytic)
#> 'data.frame':    1267 obs. of  33 variables:
#>  $ ID                   : num  83732 83733 83741 83747 83750 ...
#>  $ gender               : chr  "Male" "Male" "Male" "Male" ...
#>  $ age                  : num  62 53 22 46 45 30 60 69 24 70 ...
#>  $ born                 : chr  "Born in 50 US states or Washingt" "Others" "Born in 50 US states or Washingt" "Others" ...
#>  $ race                 : chr  "White" "White" "Black" "White" ...
#>  $ education            : chr  "College" "High.School" "College" "College" ...
#>  $ married              : chr  "Married" "Previously.married" "Never.married" "Married" ...
#>  $ income               : chr  "Between.55kto99k" "<25k" "Between.25kto54k" "<25k" ...
#>  $ weight               : num  135630 25282 39353 35674 97002 ...
#>  $ psu                  : num  1 1 2 1 1 1 1 2 1 2 ...
#>  $ strata               : num  125 125 128 121 125 124 128 120 130 132 ...
#>  $ diastolicBP          : num  70 88 70 94 70 50 74 70 72 54 ...
#>  $ systolicBP           : num  128 146 110 144 116 104 142 146 126 144 ...
#>  $ bodyweight           : num  94.8 90.4 76.6 86.2 76.2 71.2 75.6 84 89.2 81.7 ...
#>  $ bodyheight           : num  184 171 165 177 178 ...
#>  $ bmi                  : num  27.8 30.8 28 27.6 24.1 26.6 35.9 31 26.9 27 ...
#>  $ waist                : num  101.1 107.9 86.6 104.3 90.1 ...
#>  $ smoke                : chr  "Not.at.all" "Every.day" "Some.days" "Every.day" ...
#>  $ alcohol              : num  1 6 8 1 3 2 1 1 2 2 ...
#>  $ cholesterol          : num  173 265 164 242 181 184 205 287 126 192 ...
#>  $ cholesterolM2        : num  4.47 6.85 4.24 6.26 4.68 4.76 5.3 7.42 3.26 4.97 ...
#>  $ triglycerides        : num  158 170 77 497 63 62 169 245 95 64 ...
#>  $ uric.acid            : num  4.2 7 6 6.5 5.4 5.5 5.1 4.3 7.6 7.1 ...
#>  $ protein              : num  7.5 7.4 7.4 6.8 7.4 6.7 7.4 6.8 7.3 7.2 ...
#>  $ bilirubin            : num  0.5 0.6 0.2 0.5 0.7 0.8 0.4 0.6 1.2 1.2 ...
#>  $ phosphorus           : num  4.7 4.4 5.3 3.6 3.9 3.4 3.9 4.4 3.2 3 ...
#>  $ sodium               : num  136 140 139 138 138 136 139 140 140 139 ...
#>  $ potassium            : num  4.3 4.55 4.16 4.27 3.91 3.97 3.99 4.25 3.8 4.63 ...
#>  $ globulin             : num  2.9 2.9 3 2.6 2.8 2.5 3.2 2.3 2.7 2.6 ...
#>  $ calcium              : num  9.8 9.8 9.3 9.3 9.3 9.4 9.6 9.6 9.6 9.6 ...
#>  $ physical.work        : chr  "No" "No" "No" "No" ...
#>  $ physical.recreational: chr  "No" "No" "Yes" "No" ...
#>  $ diabetes             : chr  "Yes" "No" "No" "No" ...
#>  - attr(*, "na.action")= 'omit' Named int [1:3739] 3 4 5 6 8 9 13 14 15 16 ...
#>   ..- attr(*, "names")= chr [1:3739] "3" "4" "5" "6" ...

Describe the data

require(rms)
describe(analytic) 
#> analytic 
#> 
#>  33  Variables      1267  Observations
#> --------------------------------------------------------------------------------
#> ID 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0     1267        1    88660     3366    84250    84687 
#>      .25      .50      .75      .90      .95 
#>    86019    88692    91252    92670    93089 
#> 
#> lowest : 83732 83733 83741 83747 83750, highest: 93617 93633 93643 93659 93685
#> --------------------------------------------------------------------------------
#> gender 
#>        n  missing distinct 
#>     1267        0        2 
#>                         
#> Value      Female   Male
#> Frequency     496    771
#> Proportion  0.391  0.609
#> --------------------------------------------------------------------------------
#> age 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       61        1    49.91    19.18       24       27 
#>      .25      .50      .75      .90      .95 
#>       36       51       63       72       78 
#> 
#> lowest : 20 21 22 23 24, highest: 76 77 78 79 80
#> --------------------------------------------------------------------------------
#> born 
#>        n  missing distinct 
#>     1267        0        2 
#>                                                                             
#> Value      Born in 50 US states or Washingt                           Others
#> Frequency                               991                              276
#> Proportion                            0.782                            0.218
#> --------------------------------------------------------------------------------
#> race 
#>        n  missing distinct 
#>     1267        0        4 
#>                                               
#> Value         Black Hispanic    Other    White
#> Frequency       246      337      132      552
#> Proportion    0.194    0.266    0.104    0.436
#> --------------------------------------------------------------------------------
#> education 
#>        n  missing distinct 
#>     1267        0        3 
#>                                               
#> Value          College High.School      School
#> Frequency          648         523          96
#> Proportion       0.511       0.413       0.076
#> --------------------------------------------------------------------------------
#> married 
#>        n  missing distinct 
#>     1267        0        3 
#>                                                                    
#> Value                 Married      Never.married Previously.married
#> Frequency                 751                226                290
#> Proportion              0.593              0.178              0.229
#> --------------------------------------------------------------------------------
#> income 
#>        n  missing distinct 
#>     1267        0        4 
#>                                                                               
#> Value                  <25k Between.25kto54k Between.55kto99k         Over100k
#> Frequency               344              435              297              191
#> Proportion            0.272            0.343            0.234            0.151
#> --------------------------------------------------------------------------------
#> weight 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0     1184        1    48904    44337     9158    11549 
#>      .25      .50      .75      .90      .95 
#>    19540    30335    63822   121803   151546 
#> 
#> lowest :   5470.041   5948.955   6197.660   6480.947   6703.837
#> highest: 203562.855 207197.232 213611.345 218138.797 224891.623
#> --------------------------------------------------------------------------------
#> psu 
#>        n  missing distinct     Info     Mean      Gmd 
#>     1267        0        2     0.75    1.493   0.5003 
#>                       
#> Value          1     2
#> Frequency    642   625
#> Proportion 0.507 0.493
#> --------------------------------------------------------------------------------
#> strata 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       15    0.994    126.3    4.792      120      121 
#>      .25      .50      .75      .90      .95 
#>      123      126      130      132      133 
#> 
#> lowest : 119 120 121 122 123, highest: 129 130 131 132 133
#>                                                                             
#> Value        119   120   121   122   123   124   125   126   127   128   129
#> Frequency     47    74   118    63    77    66   114   104   107    65    53
#> Proportion 0.037 0.058 0.093 0.050 0.061 0.052 0.090 0.082 0.084 0.051 0.042
#>                                   
#> Value        130   131   132   133
#> Frequency     99   120    95    65
#> Proportion 0.078 0.095 0.075 0.051
#> --------------------------------------------------------------------------------
#> diastolicBP 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       41    0.997    70.37    13.99       52       54 
#>      .25      .50      .75      .90      .95 
#>       62       70       78       86       92 
#> 
#> lowest :   0  26  34  38  40, highest: 104 106 108 110 112
#> --------------------------------------------------------------------------------
#> systolicBP 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       56    0.998    126.5     19.3    102.0    106.0 
#>      .25      .50      .75      .90      .95 
#>    114.0    124.0    136.0    148.8    160.0 
#> 
#> lowest :  84  88  90  92  94, highest: 194 196 206 218 236
#> --------------------------------------------------------------------------------
#> bodyweight 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      615        1    84.95    23.56    56.29    61.10 
#>      .25      .50      .75      .90      .95 
#>    69.70    81.40    97.00   113.44   127.47 
#> 
#> lowest :  39.7  39.8  40.7  42.6  42.7, highest: 161.9 166.3 175.7 175.9 178.4
#> --------------------------------------------------------------------------------
#> bodyheight 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      376        1    169.2    10.66    153.8    157.0 
#>      .25      .50      .75      .90      .95 
#>    162.6    169.3    176.2    181.1    184.2 
#> 
#> lowest : 143.8 144.2 145.2 145.9 146.2, highest: 194.6 195.1 195.6 198.4 201.0
#> --------------------------------------------------------------------------------
#> bmi 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      284        1    29.58    7.403    20.60    22.06 
#>      .25      .50      .75      .90      .95 
#>    24.80    28.60    33.30    38.24    42.00 
#> 
#> lowest : 16.3 17.5 17.6 17.7 17.9, highest: 57.2 57.6 59.4 60.7 64.5
#> --------------------------------------------------------------------------------
#> waist 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      544        1    101.8    18.47     77.1     81.4 
#>      .25      .50      .75      .90      .95 
#>     90.5    100.3    111.2    122.8    132.5 
#> 
#> lowest :  65.0  65.5  66.5  68.2  68.7, highest: 159.2 159.8 160.2 160.5 161.5
#> --------------------------------------------------------------------------------
#> smoke 
#>        n  missing distinct 
#>     1267        0        3 
#>                                            
#> Value       Every.day Not.at.all  Some.days
#> Frequency         448        665        154
#> Proportion      0.354      0.525      0.122
#> --------------------------------------------------------------------------------
#> alcohol 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       14    0.952    3.109    2.419        1        1 
#>      .25      .50      .75      .90      .95 
#>        1        2        4        6        8 
#> 
#> lowest :  1  2  3  4  5, highest: 10 11 12 14 15
#>                                                                             
#> Value          1     2     3     4     5     6     7     8     9    10    11
#> Frequency    336   371   189   106    79    95    10    26     4    20     1
#> Proportion 0.265 0.293 0.149 0.084 0.062 0.075 0.008 0.021 0.003 0.016 0.001
#>                             
#> Value         12    14    15
#> Frequency     23     1     6
#> Proportion 0.018 0.001 0.005
#> --------------------------------------------------------------------------------
#> cholesterol 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      203        1    193.1    47.47    132.0    142.0 
#>      .25      .50      .75      .90      .95 
#>    162.5    191.0    217.0    248.0    268.0 
#> 
#> lowest :  81  93  97 100 101, highest: 345 348 349 358 545
#> --------------------------------------------------------------------------------
#> cholesterolM2 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      203        1    4.994    1.228    3.410    3.670 
#>      .25      .50      .75      .90      .95 
#>    4.205    4.940    5.610    6.410    6.930 
#> 
#> lowest :  2.09  2.40  2.51  2.59  2.61, highest:  8.92  9.00  9.03  9.26 14.09
#> --------------------------------------------------------------------------------
#> triglycerides 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      361        1    165.8    124.1     48.0     59.0 
#>      .25      .50      .75      .90      .95 
#>     84.0    127.0    201.5    309.0    396.6 
#> 
#> lowest :   18   21   24   25   31, highest:  964 1020 1157 1253 3061
#> --------------------------------------------------------------------------------
#> uric.acid 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       84        1    5.598    1.626     3.43     3.80 
#>      .25      .50      .75      .90      .95 
#>     4.60     5.50     6.50     7.40     8.00 
#> 
#> lowest :  1.6  2.2  2.3  2.4  2.5, highest: 10.2 10.3 11.7 12.2 18.0
#> --------------------------------------------------------------------------------
#> protein 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       32    0.995    7.126   0.5095      6.4      6.6 
#>      .25      .50      .75      .90      .95 
#>      6.8      7.1      7.4      7.7      7.9 
#> 
#> lowest : 5.7 5.8 5.9 6.0 6.1, highest: 8.4 8.5 8.6 8.8 9.0
#> --------------------------------------------------------------------------------
#> bilirubin 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       31    0.984   0.5467   0.2949      0.2      0.2 
#>      .25      .50      .75      .90      .95 
#>      0.4      0.5      0.7      0.9      1.0 
#> 
#> lowest : 0.00 0.01 0.02 0.03 0.04, highest: 1.80 2.00 2.10 2.60 3.30
#> --------------------------------------------------------------------------------
#> phosphorus 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       37    0.996    3.642    0.593      2.8      3.0 
#>      .25      .50      .75      .90      .95 
#>      3.3      3.6      4.0      4.3      4.5 
#> 
#> lowest : 1.8 2.0 2.2 2.3 2.4, highest: 5.2 5.3 5.4 5.6 6.1
#> --------------------------------------------------------------------------------
#> sodium 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       20    0.977    138.5    2.383      135      136 
#>      .25      .50      .75      .90      .95 
#>      137      139      140      141      142 
#> 
#> lowest : 124 126 129 130 131, highest: 142 143 144 146 148
#>                                                                             
#> Value        124   126   129   130   131   132   133   134   135   136   137
#> Frequency      1     1     1     1     5     4    11    23    46    93   176
#> Proportion 0.001 0.001 0.001 0.001 0.004 0.003 0.009 0.018 0.036 0.073 0.139
#>                                                                 
#> Value        138   139   140   141   142   143   144   146   148
#> Frequency    235   260   206   112    55    29     6     1     1
#> Proportion 0.185 0.205 0.163 0.088 0.043 0.023 0.005 0.001 0.001
#> --------------------------------------------------------------------------------
#> potassium 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0      175        1    3.985   0.3725     3.45     3.57 
#>      .25      .50      .75      .90      .95 
#>     3.78     3.98     4.19     4.40     4.54 
#> 
#> lowest : 2.60 2.92 2.96 3.07 3.09, highest: 5.15 5.21 5.36 5.37 5.51
#> --------------------------------------------------------------------------------
#> globulin 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       29    0.994    2.799   0.4536      2.2      2.3 
#>      .25      .50      .75      .90      .95 
#>      2.5      2.8      3.0      3.3      3.5 
#> 
#> lowest : 1.6 1.8 1.9 2.0 2.1, highest: 4.1 4.2 4.3 4.5 5.5
#> --------------------------------------------------------------------------------
#> calcium 
#>        n  missing distinct     Info     Mean      Gmd      .05      .10 
#>     1267        0       25    0.991    9.335   0.3786      8.8      8.9 
#>      .25      .50      .75      .90      .95 
#>      9.1      9.3      9.6      9.7      9.9 
#> 
#> lowest :  8.4  8.5  8.6  8.7  8.8, highest: 10.4 10.5 10.7 11.0 11.1
#> --------------------------------------------------------------------------------
#> physical.work 
#>        n  missing distinct 
#>     1267        0        2 
#>                       
#> Value         No   Yes
#> Frequency    895   372
#> Proportion 0.706 0.294
#> --------------------------------------------------------------------------------
#> physical.recreational 
#>        n  missing distinct 
#>     1267        0        2 
#>                       
#> Value         No   Yes
#> Frequency   1002   265
#> Proportion 0.791 0.209
#> --------------------------------------------------------------------------------
#> diabetes 
#>        n  missing distinct 
#>     1267        0        2 
#>                     
#> Value        No  Yes
#> Frequency  1064  203
#> Proportion 0.84 0.16
#> --------------------------------------------------------------------------------

Collinearity

Avoiding collinear variables can result in a more interpretable, stable, and efficient predictive model. Collinearity refers to a situation in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with substantial accuracy. Collinearity poses several issues for predictive analysis:

Reduced Interpretability: When predictor variables are highly correlated, it becomes challenging to isolate the impact of individual predictors on the response variable. In other words, it is difficult to determine which predictor is genuinely influential in explaining variance in the response variable. This reduces the interpretability of the model.

Unstable Coefficients: Collinearity can lead to inflated standard errors of the regression coefficients. This means that the coefficients can be very sensitive to small changes in the data, making the model unstable and less generalizable to new, unseen data.

Overfitting: When predictors are collinear, the model is more likely to fit to the noise in the data rather than the actual signal. This is a manifestation of overfitting, where the model becomes too complex and captures random noise. Overfitted models will perform poorly on new, unseen data.

Inefficiency: Including redundant variables (collinear variables) does not add additional information to the model. This could be inefficient, especially when dealing with large datasets, as it increases computational costs without improving model performance.

Multicollinearity Diagnostics

Several techniques are available for diagnosing multicollinearity, including:

  • Variance Inflation Factor (VIF): The VIF is a measure of multicollinearity, which measures how much the variance of a regression coefficient is increased because of multicollinearity. A general rule of thumb is that if VIF exceeds 4 or 5, multicollinearity is present.

  • Eigenvalues and Eigenvectors of the correlation/covariance matrix: The eigenvalues of the correlation matrix can also be used to measure the presence of multicollinearity, with one or more eigenvalues close to zero indicating multicollinearity.

  • Pairwise correlation matrices: Large correlation coefficients in the correlation matrix of predictors indicate the possibility of multicollinearity.

Remedies

Some common ways to handle collinearity include:

  • Removing one of the correlated predictors
  • Combining correlated predictors into a single composite predictor
  • Using regularization techniques like Ridge Regression, which can help handle collinearity by adding a penalty term. Ridge regression adds a small amount of bias to the regression estimates, which in turn reduces the standard error, addresses the collinearity issue to some extent, and makes the results more reliable.

Identify collinear predictors

Tip

We can also use hclust and varclus or variable clustering, i.e., to identify collinear predictors

hclust is the hierarchical clustering function where default is squared Spearman correlation coefficients to detect monotonic but nonlinear relationships.

require(Hmisc)
sel.names <- c("gender", "age", "born", "race", "education", 
               "married", "income", "diastolicBP", "systolicBP", 
               "bodyweight", "bodyheight", "bmi", "waist", "smoke",
               "alcohol",  "cholesterol", "triglycerides", 
               "uric.acid", "protein", "bilirubin", "phosphorus",
               "sodium", "potassium", "globulin", "calcium", 
               "physical.work", "physical.recreational", 
               "diabetes")
var.cluster <- varclus(~., data = analytic[sel.names])
# var.cluster
plot(var.cluster)

In this plot, Spearman correlation or Spearman’s \(\rho\)(rho) assesses the monotonic but nonlinear relationships predictors. We can also specify linear relationships such as Pearson correlation. A high correlation coefficient indicates predictors are highly correlated. For example, Spearman’s \(\rho\) is more than 0.8 for bodyweight, BMI, and waist, indicating high collinearity between these predictors.

Video content (optional)

Tip

For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.