Cross-validation

Cross-validation is another important technique for assessing the performance of machine learning models and mitigating the risk of overfitting. This tutorial focuses on k-fold cross-validation as a strategy for obtaining a more generalized and robust assessment of a model's performance. It shows both manual calculations for individual folds and an automated approach using the caret package. This ensures that the model is not simply fitted well to one specific subset of the data but performs well in a general sense.

# Load required packages
library(caret)
library(knitr)
library(Publish)
library(car)
library(DescTools)

Load data

Load the data saved at the end of the previous part of the lab.

load(file="Data/predictivefactors/cholesterolNHANES15part2.RData")

k-fold cross-validation

See Wikipedia (2023)

We set the number of folds to 5 (k = 5) and use a random seed for reproducibility. The createFolds function creates the folds; the split is based on the outcome (cholesterol), so the folds contain approximately equal numbers of observations with similar cholesterol distributions. Because returnTrain = TRUE, the resulting list holds the training indices for each fold.

We can also check the approximate sizes of the training and test sets: with k = 5, each training set contains about four-fifths of the rows and each test set about one-fifth. The length of the index vector in each fold confirms the actual training-set sizes.

k = 5
dim(analytic3)
#> [1] 2632   35
set.seed(567)

# Create folds (based on the outcome)
folds <- createFolds(analytic3$cholesterol, k = k, list = TRUE, 
                     returnTrain = TRUE)
mode(folds)
#> [1] "list"

# Approximate training data size (only the row count is meaningful here)
dim(analytic3)*4/5
#> [1] 2105.6   28.0

# Approximate test data size (only the row count is meaningful here)
dim(analytic3)/5  
#> [1] 526.4   7.0

length(folds[[1]])
#> [1] 2105
length(folds[[2]])
#> [1] 2107
length(folds[[3]])
#> [1] 2106
length(folds[[4]])
#> [1] 2105
length(folds[[5]])
#> [1] 2105

str(folds[[1]])
#>  int [1:2105] 1 3 5 6 8 10 11 12 13 14 ...
str(folds[[2]])
#>  int [1:2107] 1 2 3 4 5 6 7 8 9 12 ...
str(folds[[3]])
#>  int [1:2106] 2 4 5 7 8 9 10 11 12 14 ...
str(folds[[4]])
#>  int [1:2105] 1 2 3 4 6 7 8 9 10 11 ...
str(folds[[5]])
#>  int [1:2105] 1 2 3 4 5 6 7 9 10 11 ...

Calculation for Fold 1

The first fold is used as an example. The training indices for the first fold are extracted and used to split the main data set into training and test sets for that fold. A linear regression model is then fitted on the training data, predictions are made on the test set, and the model's performance is evaluated with the same perform() function used earlier in the lab.

fold.index <- 1
fold1.train.ids <- folds[[fold.index]]
head(fold1.train.ids)
#> [1]  1  3  5  6  8 10

fold1.train <- analytic3[fold1.train.ids,]
fold1.test <- analytic3[-fold1.train.ids,]
formula4
#> cholesterol ~ gender + age + born + race + education + married + 
#>     income + diastolicBP + systolicBP + bmi + triglycerides + 
#>     uric.acid + protein + bilirubin + phosphorus + sodium + potassium + 
#>     globulin + calcium + physical.work + physical.recreational + 
#>     diabetes

model.fit <- lm(formula4, data = fold1.train)
predictions <- predict(model.fit, newdata = fold1.test)

perform(new.data=fold1.test, y.name = "cholesterol", 
        model.fit = model.fit)
#>        n  p df.residual      SSE      SST    R2 adjR2  sigma    logLik      AIC
#> [1,] 527 29         498 637317.5 830983.2 0.233  0.19 35.774 -2618.471 5296.942
#>           BIC
#> [1,] 5424.958

Calculation for Fold 2

The same process is repeated for the second fold. This way, you can manually evaluate how the model performs on different subsets of the data, making the performance assessment more robust. A short loop that repeats the process for all five folds is sketched after this fold's output.

fold.index <- 2
fold2.train.ids <- folds[[fold.index]]
head(fold2.train.ids)
#> [1] 1 2 3 4 5 6

fold2.train <- analytic3[fold2.train.ids,]
fold2.test <- analytic3[-fold2.train.ids,]

model.fit <- lm(formula4, data = fold2.train)

predictions <- predict(model.fit, newdata = fold2.test)
perform(new.data=fold2.test, y.name = "cholesterol", 
        model.fit = model.fit)
#>        n  p df.residual    SSE      SST    R2 adjR2  sigma    logLik      AIC
#> [1,] 525 29         496 615243 785326.6 0.217 0.172 35.219 -2600.282 5260.564
#>           BIC
#> [1,] 5388.466

Using the caret package to automate

See Kuhn (2023)

Instead of manually running the process for each fold, the caret package can be used to automate k-fold cross-validation. A control object is set up to specify 5-fold cross-validation, and the train function from caret then fits the linear regression model on each training partition and evaluates it on the corresponding held-out fold.

After fitting, the resampling results contain performance metrics (RMSE, R-squared, and MAE) for each fold. Calculating the mean and standard deviation of these metrics gives an overall sense of the model's performance and of how much it varies across folds.

Additionally, an adjusted R-squared can be calculated to account for the number of predictors in the model, giving a more accurate sense of its explanatory power when many predictors are used.

# Using the caret package
set.seed(567)

# make a 5-fold CV
ctrl <- trainControl(method = "cv", number = 5)

# fit the model with formula = formula4
# use training method lm
fit4.cv <- train(formula4, trControl = ctrl,
               data = analytic3, method = "lm")
fit4.cv
#> Linear Regression 
#> 
#> 2632 samples
#>   22 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 2106, 2105, 2106, 2105, 2106 
#> Resampling results:
#> 
#>   RMSE      Rsquared   MAE     
#>   35.62758  0.2194187  27.85731
#> 
#> Tuning parameter 'intercept' was held constant at a value of TRUE

# extract results from each test fold
summary.res <- fit4.cv$resample
summary.res
mean(fit4.cv$resample$Rsquared)
#> [1] 0.2194187
sd(fit4.cv$resample$Rsquared)
#> [1] 0.02755561

# # extract adj R2
# k <- 5
# p <- 2   # placeholder; in practice use the number of fitted model coefficients
# n <- round(nrow(analytic3)/k)
# summary.res$adjR2 <- 1-(1-fit4.cv$resample$Rsquared)*
#  ((n-1)/(n-p))
# summary.res
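
One possible way to fill in the commented-out chunk above is sketched below. It takes p as the number of coefficients in the final model fitted by caret (an assumption; the placeholder code above leaves p as a rough placeholder) and n as the approximate size of each test fold, and then applies the adjusted R-squared formula from the placeholder code to the resampled R-squared values.

# A possible completion of the adjusted R-squared calculation
p <- length(coef(fit4.cv$finalModel))   # number of fitted coefficients
n <- round(nrow(analytic3)/k)           # approximate test-fold size
summary.res$adjR2 <- 1 - (1 - summary.res$Rsquared) * ((n - 1)/(n - p))
summary.res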

References

Kuhn, Max. 2023. “Model Training and Tuning.” https://topepo.github.io/caret/model-training-and-tuning.html.
Wikipedia. 2023. “Cross-Validation (Statistics).” https://en.wikipedia.org/wiki/Cross-validation_(statistics).