Data Splitting

This tutorial focuses on a crucial aspect of model building: splitting your data into training and test sets to guard against overfitting. Overfitting occurs when a model learns the noise in the data rather than the underlying trend; as a result, it performs well on the training data but poorly on new, unseen data. To mitigate this, you split the data, fit the model on one part, and evaluate it on the part the model has never seen.
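The basic idea can be sketched with base R alone. This is a minimal, self-contained illustration on toy data (the data frame and variable names here are hypothetical, not from the dataset used below); the rest of the tutorial uses caret instead.

```r
# Minimal train/test split sketch using base R (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))  # toy data

# Randomly pick 70% of row indices for the training set
train.idx <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))

train <- df[train.idx, ]   # 70 rows
test  <- df[-train.idx, ]  # the remaining 30 rows
```

The model is then fit on `train` only, and `test` is held back for evaluation.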

Load data and files

Initially, several libraries are loaded to facilitate data manipulation and analysis.

# Load required packages
library(caret)
library(knitr)
library(Publish)
library(car)
library(DescTools)

Then, a previously saved dataset related to cholesterol and other factors is loaded for further use.

load(file="Data/predictivefactors/cholesterolNHANES15part2.RData")

Data splitting to avoid model overfitting

You start by setting a random seed to ensure that the random split of the data is reproducible. A partitioning function is then used to split the data, taking as arguments the outcome variable (cholesterol level in this case) and the proportion of data to allocate to the training set (70% in this example). For background on data splitting practices, see KDnuggets (2023) and Kuhn (2023a).

Tip

We can use the createDataPartition function to split a dataset into training and testing datasets. The function will return the row indices that should go into the training set. These indices are stored in a variable, and its dimensions are displayed to provide an understanding of the size of the training set that will be created. Additionally, you can calculate what 70% of your entire dataset would look like to verify the approximation of the training data size, as well as what the remaining 30% (for the test set) would look like.

# Using a seed to randomize in a reproducible way 
set.seed(123)
split <- createDataPartition(y = analytic3$cholesterol, p = 0.7, list = FALSE)
str(split)
#>  int [1:1844, 1] 3 4 5 8 9 13 14 16 20 21 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr "Resample1"
dim(split)
#> [1] 1844    1

# Approximate train data
dim(analytic3)*.7 
#> [1] 1842.4   24.5

# Approximate test data
dim(analytic3)*(1-.7) 
#> [1] 789.6  10.5

Split the data

After determining how to partition the data, the next step is actually creating the training and test datasets. The indices are used to subset the original dataset into these two new datasets. The dimensions of each dataset are displayed to confirm their sizes.

# Create train data
train.data <- analytic3[split,]
dim(train.data)
#> [1] 1844   35

# Create test data
test.data <- analytic3[-split,]
dim(test.data)
#> [1] 788  35

Our next task is to fit the model (e.g., linear regression) on the training set and evaluate the performance on the test set.

Train the model

Once the training dataset is created, you can proceed to train the model using the training data. A previously defined formula containing the predictor variables is used in a linear regression model. After fitting the model, a summary is generated to display key statistics that help in evaluating the model’s performance.

formula4
#> cholesterol ~ gender + age + born + race + education + married + 
#>     income + diastolicBP + systolicBP + bmi + triglycerides + 
#>     uric.acid + protein + bilirubin + phosphorus + sodium + potassium + 
#>     globulin + calcium + physical.work + physical.recreational + 
#>     diabetes
fit4.train1 <- lm(formula4, data = train.data)
summary(fit4.train1)
#> 
#> Call:
#> lm(formula = formula4, data = train.data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -91.973 -23.719  -1.563  20.586 178.542 
#> 
#> Coefficients:
#>                             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                72.716792  59.916086   1.214  0.22504    
#> genderMale                -11.293629   2.136545  -5.286 1.40e-07 ***
#> age                         0.306235   0.066376   4.614 4.23e-06 ***
#> bornOthers                  7.220858   2.300658   3.139  0.00172 ** 
#> raceHispanic               -6.727473   2.709718  -2.483  0.01313 *  
#> raceOther                  -4.865771   3.237066  -1.503  0.13298    
#> raceWhite                  -1.468522   2.494981  -0.589  0.55621    
#> educationHigh.School        1.626097   1.920289   0.847  0.39722    
#> educationSchool            -4.853095   3.585185  -1.354  0.17602    
#> marriedNever.married       -5.298265   2.332033  -2.272  0.02321 *  
#> marriedPreviously.married   1.202448   2.305191   0.522  0.60199    
#> incomeBetween.25kto54k     -1.736495   2.360385  -0.736  0.46202    
#> incomeBetween.55kto99k      0.170505   2.565896   0.066  0.94703    
#> incomeOver100k              1.712359   2.860226   0.599  0.54946    
#> diastolicBP                 0.355813   0.074380   4.784 1.86e-06 ***
#> systolicBP                  0.037464   0.059848   0.626  0.53140    
#> bmi                        -0.282881   0.139160  -2.033  0.04222 *  
#> triglycerides               0.123797   0.007613  16.261  < 2e-16 ***
#> uric.acid                   1.006499   0.712871   1.412  0.15815    
#> protein                     1.721623   3.468969   0.496  0.61975    
#> bilirubin                  -6.143411   3.006858  -2.043  0.04118 *  
#> phosphorus                  0.093824   1.575489   0.060  0.95252    
#> sodium                     -0.604286   0.400694  -1.508  0.13170    
#> potassium                  -0.583525   2.715189  -0.215  0.82986    
#> globulin                   -0.278970   3.614404  -0.077  0.93849    
#> calcium                    15.679677   3.054968   5.133 3.17e-07 ***
#> physical.workYes           -1.099540   1.960321  -0.561  0.57494    
#> physical.recreationalYes    0.834737   1.953960   0.427  0.66928    
#> diabetesYes               -19.932101   2.580138  -7.725 1.83e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 34.68 on 1815 degrees of freedom
#> Multiple R-squared:  0.2433, Adjusted R-squared:  0.2316 
#> F-statistic: 20.84 on 28 and 1815 DF,  p-value: < 2.2e-16
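If you want individual statistics from the fitted object rather than the printed summary, standard `lm` accessors can pull them out programmatically. The sketch below uses the built-in `mtcars` dataset so it is self-contained; the same accessors apply directly to `fit4.train1`.

```r
# Illustrated on a built-in dataset; the same calls work on fit4.train1
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$sigma          # residual standard error

coef(fit)["wt"]     # a single coefficient
confint(fit, "wt")  # its 95% confidence interval
AIC(fit)            # Akaike information criterion
```

For example, `summary(fit4.train1)$adj.r.squared` returns the adjusted R-squared reported above.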

Extract performance measures

You can use a saved function to measure the performance of the trained model. The function will return performance metrics like R-squared, RMSE, etc. This function is applied not just to the training data but also to the test data, the full dataset, and a separate, fictitious dataset.

Tip

Below, we use the perform function that we saved earlier to evaluate the model's performance on each dataset.

perform(new.data = train.data,y.name = "cholesterol", model.fit = fit4.train1)
#>         n  p df.residual     SSE     SST    R2 adjR2  sigma   logLik      AIC
#> [1,] 1844 29        1815 2182509 2884109 0.243 0.232 34.677 -9140.98 18341.96
#>           BIC
#> [1,] 18507.55
perform(new.data = test.data,y.name = "cholesterol", model.fit = fit4.train1)
#>        n  p df.residual     SSE     SST    R2 adjR2  sigma    logLik      AIC
#> [1,] 788 29         759 1057454 1372214 0.229 0.201 37.326 -3955.936 7971.873
#>           BIC
#> [1,] 8111.958
perform(new.data = analytic3,y.name = "cholesterol", model.fit = fit4.train1)
#>         n  p df.residual     SSE     SST    R2 adjR2 sigma    logLik      AIC
#> [1,] 2632 29        2603 3239962 4256586 0.239 0.231 35.28 -13098.82 26257.64
#>           BIC
#> [1,] 26433.91
perform(new.data = fictitious.data,y.name = "cholesterol", model.fit = fit4.train1)
#>         n  p df.residual     SSE     SST    R2 adjR2  sigma    logLik      AIC
#> [1,] 4121 29        4092 5306559 6912485 0.232 0.227 36.011 -20601.92 41263.84
#>           BIC
#> [1,] 41453.55
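The perform function itself was saved in an earlier part of this tutorial. As a rough guide to what such a helper might compute (this sketch is an assumption about its internals, not the saved code), the core metrics can be derived from the predictions on whichever dataset is supplied:

```r
# Hypothetical sketch of a perform-like helper (not the saved function):
# given new data, an outcome name, and a fitted model, compute fit statistics
perform.sketch <- function(new.data, y.name, model.fit) {
  y     <- new.data[[y.name]]
  yhat  <- predict(model.fit, newdata = new.data)
  n     <- length(y)
  p     <- length(coef(model.fit))  # number of model parameters
  SSE   <- sum((y - yhat)^2)        # residual sum of squares on new.data
  SST   <- sum((y - mean(y))^2)     # total sum of squares
  R2    <- 1 - SSE / SST
  adjR2 <- 1 - (1 - R2) * (n - 1) / (n - p)
  sigma <- sqrt(SSE / (n - p))      # residual standard error analogue
  c(n = n, p = p, df.residual = n - p, SSE = SSE, SST = SST,
    R2 = R2, adjR2 = adjR2, sigma = sigma)
}
```

Applied to train.data this reproduces the in-sample R-squared from summary(); applied to test.data or any other dataset, it measures out-of-sample fit, since the predictions come from a model that never saw those rows.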

Evaluating the model’s performance on the test data provides insights into how well the model will generalize to new, unseen data. Comparing the performance metrics across different datasets can give you a robust view of your model’s predictive power and reliability.

For more on model training and tuning, see Kuhn (2023b)

References

KDnuggets. 2023. “Dataset Splitting Best Practices in Python.” https://www.kdnuggets.com/2020/05/dataset-splitting-best-practices-python.html.
Kuhn, Max. 2023a. “Data Splitting.” https://topepo.github.io/caret/data-splitting.html.
———. 2023b. “Model Training and Tuning.” https://topepo.github.io/caret/model-training-and-tuning.html.