Data Splitting
This tutorial is focused on a crucial aspect of model building: splitting your data into training and test sets to avoid overfitting. Overfitting occurs when your model learns the noise in the data rather than the underlying trend. As a result, the model performs well on the training data but poorly on new, unseen data. To mitigate this, you split your data into a training set used to fit the model and a test set used to evaluate it on data the model has never seen.
Load data and files
Initially, several libraries are loaded to facilitate data manipulation and analysis.
Then, the previously saved dataset related to cholesterol and other factors is loaded for further use.
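The exact libraries and file paths depend on your earlier setup; a minimal sketch of this step, assuming the caret package is used for data splitting and that the dataset was saved as analytic3.rds (the file name is an assumption), might look like this:
# Libraries for data manipulation and model building
library(caret)   # provides createDataPartition(), used below

# Load the previously saved cholesterol dataset (file name is an assumption)
analytic3 <- readRDS("analytic3.rds")
dim(analytic3)
#> [1] 2632   35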
Data splitting to avoid model overfitting
You start by setting a random seed to ensure that the random splitting of data is reproducible. A specified function is then used to partition the data, taking as arguments the outcome variable (cholesterol level in this case) and the percentage of data that you want to allocate to the training set (70% in this example).
We can use the createDataPartition() function from the caret package to split a dataset into training and testing sets. The function returns the row indices that should go into the training set. These indices are stored in a variable, and their dimensions are displayed to give a sense of the size of the training set that will be created. Additionally, you can calculate what 70% of the entire dataset would look like to verify the approximate size of the training data, as well as what the remaining 30% (the test set) would look like.
# Using a seed to randomize in a reproducible way
set.seed(123)
split <- createDataPartition(y = analytic3$cholesterol, p = 0.7, list = FALSE)
str(split)
#> int [1:1844, 1] 3 4 5 8 9 13 14 16 20 21 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr "Resample1"
dim(split)
#> [1] 1844 1
# Approximate train data
dim(analytic3)*.7
#> [1] 1842.4 24.5
# Approximate test data
dim(analytic3)*(1-.7)
#> [1] 789.6 10.5
Split the data
After determining how to partition the data, the next step is actually creating the training and test datasets. The indices are used to subset the original dataset into these two new datasets. The dimensions of each dataset are displayed to confirm their sizes.
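A minimal sketch of this step, assuming the partition indices are stored in split as above (the object names train.data and test.data match those used in the model-fitting code below): the rows listed in split go into the training set, and all remaining rows go into the test set.
# Rows whose indices appear in 'split' form the training set
train.data <- analytic3[split, ]

# All remaining rows form the test set
test.data <- analytic3[-split, ]

dim(train.data)
#> [1] 1844   35
dim(test.data)
#> [1]  788   35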
Our next task is to fit the model (e.g., linear regression) on the training set and evaluate the performance on the test set.
Train the model
Once the training dataset is created, you can train the model on it. A previously defined formula containing the predictor variables is used in a linear regression model. After fitting the model, a summary is generated to display key statistics that help in evaluating the model’s performance.
formula4
#> cholesterol ~ gender + age + born + race + education + married +
#> income + diastolicBP + systolicBP + bmi + triglycerides +
#> uric.acid + protein + bilirubin + phosphorus + sodium + potassium +
#> globulin + calcium + physical.work + physical.recreational +
#> diabetes
fit4.train1 <- lm(formula4, data = train.data)
summary(fit4.train1)
#>
#> Call:
#> lm(formula = formula4, data = train.data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -91.973 -23.719 -1.563 20.586 178.542
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 72.716792 59.916086 1.214 0.22504
#> genderMale -11.293629 2.136545 -5.286 1.40e-07 ***
#> age 0.306235 0.066376 4.614 4.23e-06 ***
#> bornOthers 7.220858 2.300658 3.139 0.00172 **
#> raceHispanic -6.727473 2.709718 -2.483 0.01313 *
#> raceOther -4.865771 3.237066 -1.503 0.13298
#> raceWhite -1.468522 2.494981 -0.589 0.55621
#> educationHigh.School 1.626097 1.920289 0.847 0.39722
#> educationSchool -4.853095 3.585185 -1.354 0.17602
#> marriedNever.married -5.298265 2.332033 -2.272 0.02321 *
#> marriedPreviously.married 1.202448 2.305191 0.522 0.60199
#> incomeBetween.25kto54k -1.736495 2.360385 -0.736 0.46202
#> incomeBetween.55kto99k 0.170505 2.565896 0.066 0.94703
#> incomeOver100k 1.712359 2.860226 0.599 0.54946
#> diastolicBP 0.355813 0.074380 4.784 1.86e-06 ***
#> systolicBP 0.037464 0.059848 0.626 0.53140
#> bmi -0.282881 0.139160 -2.033 0.04222 *
#> triglycerides 0.123797 0.007613 16.261 < 2e-16 ***
#> uric.acid 1.006499 0.712871 1.412 0.15815
#> protein 1.721623 3.468969 0.496 0.61975
#> bilirubin -6.143411 3.006858 -2.043 0.04118 *
#> phosphorus 0.093824 1.575489 0.060 0.95252
#> sodium -0.604286 0.400694 -1.508 0.13170
#> potassium -0.583525 2.715189 -0.215 0.82986
#> globulin -0.278970 3.614404 -0.077 0.93849
#> calcium 15.679677 3.054968 5.133 3.17e-07 ***
#> physical.workYes -1.099540 1.960321 -0.561 0.57494
#> physical.recreationalYes 0.834737 1.953960 0.427 0.66928
#> diabetesYes -19.932101 2.580138 -7.725 1.83e-14 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 34.68 on 1815 degrees of freedom
#> Multiple R-squared: 0.2433, Adjusted R-squared: 0.2316
#> F-statistic: 20.84 on 28 and 1815 DF, p-value: < 2.2e-16
Extract performance measures
You can use a saved function to measure the performance of the trained model. The function returns performance metrics such as R-squared, adjusted R-squared, the residual standard error, and the AIC and BIC. This function is applied not just to the training data but also to the test data, the full dataset, and a separate, fictitious dataset.
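The perform() function itself was saved in an earlier step and is not reproduced here. For readers who do not have it, the following is a rough reconstruction based on the columns reported in the output below (SSE, SST, R-squared, AIC, BIC, and so on); the name perform_sketch and its internal formulas are assumptions, not the saved function:
# Rough reconstruction of a performance function (assumed formulas, not the saved perform())
perform_sketch <- function(new.data, y.name, model.fit) {
  y    <- new.data[[y.name]]
  yhat <- predict(model.fit, newdata = new.data)   # predictions on the supplied data
  n    <- length(y)
  p    <- length(coef(model.fit))                  # number of estimated coefficients
  df.residual <- n - p
  SSE  <- sum((y - yhat)^2)                        # residual sum of squares
  SST  <- sum((y - mean(y))^2)                     # total sum of squares
  R2   <- 1 - SSE / SST
  adjR2 <- 1 - (1 - R2) * (n - 1) / (n - p)
  sigma <- sqrt(SSE / df.residual)                 # residual standard error
  logLik <- -n / 2 * (log(2 * pi) + log(SSE / n) + 1)
  AIC  <- -2 * logLik + 2 * (p + 1)                # +1 for the error variance
  BIC  <- -2 * logLik + log(n) * (p + 1)
  cbind(n, p, df.residual, SSE, SST, R2, adjR2, sigma, logLik, AIC, BIC)
}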
Below, we use the perform() function that we saved to evaluate the model’s performance on each dataset.
perform(new.data = train.data,y.name = "cholesterol", model.fit = fit4.train1)
#> n p df.residual SSE SST R2 adjR2 sigma logLik AIC
#> [1,] 1844 29 1815 2182509 2884109 0.243 0.232 34.677 -9140.98 18341.96
#> BIC
#> [1,] 18507.55
perform(new.data = test.data,y.name = "cholesterol", model.fit = fit4.train1)
#> n p df.residual SSE SST R2 adjR2 sigma logLik AIC
#> [1,] 788 29 759 1057454 1372214 0.229 0.201 37.326 -3955.936 7971.873
#> BIC
#> [1,] 8111.958
perform(new.data = analytic3,y.name = "cholesterol", model.fit = fit4.train1)
#> n p df.residual SSE SST R2 adjR2 sigma logLik AIC
#> [1,] 2632 29 2603 3239962 4256586 0.239 0.231 35.28 -13098.82 26257.64
#> BIC
#> [1,] 26433.91
perform(new.data = fictitious.data,y.name = "cholesterol", model.fit = fit4.train1)
#> n p df.residual SSE SST R2 adjR2 sigma logLik AIC
#> [1,] 4121 29 4092 5306559 6912485 0.232 0.227 36.011 -20601.92 41263.84
#> BIC
#> [1,] 41453.55
Evaluating the model’s performance on the test data provides insights into how well the model will generalize to new, unseen data. Comparing the performance metrics across different datasets can give you a robust view of your model’s predictive power and reliability.
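If you want to compare these metrics side by side, one option (assuming, as the output above suggests, that perform() returns a one-row matrix with named columns) is to bind the results into a single table:
# Combine the performance metrics for the training, test, and full datasets
results <- rbind(
  perform(new.data = train.data, y.name = "cholesterol", model.fit = fit4.train1),
  perform(new.data = test.data,  y.name = "cholesterol", model.fit = fit4.train1),
  perform(new.data = analytic3,  y.name = "cholesterol", model.fit = fit4.train1)
)
rownames(results) <- c("train", "test", "full")

# Focus on a few key columns for comparison
round(results[, c("R2", "adjR2", "sigma")], 3)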
For more on model training and tuning, see Kuhn (2023b).