Say we aim to develop a prediction model for diabetes (a binary variable) based on some sociodemographic and clinical risk factors.

We fit a logistic regression model as follows: `mod <- glm(diabetes ~ age + sex + race + education + triglycerides + protein + bilirubin + phosphorus + sodium + potassium + globulin + calcium, data = dat.train, family = binomial)`.

The predicted probabilities of diabetes are calculated as: `pred.diabetes <- predict(mod, type = 'response', newdata = dat.test)`. How would you calculate the area under the curve (AUC) value on the test data (`dat.test`)?
- A. pROC::roc(dat.train$diabetes, pred.diabetes)
- B. pROC::roc(mod)
- C. pROC::roc(dat.test$diabetes, pred.diabetes)
- D. pROC::roc(dat.test$diabetes)
- E. pROC::roc(dat.test$pred.diabetes)
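For reference, here is a minimal sketch of the full workflow, assuming `dat.train` and `dat.test` contain the variables named in the question; the test-set AUC is obtained by pairing the test outcomes with the test predictions:

```r
library(pROC)

# Fit the model on the training data (as in the question)
mod <- glm(diabetes ~ age + sex + race + education + triglycerides +
             protein + bilirubin + phosphorus + sodium + potassium +
             globulin + calcium,
           data = dat.train, family = binomial)

# Predicted probabilities on the held-out test set
pred.diabetes <- predict(mod, type = "response", newdata = dat.test)

# The test-set AUC pairs the TEST outcomes with the TEST predictions
roc.obj <- pROC::roc(dat.test$diabetes, pred.diabetes)
pROC::auc(roc.obj)
```

Note that pairing the training outcomes (`dat.train$diabetes`) with test-set predictions would mismatch observations and would not measure test-set performance.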
Say you aim to build a prediction model for CVD among Canadian adults using logistic regression. Which methods could be used to deal with model overfitting? (Select ALL that apply.)
- A. Model fitting on full dataset
- B. Selecting 20% of data
- C. Splitting the dataset into training and testing sets
- D. Leave-one-out cross-validation
- E. Increasing the number of predictors in the model
- F. 10-fold cross-validation
- G. Bootstrapping
Data splitting is one technique for obtaining optimism-corrected estimates of model performance, but other methods can also be used to deal with model overfitting, such as k-fold cross-validation and bootstrapping.
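As an illustration of one of these, here is a minimal sketch of 10-fold cross-validation for the AUC, assuming a single data frame `dat` with the binary outcome `diabetes` and a subset of the predictors above (the variable names are illustrative):

```r
library(pROC)

set.seed(123)
k <- 10
# Randomly assign each row of dat to one of k folds
folds <- sample(rep(1:k, length.out = nrow(dat)))

cv.auc <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]  # fit on the other k - 1 folds
  test  <- dat[folds == i, ]  # evaluate on the held-out fold
  fit   <- glm(diabetes ~ age + sex + race + education,
               data = train, family = binomial)
  pred  <- predict(fit, newdata = test, type = "response")
  as.numeric(pROC::auc(test$diabetes, pred))  # fold-specific AUC
})

mean(cv.auc)  # cross-validated (optimism-corrected) AUC
```

A bootstrap optimism correction follows a similar resample-refit-evaluate logic; for models fit with `rms::lrm(..., x = TRUE, y = TRUE)`, `rms::validate(fit, method = 'boot')` automates it.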