Ideas related to Prediction
- Overfitting: A modeling error that occurs when a model captures noise or random fluctuations in the training data instead of the underlying data distribution. An overfitted model performs well on training data but poorly on unseen data.
- Performance measures
- C-statistic: A measure of discrimination. Also known as the area under the ROC curve (AUC), it quantifies the ability of a model to discriminate between outcomes (e.g., dead vs. alive). A higher C-statistic indicates better discrimination.
- Calibration slope: A measure of how well predicted probabilities agree with observed outcomes. A calibration slope close to 1 indicates good calibration; a slope well below 1 suggests the predictions are too extreme, often a sign of overfitting.
- Model validation
- Sample splitting: Dividing a dataset into separate parts, typically training and testing sets, so that a model can be built on one part and evaluated on the other. This helps in assessing how well the model generalizes to new, unseen data (a code sketch combining splitting, the C-statistic, and the calibration slope appears after this list).
Figure: Diagram illustrating the concept of sample splitting.
- Cross-validation: A technique for evaluating machine learning models by training them on different subsets of the data and validating them on the complementary subsets, so that the measured performance is consistent rather than dependent on one particular split of the data (a cross-validation sketch also appears below).
Figure: Diagram illustrating the concept of cross-validation.
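The following is a minimal sketch of how sample splitting, the C-statistic, and the calibration slope fit together, using scikit-learn on a synthetic dataset; the data, model, and parameter choices are illustrative assumptions, not taken from the sources cited here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data standing in for a clinical dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# Sample splitting: hold out 30% of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_test = model.predict_proba(X_test)[:, 1]

# C-statistic (AUC): the probability that a randomly chosen subject with the
# outcome receives a higher predicted risk than one without it.
print("C-statistic:", roc_auc_score(y_test, p_test))

# Calibration slope: regress the observed outcome on the logit of the
# predicted probabilities; a slope near 1 indicates good calibration.
# A very large C makes this refit effectively unpenalized.
logit_p = np.log(p_test / (1 - p_test)).reshape(-1, 1)
cal_fit = LogisticRegression(C=1e6, max_iter=1000).fit(logit_p, y_test)
print("Calibration slope:", cal_fit.coef_[0, 0])
```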
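And a companion sketch of 5-fold cross-validation on the same kind of synthetic data; again, the specifics are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# Each of the 5 folds serves once as the validation set while the model is
# trained on the remaining folds; scoring by AUC (the C-statistic).
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc"
)
print("Per-fold AUC:", scores)
print("Mean AUC:", scores.mean())
```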
- More ML terms (Karim 2021):
- Shrinkage: A technique in regression analysis that shrinks the regression coefficients toward zero by adding a penalty term to the model. This helps prevent overfitting by discouraging overly complex models (the penalty strength is itself a hyperparameter; see the first sketch after the figure below).
- Hyperparameters: Parameters whose values are set before the learning process begins. They control the behavior of the learning algorithm and can be tuned to improve model performance (e.g., learning rate, number of trees in a random forest).
- Variable importance measures: Metrics used to assess the contribution of each feature to predicting the outcome. These measures help in understanding which variables have the most impact on the model's predictions (see the permutation-importance sketch below).
- Ensemble learning: A technique that combines predictions from multiple models to produce a single, improved prediction. The figure below illustrates one such approach, super learning, in which a meta-learner combines the predictions of several base learners (a stacking sketch follows the figure).
Figure: Diagram illustrating the concept of super learning.
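Below is a minimal sketch of shrinkage via a ridge (L2) penalty, with the penalty strength alpha treated as a hyperparameter and tuned by cross-validated grid search; the grid values and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)

# alpha controls how strongly coefficients are shrunk toward zero. It is set
# before fitting (a hyperparameter), so we tune it with 5-fold CV.
grid = GridSearchCV(
    Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5
)
grid.fit(X, y)
print("Best alpha:", grid.best_params_["alpha"])
print("Best CV score (R^2):", grid.best_score_)
```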
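Next, a sketch of one common variable importance measure, permutation importance, applied to a random forest; the feature counts and model settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much test performance drops;
# larger drops indicate more important variables.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```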
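Finally, a sketch of a stacked ensemble, the idea underlying super learning: a meta-learner is trained on cross-validated predictions from several base learners. The particular base learners chosen here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

base_learners = [
    ("logit", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
]

# The final_estimator learns how to weight the base learners' out-of-fold
# predictions, often improving on any single base learner.
stack = StackingClassifier(
    estimators=base_learners, final_estimator=LogisticRegression(), cv=5
)
print("Stacked AUC:", cross_val_score(stack, X, y, cv=5, scoring="roc_auc").mean())
```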
- Data pre-processing
- Multi-collinearity: A situation in which two or more independent variables in a model are highly correlated, making it difficult to isolate the individual effect of each predictor on the dependent variable. This can lead to unreliable and unstable estimates of regression coefficients (a variance-inflation-factor sketch appears after this list).
- Missing data handling
- Multiple imputation (Karim and Epi-OER team 2024): A statistical technique in which each missing value is replaced with a set of plausible values, reflecting the uncertainty about the right value to impute. The process is repeated to create several complete datasets, which are analyzed separately; the results are then combined to produce estimates and confidence intervals that account for missing-data uncertainty (see the sketch after the figure below).
Figure: Diagram illustrating the concept of multiple imputation.
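As a diagnostic for multi-collinearity, the sketch below computes variance inflation factors (VIF) with statsmodels on synthetic data in which one predictor is nearly a copy of another; the rule of thumb that VIF above roughly 5 to 10 signals trouble is a common convention, not a hard threshold.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1: highly collinear
x3 = rng.normal(size=n)                  # independent predictor

X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

# VIF for each predictor (the intercept column is part of the design matrix
# but not reported); x1 and x2 should show very large VIFs.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```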
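And a minimal sketch of multiple imputation using scikit-learn's IterativeImputer (an experimental API) run several times with different seeds to produce multiple completed datasets. The pooled quantity here (a column mean) is a toy stand-in for a real analysis; in practice the pooling would follow Rubin's rules.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # introduce roughly 10% missing values

# Create m = 5 completed datasets; sample_posterior=True injects the
# between-imputation variability that multiple imputation requires.
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(X)
    estimates.append(completed[:, 0].mean())  # analyze each completed dataset

# Combine the per-dataset estimates into a single pooled estimate.
print("Pooled estimate:", np.mean(estimates))
```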
References
Karim, M. Ehsan. 2021. “Understanding Basics and Usage of Machine Learning in Medical Literature.” https://ehsanx.github.io/into2ML/.
Karim, M. Ehsan, and Epi-OER team. 2024. “Advanced Epidemiological Methods.” https://ehsanx.github.io/EpiMethods/.