Machine learning (ML)

Background

This chapter comprises a series of tutorials that sequentially explore various facets of predictive modeling and machine learning, building on a previous chapter. It begins with a tutorial that revisits the application of regression for predicting continuous outcomes, underscoring the importance of understanding prediction error and overfitting and introducing foundational machine learning concepts. The subsequent tutorial emphasizes the pivotal role of data splitting in predictive modeling, illustrating how to partition data into training and test sets and evaluate model performance across different data scenarios. Next, the concept of cross-validation is explored, detailing the k-fold method and demonstrating its implementation both manually and with the caret package. Another tutorial works through predicting binary outcomes using logistic regression, evaluating model performance with various metrics, and employing k-fold cross-validation.

The series then delves into supervised learning, exploring regularization techniques, decision trees, and ensemble methods, while employing various model evaluation metrics and cross-validation techniques. Unsupervised learning is then introduced with a focus on the k-means clustering algorithm, covering its implementation, how to determine the optimal number of clusters, and associated challenges. The chapter closes with applied tutorials that fit machine learning models to health survey data from the National Health Interview Survey (NHIS), including a replication of results from a published article. Throughout, the tutorials provide practical examples, code snippets, and visual aids, offering a comprehensive and applied exploration of predictive modeling and machine learning concepts.

Important

Datasets:

All of the datasets used in this chapter’s tutorials can be accessed from this GitHub repository folder.

Overview of tutorials

Building upon the foundational concepts introduced in a previous chapter about prediction research, this chapter takes our understanding to the next level. We have already explored the fundamentals of propensity score methods, and now it is time to harness the power of machine learning and prediction techniques within a causal inference context, a journey that will ultimately lead us to doubly robust estimation methods, such as TMLE.

Before we embark on this exciting journey, we will bridge the gap by dedicating this chapter to a deeper exploration of machine learning. By discussing various types of machine learning algorithms and diving into the intricacies of predictive modeling, we aim to provide you with a robust foundation.

Revisiting: Explore Relationships for Continuous Outcomes

In this tutorial, the focus is on using regression to predict continuous outcomes, specifically employing multiple linear regression to construct an initial prediction model. The tutorial revisits fundamental concepts related to prediction and introduces foundational ideas pertinent to machine learning, all while using a different dataset than in previous tutorials. The process involves loading a dataset, defining variables, and fitting a model using linear regression. Subsequent sections delve into the creation of a design matrix, obtaining predictions, and measuring prediction error with metrics such as R^2 and RMSE. The tutorial also addresses the critical concept of overfitting, discussing its causes, consequences, and potential solutions, such as internal and external validation methods.
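A minimal sketch of this workflow in R is shown below. It assumes a hypothetical data frame `dat` with a continuous outcome `y` and illustrative predictors `age`, `sex`, and `bmi`; the tutorial uses its own dataset and variables.

```r
# Hypothetical data frame `dat` with continuous outcome `y`
fit <- lm(y ~ age + sex + bmi, data = dat)   # multiple linear regression

X    <- model.matrix(fit)                    # design matrix used by the model
pred <- predict(fit, newdata = dat)          # predictions on the same data

# Prediction-error metrics
rmse <- sqrt(mean((dat$y - pred)^2))         # root mean squared error
r2   <- cor(dat$y, pred)^2                   # squared correlation as R^2
```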

Revisiting: Data Splitting

This tutorial emphasizes the crucial concept of data splitting in the context of predictive modeling and machine learning, using a different dataset than in previous tutorials. The process begins with loading a dataset and then strategically splitting it into training and test subsets, ensuring a robust approach to model validation. A model is trained using the training data, and its performance is evaluated using metrics such as R^2 and RMSE, through a custom function that extracts these performance measures. Applying this function to different datasets (training, test, and the entire dataset) enables a comprehensive understanding of the model’s predictive accuracy and fit across different data scenarios.
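The idea can be sketched as follows, again assuming a hypothetical data frame `dat` with outcome `y`; the helper function mirrors the kind of custom performance extractor the tutorial describes:

```r
set.seed(123)
n         <- nrow(dat)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))  # e.g., a 70/30 split
train     <- dat[train_idx, ]
test      <- dat[-train_idx, ]

fit <- lm(y ~ ., data = train)   # fit on the training subset only

# Helper returning R^2 and RMSE for any dataset
perf <- function(model, newdata, outcome) {
  pred <- predict(model, newdata = newdata)
  obs  <- newdata[[outcome]]
  c(R2 = cor(obs, pred)^2, RMSE = sqrt(mean((obs - pred)^2)))
}

perf(fit, train, "y")  # training performance (optimistic)
perf(fit, test,  "y")  # test performance (honest estimate)
perf(fit, dat,   "y")  # whole-data performance
```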

Revisiting: Cross-validation

This tutorial delves into the concept of cross-validation, a pivotal technique in predictive modeling and machine learning, using a distinct dataset for illustrative purposes. The process of k-fold cross-validation is explored, wherein the data is partitioned into ‘k’ subsets, and the model is trained ‘k’ times, each time using a different subset as the test set and the remaining data as the training set. This method is employed to assess the model’s predictive performance and to mitigate the risk of results being dependent on the initial data split. The tutorial demonstrates both manual calculations for individual folds and the utilization of the caret package to automate the cross-validation process, thereby providing a comprehensive overview of the method.
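A sketch of both approaches, assuming a continuous outcome `y` in a hypothetical data frame `dat`:

```r
library(caret)

# Automated k-fold cross-validation via caret
set.seed(123)
ctrl   <- trainControl(method = "cv", number = 5)
cv_fit <- train(y ~ ., data = dat, method = "lm", trControl = ctrl)
cv_fit$results   # cross-validated RMSE and R^2

# Manual calculation for a single fold, for intuition
folds     <- createFolds(dat$y, k = 5)      # lists of test-set indices
test      <- dat[folds[[1]], ]
train     <- dat[-folds[[1]], ]
fold_fit  <- lm(y ~ ., data = train)
fold_pred <- predict(fold_fit, newdata = test)
sqrt(mean((test$y - fold_pred)^2))          # RMSE for fold 1
```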

Revisiting: Explore Relationships for Binary Outcomes

The tutorial navigates through the concept of predicting binary outcomes using logistic regression, emphasizing the application of various model evaluation metrics and methodologies in a machine learning context. It begins by ensuring that the outcome variable is treated as a factor and then proceeds to model fitting, where logistic regression is applied to predict a binary outcome. The model’s predictive performance is evaluated using metrics like the Area Under the Curve (AUC) and the Brier score, which assess the model’s ability to discriminate between outcome classes and the mean squared difference between predicted probabilities and actual outcomes, respectively. Furthermore, the tutorial explores k-fold cross-validation using the caret package, providing a robust method to assess the model’s predictive performance while avoiding overfitting. It also touches upon variable selection using stepwise regression with the Akaike Information Criterion (AIC) as the selection criterion.
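The core steps might look like the following sketch, assuming a hypothetical data frame `dat` whose binary outcome `out` has valid factor level names (e.g., "No"/"Yes", which caret requires for class probabilities); the predictors are illustrative:

```r
library(pROC)
library(caret)

dat$out <- as.factor(dat$out)                  # outcome treated as a factor
fit <- glm(out ~ age + sex + bmi, data = dat, family = binomial)
p   <- predict(fit, type = "response")         # predicted probabilities

auc(roc(dat$out, p))                           # AUC (discrimination)
mean((p - (as.numeric(dat$out) - 1))^2)        # Brier score (second level = event)

step_fit <- step(fit, trace = 0)               # stepwise selection by AIC

# 5-fold cross-validation with caret
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
cv_fit <- train(out ~ age + sex + bmi, data = dat, method = "glm",
                family = binomial, trControl = ctrl, metric = "ROC")
```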

Supervised Learning

This tutorial delves into the realm of supervised learning, moving beyond traditional statistical regression to introduce various machine learning methods tailored to both continuous and binary outcomes. The tutorial explores different regularization techniques, such as LASSO, Ridge, and Elastic Net, which are used to prevent overfitting by penalizing large coefficients in regression models. It also introduces decision trees (CART), which provide a flexible, hierarchical approach to modeling data and can automatically incorporate non-linear effects and interactions. The tutorial further explores ensemble methods, which combine predictions from multiple models to improve predictive accuracy. Two types of ensemble methods are discussed: Type I, which trains the same model on different samples of the data (e.g., bagging and boosting), and Type II, which trains different models on the same data (e.g., Super Learner). Various model evaluation metrics and cross-validation techniques are used throughout to assess and enhance the predictive performance of the models.
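As a rough illustration under the same hypothetical `dat` with continuous outcome `y` (glmnet, rpart, and randomForest are one common set of tools for these models; the tutorial may use different ones):

```r
library(glmnet)
library(rpart)
library(randomForest)

X <- model.matrix(y ~ ., data = dat)[, -1]   # predictor matrix (drop intercept)
Y <- dat$y

# Regularized regression: alpha = 1 is LASSO, alpha = 0 is Ridge,
# and 0 < alpha < 1 gives Elastic Net
cv_lasso <- cv.glmnet(X, Y, alpha = 1)       # lambda chosen by cross-validation
coef(cv_lasso, s = "lambda.min")             # coefficients at the optimal lambda

# Decision tree (CART)
tree <- rpart(y ~ ., data = dat, method = "anova")

# A Type I ensemble: random forest (bagged trees on bootstrap samples)
rf <- randomForest(y ~ ., data = dat)
```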

Unsupervised Learning

The tutorial introduces unsupervised learning, with a focus on clustering, a technique that categorizes data into distinct groups based on similarity without using predefined labels. The k-means clustering algorithm is highlighted, which partitions data into k groups by minimizing within-cluster variation, typically measured as the sum of squared Euclidean distances. The algorithm iteratively assigns each data point to the cluster with the nearest mean and recalculates the cluster means until the assignments no longer change. Various examples illustrate how to apply k-means clustering to different datasets and variable combinations. The tutorial also discusses how to determine the optimal number of clusters, k, and addresses challenges such as the influence of outliers and sensitivity to the initial assignment of cluster means.
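A minimal sketch, assuming two illustrative variables `x1` and `x2` in a hypothetical data frame `dat`:

```r
set.seed(123)
Xs <- scale(dat[, c("x1", "x2")])   # scale so no variable dominates the distance

# k-means with k = 3; nstart repeats the algorithm from multiple
# initial assignments to reduce sensitivity to the starting means
km <- kmeans(Xs, centers = 3, nstart = 25)
km$cluster       # cluster assignments
km$tot.withinss  # total within-cluster sum of squares

# Elbow method: within-cluster SS across candidate values of k
wss <- sapply(1:10, function(k) kmeans(Xs, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```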

Machine Learning for Health Survey Data Using NHIS Data

This tutorial provides a step-by-step guide to fitting machine learning models with health survey data, specifically using the National Health Interview Survey (NHIS) 2016 dataset to predict high-impact chronic pain (HICP) among adults aged 65 years or older. The tutorial covers the use of (1) LASSO and (2) random forest models with sampling weights for population-level predictions. It also discusses the split-sample approach for internal validation, while acknowledging that cross-validation and bootstrapping may be better alternatives. The tutorial includes exploration of the analytic dataset, weight normalization, split-sample creation, defining regression formulas, fitting LASSO models with the optimal lambda, and evaluating model performance with metrics such as AUC, calibration slope, and Brier score. The same process is then repeated for random forest models, a variable importance plot is generated to identify influential predictors, and a performance comparison table is provided.
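The overall shape of the analysis might look like the sketch below. It assumes a hypothetical analytic data frame `nhis` with a binary factor outcome `hicp` and a sampling weight column `weight`; the variable names, the 70/30 split, and the choice of glmnet and ranger are illustrative, not the tutorial’s exact code:

```r
library(glmnet)
library(ranger)
library(pROC)

nhis$wt_norm <- nhis$weight / mean(nhis$weight)   # normalize weights to mean 1

# Split-sample internal validation
set.seed(2016)
idx   <- sample(seq_len(nrow(nhis)), size = floor(0.7 * nrow(nhis)))
train <- nhis[idx, ]
test  <- nhis[-idx, ]

covars <- setdiff(names(nhis), c("hicp", "weight", "wt_norm"))
form   <- reformulate(covars, response = "hicp")

# Weighted LASSO with cross-validated (optimal) lambda
X_tr   <- model.matrix(form, data = train)[, -1]
cv_fit <- cv.glmnet(X_tr, train$hicp, family = "binomial",
                    weights = train$wt_norm, alpha = 1)
p_lasso <- predict(cv_fit, newx = model.matrix(form, data = test)[, -1],
                   s = "lambda.min", type = "response")
auc(roc(test$hicp, as.numeric(p_lasso)))   # test-set AUC

# Weighted random forest via ranger's case weights, with variable importance
rf <- ranger(form, data = train, case.weights = train$wt_norm,
             probability = TRUE, importance = "impurity")
sort(rf$variable.importance, decreasing = TRUE)   # for an importance plot
```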

Replicate Results from a Published Article

The tutorial guides users through implementing machine learning techniques with health survey data, specifically for predicting high-impact chronic pain (HICP). It seeks to replicate results from a 2023 article that used the National Health Interview Survey (NHIS) 2016 dataset. For simplicity, complete-case data are used in this tutorial. In the original research, the authors developed prediction models for HICP and evaluated their performance within specific sociodemographic groups, such as gender, age groups, and race/ethnicity. They adopted LASSO and random forest models with 5-fold cross-validation, and factored survey weights into both models to achieve population-level predictions.

Tip

Optional Content:

Some sections conclude with an optional video walkthrough that demonstrates the code. Keep in mind that the content might have been updated since these videos were recorded.

Warning

Bug Report:

Fill out this form to report any issues with the tutorial.

References