ML in causal inference

Background

This chapter provides a detailed exploration of Targeted Maximum Likelihood Estimation (TMLE) in causal inference. The first tutorial motivates the use of TMLE by highlighting its advantages over traditional methods, focusing on the limitations of these approaches and introducing the RHC dataset. The second tutorial delves into the SuperLearner method for ensemble modeling, discussing the importance of algorithm diversity, cross-validation, and adaptable libraries. The third tutorial offers a comprehensive guide to applying TMLE for binary outcomes, emphasizing diverse SuperLearner libraries, effective sample sizes, and candidate learner selection. The fourth tutorial extends TMLE to continuous outcomes, covering transformations, interpretation, and comparisons with default TMLE libraries and traditional regression. These tutorials should equip readers with a robust understanding of TMLE and its practical applications in causal inference and epidemiology.

Important

Datasets:

All of the datasets used in this tutorial can be accessed from this GitHub repository folder

Overview of tutorials

In the preceding chapters, we have developed a solid understanding of predictive modeling, data splitting/cross-validation, propensity scores and various machine learning algorithms. These insights have prepared us for the advanced causal inference techniques we will encounter in this chapter. As we explore TMLE and SuperLearner, we will draw upon our knowledge of propensity score and machine learning to unlock the potential of their use in building powerful causal inference tools.

Motivation for learning and using TMLE

This tutorial discusses the motivation for using the TMLE method in causal inference. It highlights the limitations of traditional methods such as propensity score approaches and direct application of machine learning in terms of assumptions and statistical inference. TMLE is presented as a doubly robust method that incorporates machine learning while allowing straightforward statistical inference. The tutorial reintroduces RHC data for demonstration.

Understanding SuperLearner

This tutorial focuses on using the SuperLearner method for ensemble modeling. SuperLearner is a type 2 ensemble method that combines various predictive algorithms to create a robust predictive model. It employs cross-validation to determine the best-weighted combination of algorithms, based on specified predictive performance metrics. The tutorial provides guidelines for selecting a diverse set of algorithms, considering computational feasibility, and adapting the library of algorithms based on sample characteristics. Examples of different types of learners that can be included in the SuperLearner library are provided, ranging from parametric to highly data-adaptive and non-linear models. The tutorial also mentions the default libraries for estimating outcomes and propensity scores in the context of TMLE.

Dealing with binary outcomes within TMLE framework

This tutorial is a comprehensive guide to applying TMLE for binary outcomes. It covers the entire TMLE process, from constructing initial outcome and exposure models to targeted adjustment via propensity scores and treatment effect estimation. It emphasizes the importance of specifying a diverse SuperLearner library for both exposure and outcome models, determining effective sample sizes, and selecting candidate learners. The tutorial demonstrates TMLE application using the tmle package and includes a thorough comparison with default SuperLearner libraries and traditional regression, presenting estimates, confidence intervals, and a comparative table of results.

Dealing with continuous outcomes within TMLE framework

This tutorial offers comprehensive guidance on implementing TMLE for continuous outcomes, including transformations and result interpretation. It introduces the application of TMLE for continuous outcomes, emphasizing the process of constructing a SuperLearner and key considerations such as effective sample size and learner selection. Notably, it highlights the essential transformation of continuous outcome variables to a standardized range (0 to 1) using min-max normalization before applying TMLE with a Gaussian family. The tutorial demonstrates the post-TMLE rescaling of treatment effect estimates and confidence intervals to the original scale and offers a comparative analysis with default TMLE libraries and traditional regression.

Comparing results

In this comprehensive tutorial, various statistical methods were applied to investigate the association between RHC and death using the RHC dataset. These methods included logistic regression, propensity score matching and weighting with both logistic regression and Super Learner, as well as TMLE. The results consistently indicated that participants with RHC use had higher odds of death compared to those without RHC use, with odds ratios ranging similarly across different modeling approaches, but also showing a trend when machine learners are incorporated.

Tip

For those who prefer a video walkthrough of the additional resources, feel free to watch the video below.

Warning

Bug Report:

Fill out this form to report any issues with the tutorial.

References