Binary Outcomes

For this example we will be looking at the binary outcome variable death.

# Data
load(file = "Data/machinelearningCausal/cl2.RData")

# Table 1
tab1 <- CreateTableOne(vars = c("Death"),
                       data = ObsData, 
                       strata = "RHC.use", 
                       test = FALSE)
print(tab1, showAllLevels = FALSE, )
#>                    Stratified by RHC.use
#>                     0           1          
#>   n                 3551        2184       
#>   Death (mean (SD)) 0.63 (0.48) 0.68 (0.47)

TMLE

TMLE works by first constructing an initial outcome and extracting a crude estimate of the treatment effect. Then, TMLE aims to refine the initial estimate in the direction of the true value of the parameter of interest through use of the exposure model.

To label the treatment effect given by TMLE as causal, the same conditional exchangeability, positivity, and consistency assumptions must be met as for other modeling strategies (see introduction to propensity score). TMLE also assumes that at least one of the exposure or outcome model is correctly specified. If this does not hold, TMLE does not necessarily produce a consistent estimator.

Luque-Fernandez et al. (2018) discussed the implementation of TMLE, and providing a detailed step-by-step guide, primarily focusing on a binary outcome.

The basic steps are:

  1. Construct initial outcome model & get crude estimate
  2. Construct exposure model and use propensity scores to update the initial outcome model through a targeted adjustment
  3. Extract treatment effect estimate
  4. Estimate confidence interval based on a closed-form formula

The tmle package implements TMLE for both binary and continuous outcomes, and uses the SuperLearner to construct the exposure and outcome models.

The tmle method takes a number of parameters, including:

#> Warning: package 'knitr' was built under R version 4.2.3
Term Description
Y Outcome vector
A Exposure vector
W Matrix that includes vectors of all covariates
family Distribution
V Cross-validation folds for exposure and outcome modeling
Q.SL.library Set of machine learning methods to use for SuperLearner for outcome modeling
g.SL.library Set of machine learning methods to use for SuperLearner for exposure modeling

Constructing SuperLearner

We will need to specify two SuperLearners, one for the exposure and one for the outcome model. We will need to consider the characteristics of our sample in order to decide the number of cross-validation folds and construct a diverse and computationally feasible library of algorithms.

Number of folds

First, we need to define the number of cross-validation folds to use for each model. This depends on our effective sample size (Phillips et al. 2023).

Our effective sample size for the outcome model is:

n <- nrow(ObsData) 
p <- nrow(ObsData[ObsData$Death == 1,])/n 
n_eff <- min(n, 5*(n*min(p, 1-p))) 
n_eff
#> [1] 5735

Our effective sample size for the exposure model is:

p_exp <- nrow(ObsData[ObsData$RHC.use == 1,])/n 
n_eff_exp <- min(n, 5*(n*min(p, 1-p))) 
n_eff_exp
#> [1] 5735

For both models, the effective sample size is the same as our sample size, \(n = 5,735\).

Since \(5,000 \leq n_{eff} \leq 10,000\), we should use 5 or more cross-validation folds according to Phillips et al. (2023). For the sake of computational feasibility, we will use 5 folds in this example.

Candidate learners

The second step is to define the library of learners we will feed in to SuperLearner as potential options for each model (exposure and outcome). In this example, some of our covariates are continuous variables, such as temperature and blood pressure, so we need to include learners that allow non-linear/monotonic relationships.

Since \(n\) is large (\(>5000\)), we should include as many learners as is computationally feasible in our libraries.

Furthermore, we have 50 covariates:

length(c(baselinevars, "Length.of.Stay"))
#> [1] 50

\(5735/20 = 286.75\), and \(50<286.75\), so we do not have high-dimensional data and including screeners is optional (Phillips et al. 2023).

Since the requirements for the exposure and outcome models are the same in this example, we will use the same SuperLearner library for both. Overall for this example we need to make sure to include:

  • Parametric learners

  • Highly data-adaptive learners

  • Multiple variants of the same learner with different parameter specifications

  • Learners that allow non-linear/monotonic relationships

For this example, we will include the following learners:

  • Parametric

    • SL.mean: only mean

    • SL.glm: generalized linear model

  • Highly data-adaptive

    • SL.glmnet: penalized regression such as lasso

    • SL.xgboost: extreme gradient boosting

  • Allowing non-linear/monotonic relationships

    • SL.randomForest: random forest

    • tmle.SL.dbarts2: bayesian additive regression tree

    • SL.svm: support vector machine

# Construct the SuperLearner library
SL.library <- c("SL.mean", 
                "SL.glm", 
                "SL.glmnet", 
                "SL.xgboost", 
                "SL.randomForest", 
                "tmle.SL.dbarts2", 
                "SL.svm")

TMLE with SuperLearner

To run TMLE, we need to install the tmle package and load it on the R environment.

#install.packages(c('tmle', 'xgboost'))
require(tmle)
require(xgboost)

We also need to create a data frame containing only the covariates:

ObsData.noYA <- dplyr::select(ObsData, 
                              !c(Death, RHC.use))
ObsData$Death <- as.numeric(ObsData$Death)

Then we can run TMLE using the tmle method from the tmle package:

set.seed(1444) 

tmle.fit <- tmle::tmle(Y = ObsData$Death, 
                   A = ObsData$RHC.use, 
                   W = ObsData.noYA, 
                   family = "binomial", 
                   V.Q = 5,
                   V.g = 5,
                   Q.SL.library = SL.library, 
                   g.SL.library = SL.library)

tmle.est.bin <- tmle.fit$estimates$OR$psi
tmle.ci.bin <- tmle.fit$estimates$OR$CI

ATE for binary outcome using user-specified library: 1.28 and 95% CI is 1.1992981, 1.3761869

These results show those who received RHC had odds of death that were 1.28 times as high as the odds of death in those who did not receive RHC.

Understanding defaults

We can compare the results using our specified SuperLearner library to the results we would get when using the tmle package’s default SuperLearner libraries. To do this we simply do not specify libraries for the Q.SL.library and g.SL.library arguments.

# small test library 
# with only glm just 
# for sake of making this work
SL.library.test <- c("SL.glm")
set.seed(1444) 

tmle.fit.def <- tmle::tmle(Y = ObsData$Death, 
                           A = ObsData$RHC.use, 
                           W = ObsData.noYA, 
                           family = "binomial", 
                           V.Q = 5,
                           V.g = 5)
# Q.SL.library = SL.library.test,  ## removed this line
# g.SL.library = SL.library.test)  ## removed this line

tmle.est.bin.def <- tmle.fit.def$estimates$OR$psi
tmle.ci.bin.def <- tmle.fit.def$estimates$OR$CI

ATE for binary outcome using default library: 1.32 with 95% CI 1.1466162, 1.5219877.

These ATE when using the default SuperLearner library (1.31) is very close to the ATE when using our user-specified SuperLearner library (1.29). However, the confidence interval from TMLE using the default SuperLearner library (1.17, 1.46) is slightly wider than the confidence interval from TMLE using our user-specified SuperLearner library (1.20, 1.39).

Comparison of results

We can also compare these results to those from a basic regression and from the literature.

# adjust the exposure variable 
# (primary interest) + covariates
baselineVars.Death <- c(baselinevars, "Length.of.Stay")
out.formula.bin <- as.formula(
  paste("Death~ RHC.use +",
        paste(baselineVars.Death, 
              collapse = "+")))
fit1.bin <- glm(out.formula.bin, data = ObsData, family="binomial")

Connors et al. (1996) conducted a propensity score matching analysis. Table 4 showed that, after propensity score pair (1-to-1) matching, the odds of in-hospital mortality were 39% higher in those who received RHC (OR: 1.39 (1.15, 1.67)).

method.list Estimate 2.5 % 97.5 %
Adjusted Regression 0.07 0.04 0.09
TMLE (user-specified SL library) 1.28 1.20 1.38
TMLE (default SL library) 1.32 1.15 1.52
Connors et al. (1996) paper 1.39 1.15 1.67

References

Connors, Alfred F, Theodore Speroff, Neal V Dawson, Charles Thomas, Frank E Harrell, Douglas Wagner, Norman Desbiens, et al. 1996. “The Effectiveness of Right Heart Catheterization in the Initial Care of Critically III Patients.” Jama 276 (11): 889–97. https://tinyurl.com/Connors1996.
Luque-Fernandez, Miguel Angel, Michael Schomaker, Bernard Rachet, and Mireille E Schnitzer. 2018. “Targeted Maximum Likelihood Estimation for a Binary Treatment: A Tutorial.” Statistics in Medicine 37 (16): 2530–46.
Phillips, Rachael V., Mark J. van der Laan, Hana Lee, and Susan Gruber. 2023. “Practical Considerations for Specifying a Super Learner.” International Journal of Epidemiology 52: 1276–85. https://doi.org/10.1093/ije/dyad023.