Continuous Outcomes

We will now go through an example of using TMLE for a continuous outcome. The setup for SuperLearner in this case is similar to that for binary outcomes, so rather than going through the SuperLearner steps again, we will instead focus on the additional steps that are necessary for running the tmle method on continuous outcomes.

Frank and Karim (2023) extensively discussed the implementation of TMLE for continuous outcomes, providing a detailed step-by-step guide using the openly accessible RHC dataset. In this tutorial, we will revisit the same example with additional explanations.

Note

The table below covers only the outcome variable (length of stay). The values differ slightly from those in Table 2 of Connors et al. (1996), where the means were 20.5 vs. 25.7 and the medians were 16 vs. 17.

library(tableone)
# summarize length of stay by exposure group
tab1 <- CreateTableOne(vars = c("Length.of.Stay"),
                       data = ObsData, 
                       strata = "RHC.use", 
                       test = FALSE)
print(tab1, showAllLevels = FALSE)
#>                             Stratified by RHC.use
#>                              0             1            
#>   n                           3551          2184        
#>   Length.of.Stay (mean (SD)) 19.53 (23.59) 24.86 (28.90)
median(ObsData$Length.of.Stay[ObsData$RHC.use==0])
#> [1] 12
median(ObsData$Length.of.Stay[ObsData$RHC.use==1])
#> [1] 16

Constructing SuperLearner

Just as we did for a binary outcome, we will need to specify two SuperLearners, one for the exposure and one for the outcome model.

The effective sample size for a continuous outcome is just \(n_{eff}=n=5,735\). We calculated the effective sample size for the exposure model earlier, which also turned out to be \(n_{eff}=n=5,735\). So once again we will use 5 folds because \(5,000 \leq n_{eff} \leq 10,000\) (Phillips et al. 2023).

Similarly to our example with the binary outcome, the key considerations for the library of learners are:

  • We have some continuous covariates, and should therefore include learners that allow non-linear/monotonic relationships.

  • We have a large \(n\), so should include as many learners as is computationally feasible.

  • We have 49 covariates and 5,735 observations, so we do not have high-dimensional data and including screeners is optional.

Again, the requirements for the exposure and outcome models are the same, so we can use the same library for both. Although one model has a binary dependent variable and the other a continuous one, most of the available learners adapt automatically to either type.

For this example, we will use the same SuperLearner library as for the binary outcome example.

# Construct the SuperLearner library
SL.library <- c("SL.mean", 
                "SL.glm", 
                "SL.glmnet", 
                "SL.xgboost", 
                "SL.randomForest", 
                "tmle.SL.dbarts2", 
                "SL.svm")

Dealing with continuous outcomes

For this example, the outcome is length of stay in hospital.

The key difference from running TMLE on a binary outcome is that we must first transform the continuous outcome to fall within the range of 0 to 1, so that the modeled outcomes fall within the range of the outcome’s true distribution (Gruber and van der Laan 2010).

To transform the outcome, we can use min-max normalization:

\[ Y_{transformed} = \frac{Y-Y_{min}}{Y_{max}-Y_{min}} \]

set.seed(1444) 
# transform the outcome to fall within the range [0,1]
min.Y <- min(ObsData$Length.of.Stay)
max.Y <- max(ObsData$Length.of.Stay)
ObsData$Length.of.Stay_transf <- 
  (ObsData$Length.of.Stay-min.Y)/
  (max.Y-min.Y)
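The transformation can be checked on a small example. The sketch below uses made-up length-of-stay values (the vector `y` and the names `y_min`, `y_max`, `y_transf`, and `y_back` are hypothetical, not taken from the RHC data) to show that min-max normalization maps the smallest value to 0 and the largest to 1, and that the original values are recovered by inverting it.

```r
# Toy length-of-stay values (hypothetical, not from the RHC data)
y <- c(2, 5, 9, 14, 30, 61)

# Min-max normalization to the range [0, 1]
y_min <- min(y)
y_max <- max(y)
y_transf <- (y - y_min) / (y_max - y_min)

# The smallest value maps to 0 and the largest to 1
range(y_transf)

# Inverting the transformation recovers the original values
y_back <- y_transf * (y_max - y_min) + y_min
all.equal(y_back, y)
```

Because the transformation is linear, it changes only the location and scale of the outcome, not the ordering or relative spacing of the observations.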

Once we have transformed the outcome to fall within the range of 0 to 1, we can run TMLE as before, using the tmle method in the tmle package:

# create data frame containing only covariates
ObsData.noYA <- dplyr::select(ObsData, 
                              !c(Length.of.Stay_transf, 
                                 Length.of.Stay, 
                                 RHC.use))
set.seed(1444) 

# run tmle
tmle.fit.cont <- tmle::tmle(Y = ObsData$Length.of.Stay_transf, 
                       A = ObsData$RHC.use, 
                       W = ObsData.noYA, 
                       family = "gaussian", 
                       V.Q = 5,
                       V.g = 5,
                       Q.SL.library = SL.library,
                       g.SL.library = SL.library)

Once the tmle method has run, we still have one step to complete to get our final estimate. At this point, we must transform the average treatment effect generated by the tmle method (\(\widehat{ATE}\)) back to the outcome’s original scale:

\[ \widehat{ATE}_{rescaled} = (Y_{max}-Y_{min}) \times \widehat{ATE} \]

# transform back the ATE estimate
tmle.est.cont <- (max.Y-min.Y)*
  tmle.fit.cont$estimates$ATE$psi
tmle.est.cont
#> [1] 2.939622
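One way to see why only the range appears in the rescaling is to invert the min-max transformation and take the difference in mean potential outcomes. In the sketch below, \(Y^{1}\) and \(Y^{0}\) denote the potential outcomes with and without RHC; this notation is an assumption introduced here for illustration. The shift \(Y_{min}\) enters both means and cancels in the difference:

\[
\begin{aligned}
Y &= Y_{min} + (Y_{max}-Y_{min})\,Y_{transformed} \\
E[Y^{1}] - E[Y^{0}] &= \left(Y_{min} + (Y_{max}-Y_{min})\,E[Y^{1}_{transformed}]\right) - \left(Y_{min} + (Y_{max}-Y_{min})\,E[Y^{0}_{transformed}]\right) \\
&= (Y_{max}-Y_{min})\left(E[Y^{1}_{transformed}] - E[Y^{0}_{transformed}]\right)
\end{aligned}
\]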

We also have to transform the confidence interval back to the original scale:

tmle.ci.cont <- (max.Y-min.Y)*
  tmle.fit.cont$estimates$ATE$CI
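The two back-transformations can also be wrapped in a small helper. The sketch below is illustrative only: `rescale_ate()` is a hypothetical function, not part of the tmle package, and the inputs are made-up values on the 0 to 1 scale rather than results from the fit above. Because the multiplier is positive, the lower and upper bounds keep their order.

```r
# Hypothetical helper: rescale an ATE and its CI from the [0, 1] scale
# back to the outcome's original scale
rescale_ate <- function(psi, ci, y_min, y_max) {
  y_range <- y_max - y_min
  c(estimate = y_range * psi,
    lower    = y_range * ci[1],
    upper    = y_range * ci[2])
}

# Made-up inputs: an outcome ranging from 2 to 402 days
res <- rescale_ate(psi = 0.0075,
                   ci = c(0.005, 0.010),
                   y_min = 2,
                   y_max = 402)
round(res, 2)
```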

ATE for continuous outcome: 2.9396218, and 95% CI is 1.959698, 3.9195455.

The results indicate that if all participants had received RHC, the average length of stay in hospital would be 2.94 (95% CI: 1.96, 3.92) days longer than if no participants had received RHC.

Understanding defaults

Transform outcome:

set.seed(1444) 
# transform the outcome to fall within the range [0,1]
min.Y <- min(ObsData$Length.of.Stay)
max.Y <- max(ObsData$Length.of.Stay)
ObsData$Length.of.Stay_transf <- 
  (ObsData$Length.of.Stay-min.Y)/
  (max.Y-min.Y)

Run TMLE, using the tmle package’s default SuperLearner library:

# create data frame containing only covariates
ObsData.noYA <- dplyr::select(ObsData, 
                              !c(Length.of.Stay_transf, 
                                 Length.of.Stay, 
                                 RHC.use))
set.seed(1444) 

# run tmle
tmle.fit.cont.def <- tmle::tmle(
  Y = ObsData$Length.of.Stay_transf, 
  A = ObsData$RHC.use, 
  W = ObsData.noYA,
  family = "gaussian",
  V.Q = 5,
  V.g = 5)
# Q.SL.library and g.SL.library are not supplied here,
# so tmle() uses its default SuperLearner libraries

Transform the average treatment effect generated by the tmle method (\(\widehat{ATE}\)) back to the outcome’s original scale:

\[ \widehat{ATE}_{rescaled} = (Y_{max}-Y_{min}) \times \widehat{ATE} \]

# transform back the ATE estimate
tmle.est.cont.def <- (max.Y-min.Y)*
  tmle.fit.cont.def$estimates$ATE$psi
tmle.est.cont.def
#> [1] 3.352891

Transform the confidence interval back to the original scale:

tmle.ci.cont.def <- (max.Y-min.Y)*
  tmle.fit.cont.def$estimates$ATE$CI

ATE for continuous outcome using default library: 3.0362984, and 95% CI 1.2686301, 4.8039667.

The estimate using the default SuperLearner library (3.04) is similar to the estimate we got when using our user-specified SuperLearner library (2.94). However, the confidence interval using the default SuperLearner library (1.27, 4.80) was much wider than that using our user-specified SuperLearner library (1.96, 3.92).
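The difference in precision can be quantified by comparing interval widths. The sketch below copies the rescaled confidence limits from the two "ATE for continuous outcome" lines above into hypothetical vectors (`ci_user` and `ci_default` are names introduced here) and computes each width and their ratio.

```r
# Rescaled 95% confidence limits as printed in the text above
ci_user    <- c(lower = 1.959698,  upper = 3.9195455)
ci_default <- c(lower = 1.2686301, upper = 4.8039667)

# Width of each interval on the original (days) scale
width_user    <- unname(diff(ci_user))
width_default <- unname(diff(ci_default))

# How many times wider the default-library interval is
width_default / width_user
```

With these limits, the default-library interval is roughly 1.8 times as wide as the interval from the user-specified library.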

Comparison of results

Adjusted regression:

library(Publish)
# regress length of stay on the exposure (primary interest),
# adjusting for the baseline covariates and death
baselineVars.LoS <- c(baselinevars, "Death")
out.formula.cont <- as.formula(
  paste("Length.of.Stay ~ RHC.use +", 
        paste(baselineVars.LoS,
              collapse = " + ")))
fit1.cont <- lm(out.formula.cont, data = ObsData)
publish(fit1.cont, digits=1)$regressionTable[2,]

Connors et al. (1996) conducted a propensity score matching analysis. Their Table 5 showed that, after 1-to-1 propensity score pair matching, the mean lengths of stay (\(Y\)), stratified by RHC (\(A\)), were not significantly different (\(p = 0.14\)).

Method                             Estimate   2.5 %   97.5 %
Adjusted Regression                    3.04    1.51     4.58
TMLE (user-specified SL library)       2.94    1.96     3.92
TMLE (default SL library)              3.35    1.43     4.86
Keele and Small (2021) paper           2.01    0.60     3.41

Differences in results can likely be attributed to the use of different SuperLearner libraries, different combinations of adjustment variables, or the random sampling involved in the cross-validation used by the SuperLearner algorithm.

References

Connors, Alfred F, Theodore Speroff, Neal V Dawson, Charles Thomas, Frank E Harrell, Douglas Wagner, Norman Desbiens, et al. 1996. “The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients.” JAMA 276 (11): 889–97. https://tinyurl.com/Connors1996.
Frank, Hanna A, and Mohammad Ehsanul Karim. 2023. “Implementing TMLE in the Presence of a Continuous Outcome.” Research Methods in Medicine & Health Sciences, 26320843231176662.
Gruber, Susan, and Mark J van der Laan. 2010. “A Targeted Maximum Likelihood Estimator of a Causal Effect on a Bounded Continuous Outcome.” The International Journal of Biostatistics 6 (1).
Phillips, Rachael V., Mark J. van der Laan, Hana Lee, and Susan Gruber. 2023. “Practical Considerations for Specifying a Super Learner.” International Journal of Epidemiology 52: 1276–85. https://doi.org/10.1093/ije/dyad023.