Chapter 7 Pre-packaged software
7.1 tmle
- The tmle package can handle both binary and continuous outcomes, and uses the SuperLearner package to construct both models, just like we did in the steps above.
- The default SuperLearner library for estimating the outcome includes (S. Gruber, Van Der Laan, and Kennedy 2020):
  - SL.glm: generalized linear models (GLMs)
  - SL.glmnet: LASSO
  - tmle.SL.dbarts2: modeling and prediction using BART
- The default library for estimating the propensity scores includes:
  - SL.glm: generalized linear models (GLMs)
  - tmle.SL.dbarts.k.5: SL wrappers for modeling and prediction using BART
  - SL.gam: generalized additive models (GAMs)
- It is certainly possible to use a different set of learners: more methods can be added by specifying lists of models in the Q.SL.library (for the outcome model) and g.SL.library (for the propensity score model) arguments (see the short sketch after this list).
- Note also that the outcome \(Y\) is required to be within the range \([0,1]\) for this method as well, so we need to pass in the transformed data and then transform back the estimate.
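For example, here is a minimal sketch of passing separate libraries to the outcome and propensity score models. The extra wrapper names (SL.gam, SL.randomForest) and the object names Q.lib and g.lib are illustrative, require the corresponding packages to be installed, and the commented call reuses the objects created in the full example below, which itself uses the same library for both models.
# illustrative only: different learner sets for the outcome (Q) and
# the propensity score (g) models
Q.lib <- c("SL.glm", "SL.glmnet", "SL.gam")
g.lib <- c("SL.glm", "SL.randomForest")
# tmle::tmle(Y = ObsData$Y_transf, A = ObsData$A, W = ObsData.noYA,
#            family = "gaussian",
#            Q.SL.library = Q.lib, g.SL.library = g.lib)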
set.seed(1444)
# transform the outcome to fall within the range [0,1]
min.Y <- min(ObsData$Y)
max.Y <- max(ObsData$Y)
ObsData$Y_transf <- (ObsData$Y - min.Y) / (max.Y - min.Y)

# run tmle from the tmle package
ObsData.noYA <- dplyr::select(ObsData,
                              !c(Y_transf, Y, A))
SL.library <- c("SL.glm",
                "SL.glmnet",
                "SL.xgboost")
tmle.fit <- tmle::tmle(Y = ObsData$Y_transf,
                       A = ObsData$A,
                       W = ObsData.noYA,
                       family = "gaussian",
                       V = 3,
                       Q.SL.library = SL.library,
                       g.SL.library = SL.library)
tmle.fit
## Additive Effect
## Parameter Estimate: 0.0073229
## Estimated Variance: 2.0642e-06
## p-value: 3.4526e-07
## 95% Conf Interval: (0.0045069, 0.010139)
##
## Additive Effect among the Treated
## Parameter Estimate: 0.0054449
## Estimated Variance: 3.4095e-06
## p-value: 0.00319
## 95% Conf Interval: (0.0018258, 0.0090641)
##
## Additive Effect among the Controls
## Parameter Estimate: 0.01266
## Estimated Variance: 1.9251e-06
## p-value: <2e-16
## 95% Conf Interval: (0.0099407, 0.01538)
summary(tmle.fit)
## Initial estimation of Q
## Procedure: cv-SuperLearner, ensemble
## Model:
## Y ~ SL.glm_All + SL.glmnet_All + SL.xgboost_All
##
## Coefficients:
## SL.glm_All 0.316376
## SL.glmnet_All 0.4996009
## SL.xgboost_All 0.1840231
##
## Cross-validated R squared : 0.0607
##
## Estimation of g (treatment mechanism)
## Procedure: SuperLearner, ensemble Empirical AUC = 0.9388
##
## Model:
## A ~ SL.glm_All + SL.glmnet_All + SL.xgboost_All
##
## Coefficients:
## SL.glm_All 0
## SL.glmnet_All 0.6490267
## SL.xgboost_All 0.3509733
##
## Estimation of g.Z (intermediate variable assignment mechanism)
## Procedure: No intermediate variable
##
## Estimation of g.Delta (missingness mechanism)
## Procedure: No missingness, ensemble
##
## Bounds on g: (0.0076, 1)
##
## Bounds on g for ATT/ATE: (0.0076, 0.9924)
##
## Additive Effect
## Parameter Estimate: 0.0073229
## Estimated Variance: 2.0642e-06
## p-value: 3.4526e-07
## 95% Conf Interval: (0.0045069, 0.010139)
##
## Additive Effect among the Treated
## Parameter Estimate: 0.0054449
## Estimated Variance: 3.4095e-06
## p-value: 0.00319
## 95% Conf Interval: (0.0018258, 0.0090641)
##
## Additive Effect among the Controls
## Parameter Estimate: 0.01266
## Estimated Variance: 1.9251e-06
## p-value: <2e-16
## 95% Conf Interval: (0.0099407, 0.01538)
tmle_est_tr <- tmle.fit$estimates$ATE$psi
tmle_est_tr
## [1] 0.00732285
# transform back the ATE estimate
tmle_est <- (max.Y - min.Y) * tmle_est_tr
tmle_est
## [1] 2.870557
saveRDS(tmle_est, file = "data/tmle.RDS")
tmle_ci <- paste("(",
                 round((max.Y - min.Y) * tmle.fit$estimates$ATE$CI[1], 3), ", ",
                 round((max.Y - min.Y) * tmle.fit$estimates$ATE$CI[2], 3), ")", sep = "")
tmle.ci <- (max.Y - min.Y) * tmle.fit$estimates$ATE$CI
saveRDS(tmle.ci, file = "data/tmleci.RDS")
## ATE from tmle package: 2.870557(1.767, 3.974)
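Because the back-transformation is just a linear rescaling by the range of \(Y\), the estimated variance scales by the square of that range. A quick check, assuming the ATE variance is stored in estimates$ATE$var.psi (the element name used by recent versions of the tmle package):
# variance on the original outcome scale (rescaled by the squared range of Y)
(max.Y - min.Y)^2 * tmle.fit$estimates$ATE$var.psi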
Notes about the tmle package:
- does not scale the outcome for you
- can give some error messages when dealing with variable types it is not expecting
- practically all steps are nicely packed into one function, which makes it very easy to use, but you need to dig a little to truly understand what it does
Most helpful resources:
7.2 tmle (reduced computation)
We can use the previously calculated propensity score predictions from SL (calculated using the WeightIt package) in the tmle call to reduce some computing time.
<- readRDS(file = "data/ipwslps.RDS")
ps.obj <- ps.obj$weights
ps.SL <- tmle::tmle(Y = ObsData$Y_transf,
tmle.fit2 A = ObsData$A,
W = ObsData.noYA,
family = "gaussian",
V = 3,
Q.SL.library = SL.library,
g1W = ps.SL)
tmle.fit2
## Additive Effect
## Parameter Estimate: 0.0079113
## Estimated Variance: 0.0063697
## p-value: 0.92104
## 95% Conf Interval: (-0.14852, 0.16434)
##
## Additive Effect among the Treated
## Parameter Estimate: 0.016964
## Estimated Variance: 0.043265
## p-value: 0.935
## 95% Conf Interval: (-0.39072, 0.42465)
##
## Additive Effect among the Controls
## Warning: Procedure failed to converge
## Parameter Estimate: 0.00063631
## Estimated Variance: 9.7303e-07
## p-value: 0.51888
## 95% Conf Interval: (-0.0012971, 0.0025697)
# transform back ATE estimate
(max.Y - min.Y) * tmle.fit2$estimates$ATE$psi
## [1] 3.101232
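Since the g1W argument expects the predicted probabilities \(P(A=1 \mid W)\) rather than weights, a quick sanity check on whatever vector is supplied can help catch mistakes. A minimal sketch using the ps.SL vector defined above:
# g1W should be one probability per observation, strictly between 0 and 1
length(ps.SL) == nrow(ObsData)
summary(ps.SL)
all(ps.SL > 0 & ps.SL < 1)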
7.3 sl3 (optional)
# install sl3 if not done so
# remotes::install_github("tlverse/sl3")
The sl3 package is a newer package that implements two types of Super Learning:
- discrete Super Learning, in which the best prediction algorithm (based on cross-validation) from a specified library is returned, and
- ensemble Super Learning, in which the best linear combination of the specified algorithms is returned (Coyle, Hejazi, Malenica, et al. (2021)).
The first step is to create an sl3 task, which keeps track of the roles of the variables in our problem (Coyle, Hejazi, Malenica, et al. (2021)).
require(sl3)
# create sl3 task, specifying outcome and covariates
rhc_task <- make_sl3_Task(
  data = ObsData,
  covariates = colnames(ObsData)[-which(names(ObsData) == "Y")],
  outcome = "Y"
)
rhc_task
## A sl3 Task with 5735 obs and these nodes:
## $covariates
## [1] "Disease.category" "Cancer" "Cardiovascular"
## [4] "Congestive.HF" "Dementia" "Psychiatric"
## [7] "Pulmonary" "Renal" "Hepatic"
## [10] "GI.Bleed" "Tumor" "Immunosupperssion"
## [13] "Transfer.hx" "MI" "age"
## [16] "sex" "edu" "DASIndex"
## [19] "APACHE.score" "Glasgow.Coma.Score" "blood.pressure"
## [22] "WBC" "Heart.rate" "Respiratory.rate"
## [25] "Temperature" "PaO2vs.FIO2" "Albumin"
## [28] "Hematocrit" "Bilirubin" "Creatinine"
## [31] "Sodium" "Potassium" "PaCo2"
## [34] "PH" "Weight" "DNR.status"
## [37] "Medical.insurance" "Respiratory.Diag" "Cardiovascular.Diag"
## [40] "Neurological.Diag" "Gastrointestinal.Diag" "Renal.Diag"
## [43] "Metabolic.Diag" "Hematologic.Diag" "Sepsis.Diag"
## [46] "Trauma.Diag" "Orthopedic.Diag" "race"
## [49] "income" "A" "Y_transf"
##
## $outcome
## [1] "Y"
##
## $id
## NULL
##
## $weights
## NULL
##
## $offset
## NULL
##
## $time
## NULL
Next, we create our SuperLearner. To do this,
- we need to specify a selection of machine learning algorithms we want to include as candidates, as well as
- a metalearner that the SuperLearner will use to combine or choose from the machine learning algorithms provided (Coyle, Hejazi, Malenica, et al. (2021)).
# see what algorithms are available for a continuous outcome
# (similar can be done for a binary outcome)
sl3_list_learners("continuous")
## [1] "Lrnr_arima" "Lrnr_bartMachine"
## [3] "Lrnr_bilstm" "Lrnr_bound"
## [5] "Lrnr_caret" "Lrnr_cv_selector"
## [7] "Lrnr_dbarts" "Lrnr_earth"
## [9] "Lrnr_expSmooth" "Lrnr_gam"
## [11] "Lrnr_gbm" "Lrnr_glm"
## [13] "Lrnr_glm_fast" "Lrnr_glmnet"
## [15] "Lrnr_grf" "Lrnr_gru_keras"
## [17] "Lrnr_gts" "Lrnr_h2o_glm"
## [19] "Lrnr_h2o_grid" "Lrnr_hal9001"
## [21] "Lrnr_HarmonicReg" "Lrnr_hts"
## [23] "Lrnr_lstm" "Lrnr_lstm_keras"
## [25] "Lrnr_mean" "Lrnr_multiple_ts"
## [27] "Lrnr_nnet" "Lrnr_nnls"
## [29] "Lrnr_optim" "Lrnr_pkg_SuperLearner"
## [31] "Lrnr_pkg_SuperLearner_method" "Lrnr_pkg_SuperLearner_screener"
## [33] "Lrnr_polspline" "Lrnr_randomForest"
## [35] "Lrnr_ranger" "Lrnr_rpart"
## [37] "Lrnr_rugarch" "Lrnr_screener_correlation"
## [39] "Lrnr_solnp" "Lrnr_stratified"
## [41] "Lrnr_svm" "Lrnr_tsDyn"
## [43] "Lrnr_xgboost"
The chosen candidate algorithms can be created and collected in a Stack.
# initialize candidate learners
lrn_glm <- make_learner(Lrnr_glm)
lrn_lasso <- make_learner(Lrnr_glmnet) # alpha default is 1
xgb_5 <- Lrnr_xgboost$new(nrounds = 5)

# collect learners in stack
stack <- make_learner(
  Stack, lrn_glm, lrn_lasso, xgb_5
)
The stack is then given to the SuperLearner.
# to make an ensemble SuperLearner
sl_meta <- Lrnr_nnls$new()
sl <- Lrnr_sl$new(
  learners = stack,
  metalearner = sl_meta)

# or a discrete SuperLearner
sl_disc_meta <- Lrnr_cv_selector$new()
sl_disc <- Lrnr_sl$new(
  learners = stack,
  metalearner = sl_disc_meta
)
The SuperLearner is then trained on the sl3 task we created at the start, after which it can be used to make predictions.
set.seed(1444)
# train SL
sl_fit <- sl$train(rhc_task)
# or for discrete SL
# sl_fit <- sl_disc$train(rhc_task)

# make predictions
sl3_data <- ObsData
sl3_data$sl_preds <- sl_fit$predict()

sl3_est <- mean(sl3_data$sl_preds[sl3_data$A == 1]) -
  mean(sl3_data$sl_preds[sl3_data$A == 0])
sl3_est
## [1] 5.331201
saveRDS(sl3_est, file = "data/sl3.RDS")
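To see how the ensemble combined the candidate learners, the fitted SuperLearner can be inspected. A short sketch, assuming sl_fit from above and the cv_risk() interface available in recent sl3 versions:
# ensemble weights assigned to each candidate learner
sl_fit$coefficients
# cross-validated risk of each candidate (squared-error loss for a continuous outcome)
sl_fit$cv_risk(eval_fun = loss_squared_error)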
Notes about the sl3 package:
- fairly easy to implement & understand structure
- large selection of candidate algorithms provided
- unsure why the result is so different from the other estimates; one possible reason is explored in the sketch below
- very different structure from SuperLearner library, but very customizable
- could use more explanations of when to use what metalearner and what exactly the structure of the metalearner construction means
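One difference from the earlier G-computation steps is that the predictions above are made at the observed treatment values rather than with \(A\) set to 1 and 0 for everyone. A rough sketch of that counterfactual contrast with sl3, assuming ObsData and sl_fit from above and that the fitted learner's predict() method accepts a new task (as in current sl3 versions); the object names ObsData.A1, task.A1, and covs are illustrative:
# counterfactual predictions with A set to 1 and to 0 for all subjects
ObsData.A1 <- ObsData
ObsData.A1$A <- 1
ObsData.A0 <- ObsData
ObsData.A0$A <- 0
covs <- colnames(ObsData)[-which(names(ObsData) == "Y")]
task.A1 <- make_sl3_Task(data = ObsData.A1, covariates = covs, outcome = "Y")
task.A0 <- make_sl3_Task(data = ObsData.A0, covariates = covs, outcome = "Y")
mean(sl_fit$predict(task.A1)) - mean(sl_fit$predict(task.A0))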
Most helpful resources:
- tlverse sl3 page
- sl3 GitHub repository
- tlverse handbook chapter 6
- Vignettes in R
7.4 RHC results
Gathering previously saved results:
<- readRDS(file = "data/adjreg.RDS")
fit.reg <- fit.reg$coefficients[2]
TEr <- as.numeric(confint(fit.reg, 'A'))
CIr <- readRDS(file = "data/match.RDS")
fit.matched <- fit.matched$coefficients[2]
TEm <- as.numeric(confint(fit.matched, 'A'))
CIm <- readRDS(file = "data/gcomp.RDS")
TEg <- readRDS(file = "data/gcompci.RDS")
CIg <- CIg$percent[4:5]
CIgc <- readRDS(file = "data/gcompxg.RDS")
TE1g <- readRDS(file = "data/gcompls.RDS")
TE2g <- readRDS(file = "data/gcompsl.RDS")
TE3g <- readRDS(file = "data/ipw.RDS")
ipw <- ipw$coefficients[2]
TEi <- as.numeric(confint(ipw, 'A'))
CIi <- readRDS(file = "data/ipwsl.RDS")
ipwsl <- ipwsl$coefficients[2]
TEsli <- as.numeric(confint(ipwsl, 'A'))
CIsli <- readRDS(file = "data/tmlepointh.RDS")
tmleh <- readRDS(file = "data/tmlecih.RDS")
tmlecih <- readRDS(file = "data/tmle.RDS")
tmlesl <- readRDS(file = "data/tmleci.RDS")
tmlecisl <- readRDS(file = "data/sl3.RDS")
slp <- rep(NA,2)
ci.b <- 2.01
ks <- c(0.6,3.41)
ci.ks <- as.numeric(c(TEr, TEm, TEg, TE1g, TE2g,
point
TE3g, TEi, TEsli, tmleh,
tmlesl, slp, ks))<- cbind(CIr, CIm, CIgc, ci.b, ci.b, ci.b,
CIs
CIi, CIsli, tmlecih, tmlecisl, ci.b, ci.ks)
<- c("Adj. Reg","PS match",
method.list "G-comp (linear reg.)","G-comp (xgboost)",
"G-comp (lasso)", "G-comp (SL)",
"IPW (logistic)", "IPW (SL)",
"TMLE (9 steps)", "TMLE (package)",
"sl3 (package)", "Keele and Small (2021) paper")
<- data.frame(method.list)
results $Estimate <- round(point,2)
results$`2.5 %` <- CIs[1,]
results$`97.5 %` <- CIs[2,]
resultskable(results,digits = 2)%>%
row_spec(10, bold = TRUE, color = "white", background = "#D7261E")
method.list | Estimate | 2.5 % | 97.5 % |
---|---|---|---|
Adj. Reg | 2.90 | 1.37 | 4.43 |
PS match | 3.25 | 1.45 | 5.05 |
G-comp (linear reg.) | 2.90 | 1.52 | 4.59 |
G-comp (xgboost) | 4.11 | | |
G-comp (lasso) | 2.72 | | |
G-comp (SL) | 1.91 | | |
IPW (logistic) | 3.24 | 1.88 | 4.60 |
IPW (SL) | 2.90 | 1.51 | 4.29 |
TMLE (9 steps) | 2.73 | 1.59 | 3.87 |
TMLE (package) | 2.87 | 1.77 | 3.97 |
sl3 (package) | 5.33 | | |
Keele and Small (2021) paper | 2.01 | 0.60 | 3.41 |
7.5 Other packages
Other packages that may be useful:
Package | Resources | Notes |
---|---|---|
ltmle | CRAN vignette | Longitudinal |
tmle3 | GitHub, framework overview, tlverse handbook | tmle3 is still under development |
aipw | GitHub, CRAN vignette | Newer package for AIPW (another DR method) |
Others | van der Laan research group | |
You can find many other related packages on CRAN or GitHub.
Keele and Small (2021) used TMLE-SL based on an ensemble of 3 different learners: (1) GLM, (2) random forests, and (3) LASSO.
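With the SuperLearner wrappers used earlier in this chapter, a roughly comparable library could be specified as follows; this is illustrative only (the object name ks.library is made up here, and the exact learner settings in the paper may differ):
# illustrative: GLM + random forest + LASSO, as in Keele and Small (2021)
ks.library <- c("SL.glm", "SL.randomForest", "SL.glmnet")
# then pass ks.library to both Q.SL.library and g.SL.library in tmle::tmle()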