# Load the analytic version of the RHC dataset
ObsData <- readRDS(file =
  "Data/machinelearningCausal/rhcAnalyticTest.RDS")

# Baseline covariates: all variables except the exposure and the outcomes
baselinevars <- names(dplyr::select(ObsData,
                                    !c(RHC.use, Length.of.Stay, Death)))
head(ObsData)
Motivation
When using methods such as propensity score approaches, we make assumptions about the model specification; for example, we must specify any interaction terms ourselves.
With machine learning methods, these assumptions can be relaxed somewhat, as some machine learning methods allow automatic detection of data structures such as interactions.
However, machine learning was developed for prediction modeling, not with causal inference in mind. Statistical inference, such as calculating standard errors and confidence intervals, is not straightforward because the estimator produced by a machine learning method does not generally follow a known statistical distribution. By contrast, the estimators from a standard regression fit by maximum likelihood follow an approximately normal distribution, which makes it easy to calculate standard errors and confidence intervals.
Targeted maximum likelihood estimation (TMLE) is a causal inference method that can incorporate machine learning while still allowing straightforward statistical inference, thanks to theoretical development grounded in semi-parametric theory.
TMLE is a doubly robust method: it uses both an exposure model (i.e., a propensity score model) and an outcome model. As long as at least one of these models is correctly specified, TMLE gives a consistent estimator, meaning the estimate gets closer and closer to the true value as the sample size increases.
Since TMLE uses both the exposure and the outcome model, machine learning can be used in each of these intermediary modeling steps while allowing straightforward statistical inference.
It has been shown that TMLE outperforms singly robust methods that incorporate machine learning, such as inverse probability of treatment weighting (IPTW).
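To make the two modeling steps concrete, below is a minimal sketch of how a doubly robust fit can be expressed in R with the tmle and SuperLearner packages. This is an illustration under assumptions, not the full analysis developed later: it assumes the variable names loaded above (RHC.use as a 0/1 exposure, Death as a 0/1 outcome, baselinevars as covariates), a deliberately small SuperLearner library, and that the covariates are in a form SuperLearner can handle (factor variables may need to be converted to dummy variables first).

library(tmle)
library(SuperLearner)

W <- ObsData[, baselinevars]   # baseline covariates
A <- ObsData$RHC.use           # exposure: RHC use (0/1)
Y <- ObsData$Death             # outcome: death (0/1)

set.seed(123)
tmle.fit <- tmle(Y = Y, A = A, W = W,
                 family = "binomial",
                 # outcome (Q) and exposure (g) models can each be fit
                 # with machine learning via SuperLearner
                 Q.SL.library = c("SL.glm", "SL.mean"),
                 g.SL.library = c("SL.glm", "SL.mean"))

# Point estimate, 95% confidence interval, and p-value for the ATE
tmle.fit$estimates$ATE$psi
tmle.fit$estimates$ATE$CI
tmle.fit$estimates$ATE$pvalue

Because the final targeting step of TMLE is aimed at the parameter of interest, the standard error and confidence interval returned here are derived from semi-parametric theory (the influence curve), which is what allows machine learning in the intermediary models without giving up valid inference.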
Revisiting RHC Data
This tutorial uses the same data as some of the previous tutorials, including those working with a predictive question, machine learning with a continuous outcome, and machine learning with a binary outcome.
Table 1
Shown only for selected demographic and comorbidity variables; it matches Table 1 in Connors et al. (1996).
library(tableone)

# Baseline characteristics stratified by RHC use (selected variables only)
tab0 <- CreateTableOne(vars = c("age", "sex", "race",
                                "Disease.category", "Cancer"),
                       data = ObsData,
                       strata = "RHC.use",
                       test = FALSE)
print(tab0, showAllLevels = FALSE)
#> Stratified by RHC.use
#> 0 1
#> n 3551 2184
#> age (%)
#> [-Inf,50) 884 (24.9) 540 (24.7)
#> [50,60) 546 (15.4) 371 (17.0)
#> [60,70) 812 (22.9) 577 (26.4)
#> [70,80) 809 (22.8) 529 (24.2)
#> [80, Inf) 500 (14.1) 167 ( 7.6)
#> sex = Female (%) 1637 (46.1) 906 (41.5)
#> race (%)
#> white 2753 (77.5) 1707 (78.2)
#> black 585 (16.5) 335 (15.3)
#> other 213 ( 6.0) 142 ( 6.5)
#> Disease.category (%)
#> ARF 1581 (44.5) 909 (41.6)
#> CHF 247 ( 7.0) 209 ( 9.6)
#> Other 955 (26.9) 208 ( 9.5)
#> MOSF 768 (21.6) 858 (39.3)
#> Cancer (%)
#> None 2652 (74.7) 1727 (79.1)
#> Localized (Yes) 638 (18.0) 334 (15.3)
#> Metastatic 261 ( 7.4) 123 ( 5.6)