Chapter 9 PS vs. Double robust methods
Understanding sources of bias is important
Bias = Average of our estimates in repeated sampling - true treatment effect = bias due to finite sample + bias due to not designing the study properly + bias due to model misspecification
The past part is often less talked about. Nonparametric or machine learning methods could be an advantage if used properly to reduce bias due to this source. A topic for a different tutorial, but here we will show an example of the implication of avoiding the possibility of model-misspecification.
9.1 Complex Data Simulation
Complex simulation example, so that we know the true parameter \(\theta\).
\(Y\) : Outcome | (continuous) |
\(A\) : Exposure | (binary) |
\(C\) (1-4): true confounders | (continuous) |
\(L\) (1-4): measured and transformed version of true confounders | (continuous) |
True Exposure Model
\[\begin{equation} \begin{aligned} \pi = P(A=1|C) &= f(C) \\ &= \frac{1}{1+exp[\alpha_0 + \alpha_1 C_1 + \alpha_2 C_2 + \alpha_3 C_3 + \alpha_4 C_4 ]}\nonumber \end{aligned} \end{equation}\]
True Outcome Model
\[\begin{equation} \begin{aligned} E(Y|A,C) &= g(A,C)\\ &= \beta_0 + \theta A + \beta_1 C_1 + \beta_2 C_2 + \beta_3 C_3 + \beta_4 C_4 \nonumber \end{aligned} \end{equation}\]
In our example, \(\theta\) = 6, which is our treatment effect.
Outcomes and exposures are complex functions of measured covariates
Using example from Kang and Schafer (2007), Naimi, Mishler, and Kennedy (2017) and Balzer and Westling (2021), let us assume that we don’t observed the \(C\) variables directly, and we only have access to a set of transformed covariates \(L\)s.
\[\begin{equation} \begin{aligned} L_1 &= \exp(C1/2) \nonumber\\ L_2 &= [(C2/(1+\exp(C1)) + 10)]\nonumber\\ L_3 &= [(C1*C3/25 + 0.6)^3] \nonumber\\ L_4 &= [(C2 + C4 + 20)^2\nonumber \end{aligned} \end{equation}\]
<- function(n, te = 6){
create.data # original covariates that generates A and Y
<- rnorm(n,0,1)
C1 <- rnorm(n,0,1)
C2 <- rnorm(n,0,1)
C3 <- rnorm(n,0,1)
C4 <- plogis(-1 +log(1.75)*(C1+C2+C3+C4) )
pscore # treatment generation model
<- rbinom(n, 1, pscore)
A <- rnorm(n,0,6) # error term
eps <- function(A, TE=te) {120+TE*A+3*(C1+C2+C3+C4)+eps}
get.Y <- get.Y(A=1, TE=te)
Y1 <- get.Y(A=0, TE=te)
Y0 # potential outcomes
<- data.frame(cbind(pscore, Y1,Y0, C1, C2, C3, C4))
PO # outcome generation model
<- get.Y(A=A, TE=te)
Y # misspecified versions of covariates
<- exp(C1/2)
L1 <- C2/(1+exp(C1)) + 10
L2 <- (C1*C3/25 + 0.6)^3
L3 <- (C2 + C4 + 20)^2
L4 <- data.frame(cbind(L1,L2,L3,L4,A,Y))
obs return(list(observed=obs,
Potential.Outcome= PO,
ate = mean(PO$Y1)-mean(PO$Y0)))
}
create.data(6)
## $observed
## L1 L2 L3 L4 A Y
## 1 2.2374436 10.06754 0.1625330 419.8183 1 120.3914
## 2 1.7679300 10.02612 0.1537718 441.0633 1 113.7917
## 3 1.8338882 10.06432 0.2636953 489.8636 0 133.9537
## 4 1.3741692 10.04336 0.1926407 389.0893 0 124.1439
## 5 0.9565624 11.25774 0.2182527 556.0792 1 138.0037
## 6 0.8111285 10.77576 0.2057078 512.2704 0 127.6085
##
## $Potential.Outcome
## pscore Y1 Y0 C1 C2 C3 C4
## 1 0.4265123 120.3914 114.3914 1.6106679 0.4056800 -0.8423000 0.08378755
## 2 0.3564975 113.7917 107.7917 1.1396187 0.1077560 -1.4095507 0.89375158
## 3 0.7938446 139.9537 133.9537 1.2128768 0.2806335 0.8504563 1.85222880
## 4 0.2154750 130.1439 124.1439 0.6356986 0.1252420 -0.8832483 -0.39989653
## 5 0.6518198 138.0037 132.0037 -0.0888186 2.4085868 -0.5850746 1.17274430
## 6 0.6371594 133.6085 127.6085 -0.4186576 1.2861567 0.5783549 1.34723419
##
## $ate
## [1] 6
9.2 Understanding finite sample bias
First try:
set.seed(1)
<- create.data(100)
result.data round(coef(glm(Y ~ A + L1 + L2 + L3 + L4, family="gaussian",
data=result.data$observed)),2)
## (Intercept) A L1 L2 L3 L4
## 91.23 8.46 3.50 0.81 -19.83 0.05
Second try:
set.seed(22)
<- create.data(100)
result.data round(coef(glm(Y ~ A + L1 + L2 + L3 + L4, family="gaussian",
data=result.data$observed)),2)
## (Intercept) A L1 L2 L3 L4
## 70.05 10.20 4.48 2.29 9.53 0.05
Third try:
set.seed(33)
<- create.data(100)
result.data round(coef(glm(Y ~ A + L1 + L2 + L3 + L4, family="gaussian",
data=result.data$observed)),2)
## (Intercept) A L1 L2 L3 L4
## 97.51 7.13 2.32 -0.40 12.16 0.05
9.3 Estimation using different methods
We now work with a larger data: \(n = 10,000\)
set.seed(123)
<- create.data(10000) result.data
9.3.1 Regression
round(coef(glm(Y ~ A + L1 + L2 + L3 + L4, family="gaussian",
data=result.data$observed)),2)
## (Intercept) A L1 L2 L3 L4
## 88.23 7.70 4.40 -0.12 -3.13 0.07
9.3.2 Propensity score
Propensity score model fitting:
require(MatchIt)
<- matchit(A ~ L1 + L2 + L3 + L4, method = "nearest",
match.obj data = result.data$observed,
distance = 'logit',
caliper = 0.2,
replace = FALSE,
ratio = 1)
match.obj
## A matchit object
## - method: 1:1 nearest neighbor matching without replacement
## - distance: Propensity score [caliper]
## - estimated with logistic regression
## - caliper: <distance> (0.034)
## - number of obs.: 10000 (original), 5790 (matched)
## - target estimand: ATT
## - covariates: L1, L2, L3, L4
<- match.data(match.obj) matched.data
Results from step 4: crude
round(coef(glm(Y ~ A,
family="gaussian",
data=matched.data)),2)
## (Intercept) A
## 121.51 7.93
Results from step 4: double adjustment
round(coef(glm(Y ~ A + L1 + L2 + L3 + L4,
family="gaussian",
data=matched.data)),2)
## (Intercept) A L1 L2 L3 L4
## 89.56 7.66 4.11 -0.36 8.38 0.07
9.3.3 Double machine learning method
We are using DML with only one learner:
<- result.data$observed
res require(DoubleML)
<- DoubleMLData$new(res,
dd y_col = "Y",
d_cols = "A",
x_cols = c("L1","L2","L3","L4"))
library(mlr3)
library(mlr3learners)
::get_logger("mlr3")$set_threshold("warn")
lgr
= lrn("regr.ranger",
learner num.trees=500,
mtry=floor(sqrt(5)),
max.depth=5,
min.node.size=2)
= learner$clone() # outcome modelling
ml_g = learner$clone() # exposure modelling
ml_m
set.seed(1234)
<- DoubleMLPLR$new(dd, ml_g=ml_g, ml_m=ml_m)
dml.res $fit()
dml.resprint(dml.res)
## ================= DoubleMLPLR Object ==================
##
##
## ------------------ Data summary ------------------
## Outcome variable: Y
## Treatment variable(s): A
## Covariates: L1, L2, L3, L4
## Instrument(s):
## No. Observations: 10000
##
## ------------------ Score & algorithm ------------------
## Score function: partialling out
## DML algorithm: dml2
##
## ------------------ Machine learner ------------------
## ml_g: regr.ranger
## ml_m: regr.ranger
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: TRUE
##
## ------------------ Fit summary ------------------
## Estimates and significance testing of the effect of target variables
## Estimate. Std. Error t value Pr(>|t|)
## A 7.0114 0.1532 45.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
$summary() dml.res
## Estimates and significance testing of the effect of target variables
## Estimate. Std. Error t value Pr(>|t|)
## A 7.0114 0.1532 45.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9.3.4 Augmented Inverse probability weighting
=c('SL.glm', 'SL.step.interaction', 'SL.earth', 'SL.mean')
SL.librequire(AIPW)
require(ggplot2)
<- result.data$observed
res <- AIPW$new(Y=res$Y, A=res$A,
aipw_sl W=res[c("L1","L2","L3","L4")],
Q.SL.library=SL.lib,g.SL.library=SL.lib,
k_split=1,verbose=TRUE)
$fit()
aipw_sl$summary(g.bound = 0.025) aipw_sl
print(aipw_sl$result, digits = 2)
## Estimate SE 95% LCL 95% UCL N
## Risk of exposure 126.3 0.18 126.0 126.7 3055
## Risk of control 119.9 0.10 119.7 120.1 6945
## Risk Difference 6.4 0.19 6.1 6.8 10000
9.3.5 Double robust method (TMLE)
TMLE with superlearner
<- ltmle(data=result.data$observed, Anodes='A', Ynodes='Y', abar=list(1,0),
tmle.res SL.library=c('SL.glm', 'SL.step.interaction', 'SL.earth', 'SL.mean'),
estimate.time=F,
SL.cvControl=list(V=10),
gbounds=c(0.025, 0.975))
summary(tmle.res)
## Estimator: tmle
## Call:
## ltmle(data = result.data$observed, Anodes = "A", Ynodes = "Y",
## abar = list(1, 0), gbounds = c(0.025, 0.975), SL.library = c("SL.glm",
## "SL.step.interaction", "SL.earth", "SL.mean"), SL.cvControl = list(V = 10),
## estimate.time = F)
##
## Treatment Estimate:
## Parameter Estimate: 126.3
## Estimated Std Err: 0.15795
## p-value: <2e-16
## 95% Conf Interval: (125.99, 126.61)
##
## Control Estimate:
## Parameter Estimate: 119.92
## Estimated Std Err: 0.099995
## p-value: <2e-16
## 95% Conf Interval: (119.73, 120.12)
##
## Additive Treatment Effect:
## Parameter Estimate: 6.382
## Estimated Std Err: 0.16961
## p-value: <2e-16
## 95% Conf Interval: (6.0496, 6.7144)
You can also try with more candidate learners within super learner
# Time consuming
# tmle.res1 <- ltmle(data=result.data$observed, Anodes='A', Ynodes='Y', abar=list(1,0),
# SL.library=c('SL.glm', 'SL.step.interaction',
# 'SL.gam', 'SL.glmnet', 'SL.cforest',
# "SL.ranger", "SL.xgboost",
# "SL.rpart", "SL.rpartPrune",
# 'SL.earth', 'SL.mean'),
# estimate.time=F,
# SL.cvControl=list(V=10),
# gbounds=c(0.025, 0.975))
# summary(tmle.res1)