Chapter 2 G-computation

2.1 Closer look at the data

# Read the data saved at the last chapter
ObsData <- readRDS(file = "data/rhcAnalytic.RDS")
dim(ObsData)
## [1] 5735   51

In this dataset, we have

  • 5,735 subjects,
  • 1 outcome variable (\(Y\) = length of stay),
  • 1 exposure variable (\(A\) = RHC status), and
  • 49 covariates.

2.1.1 View data from 6 participants

Let’s focus on only first 6 columns, with only 3 variables.

small.data <- ObsData[1:6,c("sex","A","Y")]
kable(small.data)
sex A Y
Male 0 9
Female 1 45
Female 1 60
Female 0 37
Male 1 2
Female 0 7

2.1.2 New notations

Notations Example in RHC study
\(A\): Exposure status RHC
\(Y\): Observed outcome length of stay
\(Y(A=1)\) = potential outcome when exposed length of stay when RHC used
\(Y(A=0)\) = potential outcome when not exposed length of stay when RHC not used
\(L\): covariates \(49\) covariates

For explaining the concepts in this chapter, we will convert our data representation

  • from
Covariate Exposure Observed outcome
\(L\) \(A\) \(Y\)
sex RHC length of stay
  • to the following representation:
Covariate Exposure Outcome under exposed Outcome under unexposed
\(L\) \(A\) \(Y(A=1)\) \(Y(A=0)\)
sex RHC length of stay under RHC length of stay under no RHC

2.1.3 Restructure the data to estimate treatment effect

In causal inference literature, often the data is structured in such a way that the outcomes \(Y\) under different treatments \(A\) are in different columns. What we are doing here is we are distinguishing \(Y(A=1)\) from \(Y(A=0)\).

small.data$id <- c("John","Emma","Isabella","Sophia","Luke", "Mia")
small.data$Y1 <- ifelse(small.data$A==1, small.data$Y, NA)
small.data$Y0 <- ifelse(small.data$A==0, small.data$Y, NA)
small.data$TE <- small.data$Y1 - small.data$Y0
small.data <- small.data[c("id", "sex","A","Y1","Y0", "TE")]
small.data$Y <- NULL
small.data$sex <- as.character(small.data$sex)
m.Y1 <- mean(small.data$Y1, na.rm = TRUE)
m.Y0 <- mean(small.data$Y0, na.rm = TRUE)
mean.values <- round(c(NA,NA, NA, m.Y1, m.Y0,
                 m.Y1 - m.Y0),0)
small.data2 <- rbind(small.data, mean.values)
kable(small.data2, booktabs = TRUE, digits=1,
             col.names = c("Subject ID","Sex",
                           "RHC status (A)", 
                           "Y when A=1 (RHC)", 
                           "Y when A=0 (no RHC)", 
                           "Treatment Effect"))%>%
  row_spec(7, bold = TRUE, color = "white", 
           background = "#D7261E")
Subject ID Sex RHC status (A) Y when A=1 (RHC) Y when A=0 (no RHC) Treatment Effect
John Male 0 9
Emma Female 1 45
Isabella Female 1 60
Sophia Female 0 37
Luke Male 1 2
Mia Female 0 7
36 18 18

Then it is easy to see

  • the mean outcome under treated group (RHC)
  • the mean outcome under untreated group (no RHC)

and the difference between these two means is the treatment effect.

2.1.4 Treat the problem as a missing value problem

Restructure the problem as a missing data problem.

Instead of just estimating treatment effect on an average level, an alternate could be to

  • impute mean outcomes for the treated subjects
  • impute mean outcomes for the untreated subjects
  • Calculate individual treatment effect estimate
  • then calculate the average treatment effect
small.data0 <- small.data
small.data$Y1[is.na(small.data$Y1)] <- round(m.Y1)
small.data$Y0[is.na(small.data$Y0)] <- round(m.Y0)
small.data$TE <- small.data$Y1 - small.data$Y0
m.Y1 <- mean(small.data$Y1)
m.Y1
## [1] 35.83333
m.Y0 <- mean(small.data$Y0)
m.Y0
## [1] 17.83333
m.TE <- mean(small.data$TE)
mean.values <- round(c(NA,NA, NA, m.Y1, m.Y0, m.TE),0)
small.data2 <- rbind(small.data, mean.values)
small.data2$Y1[1:6] <- cell_spec(small.data2$Y1[1:6], 
                            color = ifelse(small.data2$Y1[1:6] 
                                           == round(m.Y1), 
                                           "red", "black"),
                            background = ifelse(small.data2$Y1[1:6] 
                                           == round(m.Y1), 
                                           "yellow", "white"),
                            bold = ifelse(small.data2$Y1[1:6] 
                                           == round(m.Y1), 
                                           TRUE, FALSE))
small.data2$Y0[1:6] <- cell_spec(small.data2$Y0[1:6], 
                            color = ifelse(small.data2$Y0[1:6] 
                                           == round(m.Y0), 
                                           "red", "black"),
                            background = ifelse(small.data2$Y0[1:6] 
                                           == round(m.Y0), 
                                           "yellow", "white"),
                            bold = ifelse(small.data2$Y0[1:6] 
                                           == round(m.Y0), 
                                           TRUE, FALSE))
kable(small.data2, booktabs = TRUE, 
      digits=1, escape = FALSE,
      col.names = c("Subject ID","Sex",
                           "RHC status (A)", 
                           "Y when A=1 (RHC)", 
                           "Y when A=0 (no RHC)", 
                           "Treatment Effect"))%>%
  row_spec(7, bold = TRUE, color = "white", 
           background = "#D7261E")
Subject ID Sex RHC status (A) Y when A=1 (RHC) Y when A=0 (no RHC) Treatment Effect
John Male 0 36 9 27
Emma Female 1 45 18 27
Isabella Female 1 60 18 42
Sophia Female 0 36 37 -1
Luke Male 1 2 18 -16
Mia Female 0 36 7 29
36 18 18

2.1.5 Impute better value?

Assume that sex variable is acting as a confounder. Then, it might make more sense to restrict imputing outcome values specific to male and female participants.

  • impute means specific to males for male subjects, and separately
  • impute means specific to females for female subjects.
small.data <- small.data0
m.Y1m <- mean(small.data$Y1[small.data$sex == "Male"], na.rm = TRUE)
m.Y1m
## [1] 2
m.Y1f <- mean(small.data$Y1[small.data$sex == "Female"], na.rm = TRUE)
m.Y1f
## [1] 52.5
m.Y0m <- mean(small.data$Y0[small.data$sex == "Male"], na.rm = TRUE)
m.Y0m
## [1] 9
m.Y0f <- mean(small.data$Y0[small.data$sex == "Female"], na.rm = TRUE)
m.Y0f
## [1] 22
m.TE.m <- m.Y1m-m.Y0m
m.TE.f <- m.Y1f-m.Y0f
mean.values.m <- c(NA,"Mean for males", NA, round(c(m.Y1m, m.Y0m, m.TE.m),1))
mean.values.f <- c(NA,"Mean for females", NA, round(c(m.Y1f, m.Y0f, m.TE.f),1))
small.data$Y1[small.data$sex == 
                "Male"][is.na(small.data$Y1[small.data$sex == 
                                              "Male"])] <- round(m.Y1m,1)
small.data$Y0[small.data$sex == 
                "Male"][is.na(small.data$Y0[small.data$sex == 
                                              "Male"])] <- round(m.Y0m,1)
small.data$Y1[small.data$sex == 
                "Female"][is.na(small.data$Y1[small.data$sex == 
                                                "Female"])] <- round(m.Y1f,1)
small.data$Y0[small.data$sex == 
                "Female"][is.na(small.data$Y0[small.data$sex == 
                                                "Female"])] <- round(m.Y0f,1)
small.data$TE <- small.data$Y1 - small.data$Y0
small.data2 <- rbind(small.data, mean.values.m,mean.values.f)
small.data2$Y1[1] <- cell_spec(round(m.Y1m,1), bold = TRUE,
                            color = "red", background = "yellow")
small.data2$Y0[5] <- cell_spec(round(m.Y0m,1), bold = TRUE,
                            color = "red", background = "yellow")
small.data2$Y1[c(4,6)] <- cell_spec(round(m.Y1f,1), bold = TRUE,
                            color = "blue", background = "yellow")
small.data2$Y0[c(2,3)] <- cell_spec(round(m.Y0f,1), bold = TRUE,
                            color = "blue", background = "yellow")
kable(small.data2, booktabs = TRUE, 
      digits=1, escape = FALSE, 
      col.names = c("Subject ID","Sex","RHC status (A)", 
                           "Y when A=1 (RHC)", "Y when A=0 (no RHC)", 
                           "Treatment Effect"))%>%
  row_spec(7, bold = TRUE, color = "white", background = "red")%>%
  row_spec(8, bold = TRUE, color = "white", background = "blue")
Subject ID Sex RHC status (A) Y when A=1 (RHC) Y when A=0 (no RHC) Treatment Effect
John Male 0 2 9 -7
Emma Female 1 45 22 23
Isabella Female 1 60 22 38
Sophia Female 0 52.5 37 15.5
Luke Male 1 2 9 -7
Mia Female 0 52.5 7 45.5
Mean for males 2 9 -7
Mean for females 52.5 22 30.5
  • Extending the problem to other covariates, you can see that we could condition on rest of the covariates (such as age, income, race, disease category) to get better imputation values.

Regression is a generalized method to take mean conditional on many covariates.

2.2 Use Regression for predicting outcome

Let us fit the outcome with all covariates, including the exposure status.

# isolate the names of baseline covariates
baselinevars <- names(dplyr::select(ObsData, !c(A,Y)))
# adjust the exposure variable (primary interest) + covariates
out.formula <- as.formula(paste("Y~ A +", 
                               paste(baselinevars, 
                                     collapse = "+")))
fit1 <- lm(out.formula, data = ObsData)
coef(fit1)
##                          (Intercept)                                    A 
##                        -7.680847e+01                         2.902030e+00 
##                  Disease.categoryCHF                Disease.categoryOther 
##                        -5.594331e+00                        -4.421893e+00 
##                 Disease.categoryMOSF                CancerLocalized (Yes) 
##                         2.873451e+00                        -7.794459e+00 
##                     CancerMetastatic                      Cardiovascular1 
##                        -1.056549e+01                         6.605038e-01 
##                       Congestive.HF1                            Dementia1 
##                        -1.754818e+00                        -1.261136e+00 
##                         Psychiatric1                           Pulmonary1 
##                        -4.841489e-01                         2.063282e+00 
##                               Renal1                             Hepatic1 
##                        -6.935923e+00                        -1.523238e+00 
##                            GI.Bleed1                               Tumor1 
##                        -5.096253e+00                         4.573818e+00 
##                   Immunosupperssion1                         Transfer.hx1 
##                         1.103694e-01                         1.161342e+00 
##                                  MI1                           age[50,60) 
##                        -1.650935e+00                         1.429833e-01 
##                           age[60,70)                           age[70,80) 
##                        -4.055267e-01                        -1.103439e+00 
##                         age[80, Inf)                            sexFemale 
##                        -2.757278e+00                         8.272236e-01 
##                                  edu                             DASIndex 
##                         4.775891e-02                        -5.343588e-02 
##                         APACHE.score                   Glasgow.Coma.Score 
##                        -7.020692e-02                         1.563055e-02 
##                       blood.pressure                                  WBC 
##                        -1.323182e-02                         3.940879e-02 
##                           Heart.rate                     Respiratory.rate 
##                         2.244431e-02                        -1.467861e-03 
##                          Temperature                          PaO2vs.FIO2 
##                         5.086475e-01                        -8.517735e-03 
##                              Albumin                           Hematocrit 
##                        -2.570965e+00                        -1.951544e-01 
##                            Bilirubin                           Creatinine 
##                        -9.814574e-02                         5.210509e-01 
##                               Sodium                            Potassium 
##                         1.365534e-01                         3.447162e-01 
##                                PaCo2                                   PH 
##                         1.165866e-01                         1.005261e+01 
##                               Weight                        DNR.statusYes 
##                         2.257116e-04                        -7.959037e+00 
##            Medical.insuranceMedicare Medical.insuranceMedicare & Medicaid 
##                        -5.174593e-01                        -2.422199e+00 
##        Medical.insuranceNo insurance             Medical.insurancePrivate 
##                        -1.785085e+00                        -2.086480e+00 
##  Medical.insurancePrivate & Medicare                  Respiratory.DiagYes 
##                        -2.018369e+00                         3.404743e-01 
##               Cardiovascular.DiagYes                 Neurological.DiagYes 
##                         3.784972e-01                         3.541516e+00 
##             Gastrointestinal.DiagYes                        Renal.DiagYes 
##                         2.551541e+00                         1.784893e+00 
##                    Metabolic.DiagYes                  Hematologic.DiagYes 
##                        -1.161415e+00                        -3.858024e+00 
##                       Sepsis.DiagYes                       Trauma.DiagYes 
##                         2.716148e-03                         1.112049e+00 
##                   Orthopedic.DiagYes                            raceblack 
##                         3.543464e+00                        -1.149936e+00 
##                            raceother                       income$25-$50k 
##                         2.467487e-01                         2.459547e+00 
##                         income> $50k                     incomeUnder $11k 
##                         4.214815e-01                        -4.284414e-01

2.2.1 Predict outcome for treated

  • Using the regression fit, we can obtain predicted outcome values for the treated.
  • We are not only predicting for the unobserved, but also for the observed values when a person was treated.
ObsData$Pred.Y1 <- predict(fit1, 
                           newdata = data.frame(A = 1, 
                                                dplyr::select(ObsData, !A)), 
                           type = "response")

Mean predicted outcome for treated

mean(ObsData$Pred.Y1)
## [1] 23.35625
hist(ObsData$Pred.Y1, 
     main = "Histogram for predicted outcome for treated", 
     xlab = "Y(A=1)")
abline(v=mean(ObsData$Pred.Y1),col="blue", lwd = 4)

2.2.2 Look at the predicted outcome data for treated

small.data1 <- ObsData[1:6,c("A","Pred.Y1")]
small.data1$id <- c("John","Emma","Isabella","Sophia","Luke", "Mia")
small.data1 <- small.data1[c("id", "A","Pred.Y1")]
kable(small.data1, booktabs = TRUE, digits=1,
             col.names = c("id","RHC status (A)", 
                           "Y.hat when A=1 (RHC)"))
id RHC status (A) Y.hat when A=1 (RHC)
John 0 17.5
Emma 1 28.7
Isabella 1 24.6
Sophia 0 21.6
Luke 1 13.6
Mia 0 25.5

2.2.3 Predict outcome for untreated

ObsData$Pred.Y0 <- predict(fit1, 
                           newdata = data.frame(A = 0, 
                                                dplyr::select(ObsData, !A)), 
                           type = "response")

Mean predicted outcome for untreated

mean(ObsData$Pred.Y0)
## [1] 20.45422
hist(ObsData$Pred.Y0, 
     main = "Histogram for predicted outcome for untreated", 
     xlab = "Y(A=0)")
abline(v=mean(ObsData$Pred.Y0),col="blue", lwd = 4)

2.2.4 Look at the predicted outcome data for untreated

small.data0 <- ObsData[1:6,c("A","Pred.Y0")]
small.data0$id <- c("John","Emma","Isabella","Sophia","Luke", "Mia")
small.data0 <- small.data0[c("id", "A","Pred.Y0")]
kable(small.data0, booktabs = TRUE, digits=1,
             col.names = c("id","RHC status (A)", 
                           "Y.hat when A=0 (no RHC)"))
id RHC status (A) Y.hat when A=0 (no RHC)
John 0 14.6
Emma 1 25.8
Isabella 1 21.7
Sophia 0 18.7
Luke 1 10.7
Mia 0 22.6

2.2.5 Look at the predicted outcome data for all!

small.data01 <- small.data1
small.data01$Pred.Y0 <- small.data0$Pred.Y0
small.data01$Pred.TE <- small.data01$Pred.Y1 - small.data01$Pred.Y0
m.Y1 <- mean(small.data01$Pred.Y1)
m.Y0 <- mean(small.data01$Pred.Y0)
mean.values <- round(c(NA,NA, m.Y1, m.Y0, m.Y1 -m.Y0),1)
small.data2 <- rbind(small.data01, mean.values) 
kable(small.data2, booktabs = TRUE, digits=1,
             col.names = c("id","RHC status (A)",
                           "Y.hat when A=1 (RHC)",
                           "Y.hat when A=0 (no RHC)",
                           "Treatment Effect"))%>%
  row_spec(7, bold = TRUE, color = "white", background = "#D7261E")
id RHC status (A) Y.hat when A=1 (RHC) Y.hat when A=0 (no RHC) Treatment Effect
John 0 17.5 14.6 2.9
Emma 1 28.7 25.8 2.9
Isabella 1 24.6 21.7 2.9
Sophia 0 21.6 18.7 2.9
Luke 1 13.6 10.7 2.9
Mia 0 25.5 22.6 2.9
21.9 19.0 2.9

From this table, it is easy to calculate treatment effect estimate.

The process we just went through, is a version of parametric G-computation!

2.3 Parametric G-computation

Defining treatment effect in terms of potential outcomes and observations

Figure 2.1: Defining treatment effect in terms of potential outcomes and observations

2.3.1 Steps

Step 1 Fit the outcome regression on the exposure and covariates: \(Y \sim A + L\)
Step 2 Extract outcome prediction for treated \(\hat{Y}_{A=1}\) by setting all \(A=1\)
Step 3 Extract outcome prediction for untreated \(\hat{Y}_{A=0}\) by setting all \(A=0\)
Step 4 Subtract the mean of these two outcome predictions to get treatment effect estimate: \(TE = E(\hat{Y}_{A=1}) - E(\hat{Y}_{A=0})\)

2.3.1.1 Step 1

Fit the outcome regression on the exposure and covariates: \(Y \sim A + L\)

out.formula <- as.formula(paste("Y~ A +",
                               paste(baselinevars,
                                     collapse = "+")))
fit1 <- lm(out.formula, data = ObsData)

\(Q(A,L)\) is often used to represent the predictions from the G-comp model.

2.3.1.2 Step 2

Extract outcome prediction for treated \(\hat{Y}_{A=1}\) by setting all \(A=1\)|

ObsData$Pred.Y1 <- predict(fit1, 
                           newdata = data.frame(A = 1, 
                                                dplyr::select(ObsData, !A)), 
                           type = "response")

2.3.1.3 Step 3

Extract outcome prediction for untreated \(\hat{Y}_{A=0}\) by setting all \(A=0\)

ObsData$Pred.Y0 <- predict(fit1, 
                           newdata = data.frame(A = 0, 
                                                dplyr::select(ObsData, !A)), 
                           type = "response")

2.3.1.4 Step 4

Subtract the mean of these two outcome predictions to get treatment effect estimate: \(TE = E(\hat{Y}_{A=1}) - E(\hat{Y}_{A=0})\)

ObsData$Pred.TE <- ObsData$Pred.Y1 - ObsData$Pred.Y0  

2.3.2 Treatment effect estimate

Mean value of predicted treatment effect

TE <- mean(ObsData$Pred.TE)
TE
## [1] 2.90203

SD of treatment effect

sd(ObsData$Pred.TE)
## [1] 5.132733e-15
hist(ObsData$Pred.TE, 
     main = "Histogram for predicted treatment effect", 
     xlab = "Y(A=1) - Y(A=0)")
## Warning in plot.window(xlim, ylim, "", ...): relative range of values ( 93 *
## EPS) is small (axis 1)
abline(v=mean(ObsData$Pred.TE),col="blue", lwd = 4)

This shows that the SD estimate is useless from g-computation method directly.

2.4 Estimating the confidence intervals

We already have an idea about the point estimate of the treatment effect:

mean(ObsData$Pred.TE)
## [1] 2.90203

For confidence interval estimates for G-computation, bootstrap would be necessary.

In the following example, we use \(R = 250\).

require(boot)
gcomp.boot <- function(formula = out.formula, data = ObsData, indices) {
  boot_sample <- data[indices, ]
  fit.boot <- lm(formula, data = boot_sample)
  Pred.Y1 <- predict(fit.boot, 
                     newdata = data.frame(A = 1, 
                                          dplyr::select(boot_sample, !A)), 
                           type = "response")
  Pred.Y0 <- predict(fit.boot, 
                     newdata = data.frame(A = 0, 
                                          dplyr::select(boot_sample, !A)), 
                           type = "response")
  Pred.TE <- mean(Pred.Y1) - mean(Pred.Y0)
  return(Pred.TE)
}
set.seed(123)
gcomp.res <- boot(data=ObsData, 
                  statistic=gcomp.boot,
                  R=250, 
                  formula=out.formula)

Below we show the resulting estimates from \(R\) bootstrap samples.

plot(gcomp.res) 

Below are two versions of confidence interval.

  • One is based on normality assumption: point estimate - and + with 1.96 multiplied by SD estimate
  • Another is based on percentiles
CI1 <- boot.ci(gcomp.res, type="norm") 
CI1
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = gcomp.res, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   ( 1.315,  4.444 )  
## Calculations and Intervals on Original Scale
CI2 <- boot.ci(gcomp.res, type="perc")
CI2
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = gcomp.res, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 1.515,  4.589 )  
## Calculations and Intervals on Original Scale
## Some percentile intervals may be unstable
saveRDS(TE, file = "data/gcomp.RDS")
saveRDS(CI2, file = "data/gcompci.RDS")