Chapter 4 Step 1: Exposure modelling

4.1 Model specification

Specify the propensity score (PS) model, i.e., the exposure model, and fit it to estimate the propensity scores:

\(A \sim L\)

baselinevars <- c("gender", "age", "race", "education", "married", "bmi")
ps.formula <- as.formula(paste("diabetes", "~", paste(baselinevars, collapse = "+")))
ps.formula
## diabetes ~ gender + age + race + education + married + bmi
# fit logistic regression to estimate propensity scores
PS.fit <- glm(ps.formula, family = "binomial", data = analytic)
require(jtools)
summ(PS.fit)
MODEL INFO:
Observations: 1562
Dependent Variable: diabetes
Type: Generalized linear model
  Family: binomial
  Link function: logit

MODEL FIT:
χ²(10) = 282.89
Pseudo-R² (Cragg-Uhler) = 0.26
Pseudo-R² (McFadden) = 0.18
AIC = 1349.94, BIC = 1408.83

                             Est.   S.E.   z val.      p
(Intercept)                 -8.38   0.58   -14.49   0.00
genderMale                   0.34   0.15     2.26   0.02
age                          0.06   0.01    11.26   0.00
raceHispanic                 0.15   0.23     0.64   0.52
raceOther                    0.76   0.23     3.25   0.00
raceWhite                   -0.23   0.18    -1.23   0.22
educationHigh.School         0.14   0.15     0.95   0.34
educationSchool              0.52   0.27     1.92   0.05
marriedNever.married        -0.04   0.25    -0.16   0.88
marriedPreviously.married   -0.02   0.16    -0.15   0.88
bmi                          0.10   0.01    10.14   0.00

Standard errors: MLE
  • The coefficients of the PS model fit are not themselves of interest
  • The model can be rich, to the extent that it improves prediction of the exposure
  • But watch for multi-collinearity issues
    • Are any coefficient SEs unusually high? (see the VIF check below)
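A quick way to probe this is a variance-inflation-factor check on the fitted exposure model. A minimal sketch, assuming the car package is available (the cutoffs in the comment are common rules of thumb, not hard rules):
# generalized VIFs for the fitted PS model; values well above
# roughly 5-10 are often read as a sign of problematic collinearity
require(car)
vif(PS.fit)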

4.1.1 Updating model specification

  • What other model specifications are possible?

4.1.1.1 Interactions

  • Common terms to add (ideally based on biological plausibility; this requires subject-area knowledge)
# Interactions
ps.formula2 <- as.formula(paste("diabetes", "~", 
                                paste(baselinevars, collapse = "+"), 
                                "+ education:bmi + gender:age"))
ps.formula2
## diabetes ~ gender + age + race + education + married + bmi + 
##     education:bmi + gender:age

4.1.1.2 Polynomial terms

# polynomials, wrapped in I() so that ^ is treated as
# arithmetic rather than as a formula operator
# (for splines, see the sketch after this chunk)
ps.formula3 <- as.formula(paste("diabetes", "~", 
                                paste(baselinevars, collapse = "+"), 
                                "+ I(age^2) + I(age^3)"))
ps.formula3
## diabetes ~ gender + age + race + education + married + bmi + 
##     I(age^2) + I(age^3)
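A spline version of the same idea, sketched with ns() from the splines package (the choice of 3 degrees of freedom here is arbitrary):
# replace the linear age term with a natural cubic spline
require(splines)
ps.formula3b <- as.formula(paste("diabetes", "~", 
                                 paste(baselinevars, collapse = "+"), 
                                 "- age + ns(age, df = 3)"))
ps.formula3b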

4.1.1.3 More complex functions

# transformations
ps.formula4 <- as.formula(paste("diabetes", "~", 
                                paste(baselinevars, collapse = "+"), 
                                "+ log(age)"))
ps.formula4
## diabetes ~ gender + age + race + education + married + bmi + 
##     log(age)

4.1.2 Stability of PS

  • How many variables in the PS model are too many?
    • Depends on the sample size
      • Too many variables (and too many interactions and polynomials) means too many parameters \(p\) to estimate
      • If a large dataset is available, this might not be a problem
    • Again, look at stability: the SEs of the exposure-model coefficients (see the check below)
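One way to operationalize this check, as a sketch: refit the richer interaction specification from above and compare coefficient SEs; markedly inflated SEs in the richer model signal instability.
# compare coefficient SEs under the simple and richer specifications
PS.fit2 <- glm(ps.formula2, family = "binomial", data = analytic)
round(summary(PS.fit)$coefficients[, "Std. Error"], 2)
round(summary(PS.fit2)$coefficients[, "Std. Error"], 2)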

4.2 Variables to adjust

4.2.1 Best approach

  • Subject-area expertise
  • Knowledge from the literature
  • Try drawing a causal diagram to determine which variables to include (see the sketch below)
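A minimal sketch of the causal-diagram approach, assuming the dagitty package and a deliberately simplified diagram in which \(L\) is a common cause of \(A\) and \(Y\):
# toy causal diagram: L confounds the A-Y relationship
require(dagitty)
g <- dagitty("dag { L -> A ; L -> Y ; A -> Y }")
# which variables must be adjusted for to identify the total effect?
adjustmentSets(g, exposure = "A", outcome = "Y", effect = "total")
## { L }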

4.2.2 General guideline of type of variables

See Brookhart et al. (2006) for a guideline (one not based on empirical associations in the same data)

  • Observed covariates are used to fix the design
  • Covariates should be selected based on subject-area expertise, not on empirical correlation analysis:
    • variables known to be confounders (causes of both \(Y\) and \(A\))
    • variables known to be causes of the outcome (risk factors of \(Y\))

4.2.3 What NOT to include

  • Two types to avoid
    • known instruments and noise variables: including them inflates the SEs
    • mediating factors: these should be avoided when the total effect is the goal

4.2.4 Mediators

  • Why not incorporate a mediator variable in the PS analysis?
    • One example of a mediator variable in our analysis could be ‘physical exercise’.
    • In the current framework, we do not include mediator variables because we are primarily interested in the ‘total effect’, not any decomposition of it.

4.2.5 Unmeasured confounding

  • What if an important variable is unmeasured / not available in the data?
    • Find a proxy variable that may be associated with the unmeasured variable’s concept.
    • Sensitivity analyses exist to assess the impact of the unmeasured variable on the analysis; one example follows this list.
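One such sensitivity analysis is the E-value. A sketch assuming the EValue package and a purely hypothetical odds ratio of 1.8 (95% CI: 1.2, 2.7) from the analysis:
# E-value: how strongly would an unmeasured confounder need to be
# associated with both exposure and outcome to explain away the result?
require(EValue)
evalues.OR(est = 1.8, lo = 1.2, hi = 2.7, rare = FALSE)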

4.3 Model selection

Not encouraged, but popularly done!

  • not encouraged, as this relies on empirical associations
    • creates a post-selection inference problem
  • There is debate about this (the ideal vs. pragmatism)
    • see Karim, Pang, and Platt (2018) for an example.

4.3.1 Based on association with outcome

  • Selecting covariates based only on their association with the outcome (\(Y\)) in your data is not encouraged
    • the separation between outcome and exposure modelling is broken!
  • Usually done in a situation when
    • we are not sure whether a variable should be included in the PS model
    • there is no clear indication in the literature or from subject-area knowledge
    • we are unsure whether the variable is a confounder, a risk factor, or noise
  • We show here an example that can be considered a middle ground:
    • keep known confounders + risk factors of the outcome
    • use variable selection only on the variables about which we are unsure
# Assuming that you are not sure whether the education and married 
# variables should be included in the PS analysis,
# try outcome modelling as follows:
formula.full <- as.formula(paste("cholesterol", "~", 
                                 "gender + age + race + education + married + bmi"))
fit.0 <- glm(formula.full,
             family = binomial, data = analytic)

scope <- list(upper = ~ gender + age + race + education + married + bmi,
              # upper includes all variables (known + unsure)
              lower = ~ gender + age + race + bmi)
              # lower includes only known confounders + risk factors of outcome
fitstep <- step(fit.0, scope = scope, trace = FALSE,
                k = 2, direction = "backward")
              # k = 2 corresponds to AIC-based selection
formula(fitstep)
## cholesterol ~ gender + age + race + bmi
# If education and/or married survive this stepwise
# selection, consider adding them to the PS model.
# If not, discard them from the PS model.
formula.chosen <- as.formula(paste("diabetes", "~", 
                                   "gender + age + race + bmi"))
formula.chosen
## diabetes ~ gender + age + race + bmi
# We, however, did not use this approach below.
  • Stepwise selection (p-value or criterion based) is not recommended
    • depending on the sample size, different variables can get selected

4.3.2 Based on association with exposure

  • Selecting covariates based on their association with the exposure (\(A\)) in your data can be the worst option!
    • may attract instruments
    • strongly discouraged!
    • Below is an example of what NOT to do.
# Assume again that you are not sure whether the education and married 
# variables should be included in the PS analysis,
# and try exposure modelling as follows:
formula.full.e <- as.formula(paste("diabetes", "~", 
                                   "gender + age + race + education + married + bmi"))
fit.1 <- glm(formula.full.e,
             family = binomial, data = analytic)

scope <- list(upper = ~ gender + age + race + education + married + bmi,
              lower = ~ gender + age + race + bmi)
fitstep.e <- step(fit.1, scope = scope, trace = FALSE,
                  k = 2, direction = "backward")
formula(fitstep.e)
## diabetes ~ gender + age + race + bmi
# This is the PS model chosen by this approach.
# We, however, did not use this approach below.

4.4 Alternative modelling strategies

  • Machine learning alternatives can be used instead of logistic regression (see the lasso sketch below):
    • tree-based methods have a better ability to detect non-linearity / non-additivity (the model-specification aspect)
    • shrinkage methods (lasso / elastic net) may better deal with multi-collinearity
    • ensemble learners / super learners have been used successfully
    • shallow/deep learning!
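As an illustration, a shrinkage-based PS fit might look like the following sketch (assumes the glmnet package; lasso penalty, with lambda chosen by cross-validation):
# lasso logistic regression for the propensity score
require(glmnet)
X <- model.matrix(ps.formula, data = analytic)[, -1]  # drop the intercept column
cv.fit <- cv.glmnet(X, analytic$diabetes, family = "binomial", alpha = 1)
# alpha = 1 requests the lasso penalty
analytic$PS.lasso <- as.numeric(predict(cv.fit, newx = X,
                                        s = "lambda.min", type = "response"))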

4.5 PS estimation

The true PS is unknown and needs to be estimated from the fitted exposure model:

# extract estimated propensity scores from the fit
analytic$PS <- predict(PS.fit, newdata = analytic, type = "response")
require(cobalt)
bal.plot(analytic, var.name = "PS", 
         treat = "diabetes", 
         which = "both", 
         data = analytic)
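Numeric summaries of the estimated PS by exposure group are also worth a look (values piling up near 0 or 1, or little overlap between the groups, are warning signs):
# summarize the estimated PS within each exposure group
by(analytic$PS, analytic$diabetes, summary)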

Don’t lose sight of the fact that better balance is the ultimate goal of the propensity score.
Prediction of \(A\) is just a means to that end (since the true PS is unknown).
Optimizing prediction alone may attract variables highly associated with \(A\) (e.g., instruments).

References

Brookhart, M Alan, Sebastian Schneeweiss, Kenneth J Rothman, Robert J Glynn, Jerry Avorn, and Til Stürmer. 2006. “Variable Selection for Propensity Score Models.” American Journal of Epidemiology 163 (12): 1149–56.
Karim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.