PSM in BMI-diabetes

Propensity analysis problem

  • Perform a propensity score matching (pair matching, without replacement, using SMD cut-point 0.2, may adjust for imbalanced and/or all covariates in the outcome model, if any) analysis as per the following recommendations
    • Zanutto (2006)
    • DuGoff, Schuler, and Stuart (2014)
    • Austin, Jembere, and Chiu (2018)

See explained in the previous chapter on PSM in OA-CVD (US), there are four steps of the propensity score matching:

  • Step 1: PS model specification
  • Step 2: Matching based on the estimated propensity scores
  • Step 3: Balance checking
  • Step 4: Outcome modelling

Only Step 1 is different for Zanutto (2006), DuGoff et al. (2014), and Austin et al. (2018) approaches.

Dataset

  • The following modified NHANES dataset
    • NHANES15lab5.RData
  • use the data set “analytic.with.miss” within this file.
    • for obtaining the final treatment effect estimates, you can omit missing values, but only after creating the design (e.g., subset the design, not the data itself directly).

Variables

  • Outcome: diabetes
    • ‘No’ as the reference category
  • Exposure: bmi
    • convert to binary with >25 vs. <= 25,
    • with > 25 as the reference category
  • Confounder list:
    • gender
    • age
      • assume continuous
    • race
    • income
    • education
    • married
    • cholesterol
    • diastolicBP
    • systolicBP
  • Mediator:
    • physical.work
      • ‘No’ as the reference category
  • Survey features
    • psu
    • strata
    • weight

Pre-processing

Load data

load(file="Data/propensityscore/NHANES15lab5.RData")

Variable summary

# Exposure
analytic.with.miss$bmi <- with(analytic.with.miss, 
                                   ifelse(bmi>25, "Overweight", 
                                          ifelse(bmi<=25, "Not overweight", NA)))
analytic.with.miss$bmi <- as.factor(analytic.with.miss$bmi)
analytic.with.miss$bmi <- relevel(analytic.with.miss$bmi, ref = "Overweight")

# Drop unnecessary variables 
analytic.with.miss$born <- NULL
analytic.with.miss$physical.work <- NULL

# Rename the weight variable into interview.weight
names(analytic.with.miss)[names(analytic.with.miss) == "weight"] <- "interview.weight"

Reproducibility

set.seed(504)

Approach by Zanutto (2006)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula <- as.formula(I(bmi=="Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

Step 2

require(MatchIt)
# Complete data for matching
analytic.data <- analytic.with.miss[complete.cases(analytic.with.miss),]

# Create a missing variable in the original dataset
analytic.with.miss$miss <- 1
analytic.with.miss$miss[analytic.with.miss$ID %in% analytic.data$ID] <- 0

# Propensity scores
ps.fit <- glm(ps.formula, data = analytic.data, family = binomial("logit"))
analytic.data$PS <- predict(ps.fit, type = "response", newdata = analytic.data)

# Caliper fixing to 0.2*sd(logit of PS)
caliper <- 0.2*sd(log(analytic.data$PS/(1-analytic.data$PS)))

# Match exposed and unexposed subjects 
match.obj <- matchit(ps.formula, data = analytic.data,
                     distance = analytic.data$PS, 
                     method = "nearest", 
                     replace = FALSE,
                     caliper = caliper, 
                     ratio = 1)
analytic.data$PS <- match.obj$distance

# Extract matched data
matched.data <- match.data(match.obj) 

Step 3

# Balance checking
cov <- c("gender", "age", "race", "income", 
         "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

tab1m <- CreateTableOne(strata = "bmi", vars = cov, data = matched.data, test = F)
print(tab1m, smd = TRUE)
#>                          Stratified by bmi
#>                           Overweight     Not overweight SMD   
#>   n                         2047           2047               
#>   gender = Male (%)          987 (48.2)     993 (48.5)   0.006
#>   age (mean (SD))          46.96 (17.55)  46.35 (17.64)  0.035
#>   race (%)                                               0.041
#>      Black                   452 (22.1)     424 (20.7)        
#>      Hispanic                558 (27.3)     584 (28.5)        
#>      Other                   328 (16.0)     338 (16.5)        
#>      White                   709 (34.6)     701 (34.2)        
#>   income (%)                                             0.042
#>      <25k                    537 (26.2)     509 (24.9)        
#>      Between.25kto54k        670 (32.7)     661 (32.3)        
#>      Between.55kto99k        472 (23.1)     484 (23.6)        
#>      Over100k                368 (18.0)     393 (19.2)        
#>   education (%)                                          0.023
#>      College                1199 (58.6)    1202 (58.7)        
#>      High.School             667 (32.6)     652 (31.9)        
#>      School                  181 ( 8.8)     193 ( 9.4)        
#>   married (%)                                            0.036
#>      Married                1215 (59.4)    1243 (60.7)        
#>      Never.married           402 (19.6)     374 (18.3)        
#>      Previously.married      430 (21.0)     430 (21.0)        
#>   cholesterol (mean (SD)) 176.45 (39.57) 174.62 (38.70)  0.047
#>   diastolicBP (mean (SD))  64.97 (13.65)  64.19 (13.47)  0.057
#>   systolicBP (mean (SD))  118.07 (15.23) 116.19 (17.60)  0.114

All SMDs are less than your specified cut-point? If yes, that indicates that there is good covariate balancing.

Step 4

require(survey)
# Setup the design with survey features
analytic.with.miss$matched <- 0
analytic.with.miss$matched[analytic.with.miss$ID %in% matched.data$ID] <- 1

# Survey setup for full data
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~interview.weight, 
                      data = analytic.with.miss, nest = TRUE)

# Subset matched data
w.design.m <- subset(w.design0, matched == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit <- svyglm(out.formula, design = w.design.m, family = binomial("logit"))
require(jtools)
summ(fit, exp = TRUE, confint = TRUE, digits = 3)
Observations 4094
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.071
Pseudo-R² (McFadden) 0.054
AIC 2336.725
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.165 0.141 0.195 -21.678 0.000
bmiNot overweight 0.245 0.173 0.346 -7.980 0.000
Standard errors: Robust

Double adjustment

library(survey)
# Outcome model with covariates adjustment
fit.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.m, family = binomial("logit"))
summ(fit.DA, exp = TRUE, confint = TRUE, digits = 3)
Observations 4094
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.212
Pseudo-R² (McFadden) 0.166
AIC 2063.466
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.045 0.010 0.199 -4.101 NA
bmiNot overweight 0.235 0.159 0.348 -7.219 NA
genderMale 1.317 1.082 1.604 2.740 NA
age 1.051 1.044 1.059 13.280 NA
raceHispanic 0.896 0.649 1.237 -0.665 NA
raceOther 1.158 0.565 2.374 0.401 NA
raceWhite 0.778 0.507 1.193 -1.152 NA
incomeBetween.25kto54k 0.830 0.551 1.251 -0.889 NA
incomeBetween.55kto99k 0.872 0.621 1.224 -0.794 NA
incomeOver100k 0.669 0.418 1.072 -1.669 NA
educationHigh.School 0.948 0.733 1.225 -0.408 NA
educationSchool 0.872 0.448 1.695 -0.405 NA
marriedNever.married 1.055 0.677 1.646 0.237 NA
marriedPreviously.married 0.849 0.534 1.348 -0.694 NA
cholesterol 0.991 0.986 0.995 -3.810 NA
diastolicBP 1.005 0.992 1.018 0.749 NA
systolicBP 1.003 0.993 1.012 0.559 NA
Standard errors: Robust

Approach by DuGoff et al. (2014)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula2 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP + 
                           psu + strata + interview.weight)

Step 2

# Propensity scores
ps.fit2 <- glm(ps.formula2, data = analytic.data, family = binomial("logit"))
analytic.data$PS2 <- fitted(ps.fit2)

# Caliper fixing to 0.2*sd(logit of PS)
caliper2 <- 0.2*sd(log(analytic.data$PS2/(1-analytic.data$PS2)))

# Match exposed and unexposed subjects 
set.seed(504)
match.obj2 <- matchit(ps.formula2, data = analytic.data,
                     distance = analytic.data$PS2, 
                     method = "nearest", 
                     replace = FALSE,
                     caliper = caliper2, 
                     ratio = 1)
analytic.data$PS2 <- match.obj2$distance

# Extract matched data
matched.data2 <- match.data(match.obj2) 

Step 3

# Balance checking
cov2 <- c("gender", "age", "race", "income", 
          "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP", 
         "psu", "strata", "interview.weight")

tab1m2 <- CreateTableOne(strata = "bmi", vars = cov2, data = matched.data2, test = F)
print(tab1m2, smd = TRUE)
#>                               Stratified by bmi
#>                                Overweight          Not overweight      SMD   
#>   n                                2037                2037                  
#>   gender = Male (%)                 991 (48.6)          985 (48.4)      0.006
#>   age (mean (SD))                 47.10 (17.50)       46.41 (17.72)     0.039
#>   race (%)                                                              0.044
#>      Black                          453 (22.2)          438 (21.5)           
#>      Hispanic                       553 (27.1)          579 (28.4)           
#>      Other                          324 (15.9)          342 (16.8)           
#>      White                          707 (34.7)          678 (33.3)           
#>   income (%)                                                            0.023
#>      <25k                           522 (25.6)          522 (25.6)           
#>      Between.25kto54k               679 (33.3)          662 (32.5)           
#>      Between.55kto99k               471 (23.1)          473 (23.2)           
#>      Over100k                       365 (17.9)          380 (18.7)           
#>   education (%)                                                         0.007
#>      College                       1199 (58.9)         1192 (58.5)           
#>      High.School                    644 (31.6)          648 (31.8)           
#>      School                         194 ( 9.5)          197 ( 9.7)           
#>   married (%)                                                           0.018
#>      Married                       1209 (59.4)         1224 (60.1)           
#>      Never.married                  395 (19.4)          394 (19.3)           
#>      Previously.married             433 (21.3)          419 (20.6)           
#>   cholesterol (mean (SD))        176.73 (39.68)      174.82 (38.77)     0.049
#>   diastolicBP (mean (SD))         65.16 (13.49)       64.39 (13.22)     0.057
#>   systolicBP (mean (SD))         118.27 (15.07)      116.33 (17.57)     0.119
#>   psu (mean (SD))                  1.49 (0.50)         1.49 (0.50)      0.008
#>   strata (mean (SD))             126.39 (4.16)       126.27 (4.29)      0.029
#>   interview.weight (mean (SD)) 38621.02 (34640.92) 37195.99 (36180.33)  0.040

All SMDs are less than your specified cut-point? If yes, that indicates that there is good covariate balancing.

Step 4

# Setup the design with survey features
analytic.with.miss$matched2 <- 0
analytic.with.miss$matched2[analytic.with.miss$ID %in% matched.data2$ID] <- 1

# Survey setup for full data
w.design02 <- svydesign(strata = ~strata, id = ~psu, weights = ~interview.weight, 
                       data = analytic.with.miss, nest = TRUE)

# Subset matched data
w.design.m2 <- subset(w.design02, matched2 == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit2 <- svyglm(out.formula, design = w.design.m2, family = binomial("logit"))
summ(fit2, exp = TRUE, confint = TRUE, digits = 3)
Observations 4074
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.083
Pseudo-R² (McFadden) 0.063
AIC 2322.836
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.174 0.149 0.202 -22.325 0.000
bmiNot overweight 0.221 0.163 0.299 -9.749 0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit2.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.m, family = binomial("logit"))
summ(fit2.DA, exp = TRUE, confint = TRUE, digits = 3)
Observations 4094
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.212
Pseudo-R² (McFadden) 0.166
AIC 2063.466
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.045 0.010 0.199 -4.101 NA
bmiNot overweight 0.235 0.159 0.348 -7.219 NA
genderMale 1.317 1.082 1.604 2.740 NA
age 1.051 1.044 1.059 13.280 NA
raceHispanic 0.896 0.649 1.237 -0.665 NA
raceOther 1.158 0.565 2.374 0.401 NA
raceWhite 0.778 0.507 1.193 -1.152 NA
incomeBetween.25kto54k 0.830 0.551 1.251 -0.889 NA
incomeBetween.55kto99k 0.872 0.621 1.224 -0.794 NA
incomeOver100k 0.669 0.418 1.072 -1.669 NA
educationHigh.School 0.948 0.733 1.225 -0.408 NA
educationSchool 0.872 0.448 1.695 -0.405 NA
marriedNever.married 1.055 0.677 1.646 0.237 NA
marriedPreviously.married 0.849 0.534 1.348 -0.694 NA
cholesterol 0.991 0.986 0.995 -3.810 NA
diastolicBP 1.005 0.992 1.018 0.749 NA
systolicBP 1.003 0.993 1.012 0.559 NA
Standard errors: Robust
Tip

Double adjustment:

Double adjustment in PSM can offer an additional layer of control for confounding and enhance the robustness of treatment effect estimates. The primary goal of double adjustment is to further minimize bias in the estimated treatment effect, especially when the propensity score model may not fully balance all covariates across treatment groups. in this approach, The propensity score is estimated, typically using logistic regression, where the treatment assignment is regressed on observed covariates. Units (e.g., individuals) in the treatment group are matched with units in the control group based on their propensity scores. This aims to create comparable groups and balance observed covariates across treatment and control groups. After matching, an outcome model is fitted to estimate the treatment effect while additionally adjusting for the same covariates used in the propensity score model. This second adjustment in the outcome model is what constitutes the “double” in “double adjustment.”

However, it should be applied thoughtfully, with careful consideration of model specification, covariate selection, and underlying assumptions to ensure valid and reliable results. Always consider the specific context of the study and consult statistical guidelines or experts when applying advanced methods like double adjustment in PSM.

Approach by Austin et al. (2018)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula3 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

# Survey design
require(survey)
analytic.design <- svydesign(id = ~psu, weights = ~interview.weight, strata = ~strata,
                             data = analytic.data, nest = TRUE)

Step 2

# Propensity scores
ps.fit3 <- svyglm(ps.formula3, design = analytic.design, family = binomial("logit"))
analytic.data$PS3 <- fitted(ps.fit3)

# Caliper fixing to 0.2*sd(logit of PS)
caliper3 <- 0.2*sd(log(analytic.data$PS3/(1-analytic.data$PS3)))

# Match exposed and unexposed subjects 
set.seed(504)
match.obj3 <- matchit(ps.formula, data = analytic.data,
                     distance = analytic.data$PS3, 
                     method = "nearest", 
                     replace = FALSE,
                     caliper = caliper3, 
                     ratio = 1)
analytic.data$PS3 <- match.obj3$distance

# Extract matched data
matched.data3 <- match.data(match.obj3) 

Step 3

# Balance checking
cov <- c("gender", "age", "race", "income", 
         "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

tab1m3 <- CreateTableOne(strata = "bmi", vars = cov, data = matched.data3, test = F)
print(tab1m3, smd = TRUE)
#>                          Stratified by bmi
#>                           Overweight     Not overweight SMD   
#>   n                         2029           2029               
#>   gender = Male (%)          917 (45.2)    1009 (49.7)   0.091
#>   age (mean (SD))          47.20 (17.50)  46.26 (17.62)  0.054
#>   race (%)                                               0.071
#>      Black                   443 (21.8)     416 (20.5)        
#>      Hispanic                552 (27.2)     583 (28.7)        
#>      Other                   313 (15.4)     351 (17.3)        
#>      White                   721 (35.5)     679 (33.5)        
#>   income (%)                                             0.067
#>      <25k                    528 (26.0)     506 (24.9)        
#>      Between.25kto54k        688 (33.9)     656 (32.3)        
#>      Between.55kto99k        439 (21.6)     495 (24.4)        
#>      Over100k                374 (18.4)     372 (18.3)        
#>   education (%)                                          0.034
#>      College                1165 (57.4)    1186 (58.5)        
#>      High.School             654 (32.2)     653 (32.2)        
#>      School                  210 (10.3)     190 ( 9.4)        
#>   married (%)                                            0.026
#>      Married                1216 (59.9)    1231 (60.7)        
#>      Never.married           396 (19.5)     375 (18.5)        
#>      Previously.married      417 (20.6)     423 (20.8)        
#>   cholesterol (mean (SD)) 176.62 (38.32) 175.13 (38.65)  0.039
#>   diastolicBP (mean (SD))  65.02 (13.44)  64.48 (13.37)  0.041
#>   systolicBP (mean (SD))  117.98 (15.46) 116.37 (17.62)  0.097

All SMDs are less than your specified cut-point? If yes, that indicates that there is good covariate balancing.

Step 4

# Setup the design with survey features
analytic.with.miss$matched3 <- 0
analytic.with.miss$matched3[analytic.with.miss$ID %in% matched.data3$ID] <- 1

# Survey setup for full data
w.design03 <- svydesign(strata = ~strata, id = ~psu, weights = ~interview.weight, 
                       data = analytic.with.miss, nest = TRUE)

# Subset matched data
w.design.m3 <- subset(w.design03, matched3 == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit3 <- svyglm(out.formula, design = w.design.m3, family = binomial("logit"))
summ(fit3, exp = TRUE, confint = TRUE, digits = 3)
Observations 4058
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.071
Pseudo-R² (McFadden) 0.054
AIC 2364.398
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.169 0.142 0.202 -19.891 0.000
bmiNot overweight 0.244 0.178 0.334 -8.825 0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit3.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.m, family = binomial("logit"))
summ(fit3.DA, exp = TRUE, confint = TRUE, digits = 3)
Observations 4094
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.212
Pseudo-R² (McFadden) 0.166
AIC 2063.466
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.045 0.010 0.199 -4.101 NA
bmiNot overweight 0.235 0.159 0.348 -7.219 NA
genderMale 1.317 1.082 1.604 2.740 NA
age 1.051 1.044 1.059 13.280 NA
raceHispanic 0.896 0.649 1.237 -0.665 NA
raceOther 1.158 0.565 2.374 0.401 NA
raceWhite 0.778 0.507 1.193 -1.152 NA
incomeBetween.25kto54k 0.830 0.551 1.251 -0.889 NA
incomeBetween.55kto99k 0.872 0.621 1.224 -0.794 NA
incomeOver100k 0.669 0.418 1.072 -1.669 NA
educationHigh.School 0.948 0.733 1.225 -0.408 NA
educationSchool 0.872 0.448 1.695 -0.405 NA
marriedNever.married 1.055 0.677 1.646 0.237 NA
marriedPreviously.married 0.849 0.534 1.348 -0.694 NA
cholesterol 0.991 0.986 0.995 -3.810 NA
diastolicBP 1.005 0.992 1.018 0.749 NA
systolicBP 1.003 0.993 1.012 0.559 NA
Standard errors: Robust

References

Austin, Peter C, Nathaniel Jembere, and Maria Chiu. 2018. “Propensity Score Matching and Complex Surveys.” Statistical Methods in Medical Research 27 (4): 1240–57.
DuGoff, Eva H, Megan Schuler, and Elizabeth A Stuart. 2014. “Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys.” Health Services Research 49 (1): 284–303.
Zanutto, Elaine L. 2006. “A Comparison of Propensity Score and Linear Regression Analysis of Complex Survey Data.” Journal of Data Science 4 (1): 67–91.